Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. So far, in most approaches, the number of occurrences is a major measure of determining whether a pattern is interesting or not. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interesting measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time.

Biological sequences, such as DNA sequences, have a great number of contiguous patterns consisting of frequent items. Which patterns are interesting to biologists? Is a pattern that occurs more frequently more interesting? So far, in most approaches, the number of occurrences has been a major measure of determining an interesting pattern. This measure, however, is not enough to discriminate a pattern from the background noise and may induce much time to spend for checking patterns of no biological meaning. Researchers are more interested in contiguous patterns that are statistically significant than those simply occurring frequently. The aim of mining interesting patterns is to analyze the important biological functions hidden in the extremely large amounts of genomic sequences. In this work, we aimed to discover surprising contiguous patterns that occur at a frequency higher than their expected frequencies. To find such surprising patterns with confidence, we chose to use a more suitable measure,

Many works for sequential pattern mining take an a priori approach, such as Agrawal and Srikant [

Another efficient algorithm, MacosVSpan [

Recently, Zerin et al. [

Let Σ = {A, C, G, T} be a set of DNA alphabets, where A, C, G, and T are called DNA characters or bases. A DNA sequence _{1}_{2}_{l}_{i}_{1}_{2}_{n}

A _{1}, a_{2}, ..., a_{n}〉 is called a contiguous sub-sequence of another sequence β = 〈b_{1}, b_{2}, ..., b_{m}〉, and β is a contiguous super-sequence of α, denoted as α⊆β, if there exists integers 1 ≤ _{1}_{2}_{n}_{i+1}_{i}_{1} = b_{j1}, a_{2} = b_{j2}, ..., a_{n} = b_{jn}. We can also say that α is

Given a pattern

Given a pattern _{1}_{2}_{q}_{1}P_{2}_{1}_{1}P_{2}, P_{1})_{1}P_{2},D)/Sup(P_{1},D)

For example, the character "A" occurs 10 times and "AT" occurs 7 times, and in database

Given a pattern _{1}_{2}_{q}_{i}_{i}, D)_{i}

For example, the

The information carried by a DNA character or base in DNA sequence database _{|C|}

For example, the occurrence probability of character A in _{|C|}_{4}(0.182)

Given a pattern _{1}_{2}_{q}_{|C|}_{1})_{2})_{q})

For example, the

Given a pattern _{1}_{2}_{q}

For example, the ^{*} 5 =

Given a sequence database

The method for mining surprising contiguous patterns from a DNA sequence database proceeds as follows. For constructing the fixed length spanning tree, we follow the method suggested by Kang et al. [

The second step generates fixed-length patterns from the constructed spanning tree with satisfying the minimum information gain threshold and minimum confidence threshold, and expands sequences by joining candidates and checking the frequency and starting position of the candidates. At the time of joining the sequences with the same length to generate the candidates of the next length, it checks whether the second candidate starts right to the next position of the first one with the same ID or not. If so, it increases the frequency counter by one. Finally, it checks the minimum information gain and minimum confidence threshold. If the pattern satisfies both thresholds, then it is added to the next-length candidate pattern. This process is recursively followed for all the candidates with the same length to generate the next-length surprising candidates.

The spanning tree is shown in

To generate the length-5 surprising contiguous patterns, we need to join the length-4 surprising patterns; all the items in the first pattern, excluding the first item, should be same as all the items of the second pattern, excluding the last item. The frequency counter of the next length pattern will increase if both of the patterns are present in the same sequence and the second pattern's starting positions are the right next to the position of the first pattern in the same sequence. If the generated patterns satisfy the

The pattern TCGT starts from 3rd and 10th positions of sequence 20, 3rd position of sequence 30, and 1st position of sequence 40. As a result, the length-5 candidate ATCGT starts from the 2nd and 9th positions of sequence 20 and the 2nd position of sequence 30. So, the information gain of ATCGT is 14.065, and the confidence of ATCGT with respect to ATCG is 0.6, which satisfy both

Following the same process, the obtained length-6 surprising patterns are 〈CGTGAT〉, 〈CATCGT〉, and 〈TCGTGA〉, and the obtained length-7 surprising pattern is 〈TCGTGAT〉. Here, we scan the whole database only once to construct the fixed-length spanning tree and then never scan the database again. On the other hand, using an index-based method and

We compare the performance of the proposed approach with Zerin et al. [

In the first experiment, we compare the memory usage of our approach and Zerin et al. [

The memory consumption of our proposed approach and [

The second experiment examined mining time performance according to change in sequence length.

The third experiment shows the effect of information gain threshold on mining time to find out the surprising contiguous patterns up to length 8. In this experiment, we take

The fourth experiment studied the impacts of varying minimum confidence from 0.2 to 0.5 for random datasets and 0.25 to 0.55 for real datasets. This time, we take

To summarize, we have developed an index-based method, where we need to scan the database only once to mine the surprising contiguous patterns in biological data sequences, which are considered very important in bioinformatics and computational biology. Our aim is not to discover the patterns that occur often but rather to find patterns that are surprising by introducing a new threshold information gain. It has been shown by the experimental results that the proposed approach is very efficient in finding interesting patterns within a reasonable computation cost. For future work, we will try to optimize the proposed approach by considering a variety of environments with different parameters and also consider promoting new measurement parameters, which is very feasible for describing the sequences in a biological substance.

This work was supported by the National Research Foundation (NRF) grant (No. 2011-0018264) of the Ministry of Education, Science and Technology (MEST) of Korea.

Fixed length scanning method.

Surprising contiguous pattern mining algorithm.

Index-based fixed-length spanning tree.

Mining length-4, length-5, length-6, and length-7 surprising patterns.

Memory usage w.r.t. change of sequence length.

Memory usage w.r.t. change of

(A, B) Impact of pattern length in mining time.

(A, B) Impact of information gain threshold on mining time.

(A, B) Impact of confidence threshold on mining time.

Example of a DNA sequence database

ID | Sequence |
---|---|

10 | ATCGGTGACTATCG |

20 | CATCGTTCATCG |

30 | CATCGTGAAGT |

40 | TCGTGATTG |

50 | GCGTGATTC |