### Introduction

*information*[1], which is widely studied and used in the field of communication. In information theory, if a pattern is expected to occur frequently based on some prior knowledge or by chance, then an occurrence of that pattern carries less information. Thus, we can use information to test the surprise of an occurrence of a pattern. The

*information gain*is introduced to denote the accumulated information of a pattern in a DNA sequence and is used to exhibit the degree of surprise of the pattern.

*downward closure property*to prune infrequent patterns, which says that if a pattern is infrequent, all of its superpatterns must be infrequent. It suffers from the level-wise difficulty for candidate generation-and-test and needs several database scans for sequential pattern mining. A typical Apriori-like approach such as Generalized Sequential Patterns (GSP) [3] is a good example of this category. An efficient algorithm, PrefixSpan [4], has been proposed, representing projection-based sequential pattern mining. This approach examines only the prefix subsequences and projects only their corresponding postfix subsequences into the databases. Sequential patterns are grown by exploring length-1 frequent patterns in each projected database. Using the projected database, however, every expansion of sequential patterns needs a recursive process, which is not effective for DNA sequence mining. The problem of finding the frequent maximal contiguous pattern from sequences more than two has been introduced in [5-8]. In addition, Yang et al. [9] have proposed an interesting technique to find periodic patterns in a sequence of events. Lu et al. [10] have proposed a pattern discovery algorithm that can identify over-represented patterns inside DNA sequences by introducing a new measurement system.

### Methods

### Concepts and definitions

*S*is an ordered list of DNA alphabets.

*S*is denoted by 〈

*c*,

_{1}*c*, ...,

_{2}*c*〉, where

_{l}*c*∈ Σ and |

_{i}*S*| denotes the length of sequence

*S*. A sequence with length

*n*is called an

*n*-sequence. A sequence database

*D*is a set of tuples 〈

*sid*,

*S*〉, where

*sid*is a sequence ID. The sum of the lengths of all sequences in

*D*is denoted as |

*D*|=|

*S*|+|

_{1}*S*|...|

_{2}*S*|

_{n}* Definition 1 (Patter)*

*pattern*is a contiguous sub-sequence of DNA sequence

*S*drawn from Σ = {A, C, G, T}. A sequence α = 〈a

_{1}, a

_{2}, ..., a

_{n}〉 is called a contiguous sub-sequence of another sequence β = 〈b

_{1}, b

_{2}, ..., b

_{m}〉, and β is a contiguous super-sequence of α, denoted as α⊆β, if there exists integers 1 ≤

*j*≤

_{1}*j*≤ ... ≤

_{2}*j*≤ m and

_{n}*j*=

_{i+1}*j*+ 1 for 1 ≤

_{i}*i*≤

*n-1*such that a

_{1}= b

_{j1}, a

_{2}= b

_{j2}, ..., a

_{n}= b

_{jn}. We can also say that α is

*contained*by β. In our paper, we use the term "(sub)-sequence" to describe "contiguous (sub)-sequence" in brief.

* Definition 2 (Support)*

*P*and a sequence

*S*, the number of occurrences of

*P*in

*S*is called the

*support*of pattern

*P*in sequence

*S*, denoted as

*Sup(P, Si)*. For DNA sequence database

*D*, the support of

*P*in

*D*is defined as .

* Definition 3 (Confidence)*

*P*=

*P*,

_{1}*P*, ...,

_{2}*P*and a DNA sequence database D, the confidence of

_{q}*P*with respect to

_{1}P_{2}*P*is defined as

_{1}*Conf(P*=

_{1}P_{2}, P_{1})*Sup(P*.

_{1}P_{2},D)/Sup(P_{1},D)*D*in Table 1,

*Conf*(AT,A) = 7/10 = 0.7.

* Definition 4 (Pattern probability)*

*P*=

*P*,

_{1}*P*, ...,

_{2}*P*(

_{q}*P*is a DNA alphabet) and a DNA sequence database

_{i}*D*, the

*pattern probability*of

*P*in

*D*is defined as , where

*Pr(P*= # of occurrences of an alphabet

_{i}, D)*P*/|

_{i}*D*|.

*pattern probability*of pattern "ATCG" in Table 1 is

*Pr(ATCG,D)*=

*Pr(A,D)*×

*Pr(T,D)*×

*Pr(C,D)*×

*Pr(G,D)*=

*(10/55)*×

*(18/55)*×

*(12/55)*×

*(15/55)*=

*0.182*×

*0.372*×

*0.218*×

*0.273*=

*0.00403*.

* Definition 5 (Information)*

*D*is defined as

*I(c)*= -

*log*

_{|C|}

*Pr(c,D)*, where |C| is the number of distinct characters in

*D*and

*Pr(c)*is the probability of

*c*occurs in

*D*.

*Pr(A,D)*= # of

*occurence(A)*/|

*D*|. So, the probability of character A is

*Pr(A,D)*=

*10/55*=

*0.182*in our example database. Then, the

*information*of character A in

*D*is,

*I(A)*= -

*log*

_{|C|}

*Pr(A,D)*= -

*log*=

_{4}(0.182)*1.228*.

* Definition 6 (Pattern information)*

*P*=

*P*,

_{1}*P*, ...,

_{2}*P*and a DNA sequence database D, the

_{q}*pattern information*of

*P*in

*D*is defined as

*I(P)*= -

*log*

_{|C|}

*Pr(P,D)*=

*I(P*+

_{1})*I(P*+......+

_{2})*I(P*.

_{q})*pattern information*of pattern "ATCG" is

*I(ATCG,D)*=

*I(A,D)*+

*I(T,D)*+

*I(C,D)*+

*I(G,D)*=

*1.228*+

*0.713*+

*1.098*+

*0.9365*=

*3.9755*in Table 1.

* Definition 7 (Information gain)*

*P*=

*P*,

_{1}*P*, ...,

_{2}*P*and a DNA sequence database

_{q}*D*, the

*pattern information*gain of

*P*in

*D*is defined as

*IG(P)*=

*I(P)*×

*Support(P)*.

*information gain*of pattern "ATCG" is

*IG(ATCG,D)*=

*3.9755*

^{*}5 =

*19.8775*in Table 1.

### Surprising pattern-mining algorithm

*min_in_gain*and

*min_conf*, we consider the new pattern as a next length surprising candidate. The obtained length-5 surprising patterns are 〈ATCGT〉, 〈TCGTG〉, 〈TCGTT〉, 〈CGTGA〉, 〈CATCG〉, and 〈GTGAT〉. From Fig. 2, we see pattern ATCG starts from the 1st and 11th positions of sequence 10, 2nd and 9th positions of sequence 20, and 2nd position of sequence 30.

*min_in_gain*and

*min_conf*thresholds.

*min_in_gain*and

*min_conf*thresholds, our proposed approach consumes less memory and faster execution than Zerin et al. [13].

### Results and Discussion

*min_in_gain*= 35(%) and minimum confidence threshold,

*min_conf*= 30(%). From this experiment, we can see that the proposed approach has efficient improvement over Zerin et al. [13]. When the sequence length becomes longer, it shows better performance in comparison with the existing algorithm.

*min_in_gain*over a real DNA sequence database is shown in Fig. 6. The

*x-axis*in the graph indicates the change in

*min_in_gain*as a percentage of the data point. A tree with a fixed length-10 was constructed using the aforementioned real datasets, and

*min_conf*value 0.35 is taken. Fig. 6 indicates that for increasing the value of

*min_in_gain*for both approaches, fewer candidates are generated and less memory is required.

*min_in_gain*= 35% and

*min_conf*= 30%. On the other hand, we performed the same experiment with real DNA sequence datasets, where we used the value of

*min_in_gain*= 45% and

*min_conf*= 40%, which is shown in Fig. 7B. From Fig. 7, we can see that our proposed approach could mine the surprising contiguous patterns within a reasonable computation cost.

*min_conf*0.3 and 0.4 for the random and real datasets, respectively. Fig. 8 indicates that increasing the information gain threshold decreases mining time for both random and real datasets.

*min_in_gain*0.3 and 0.4 for random and real datasets, respectively. Fig. 9 shows our proposed performance with different values for minimum confidence and illustrates a non-linear effect. The greatest change along the confidence axis is from 0.2 to 0.3 for random datasets and 0.25 to 0.35 real datasets. In most cases, the characters are evenly distributed. This means that the A, C, G, and T occur at almost the same ratio in the dataset as the frequency of each character, which is approximately 25%. So, if the

*min_conf*is set to 0.3, most random patterns will be filtered out, and real patterns occurring more frequently than 25% of the time will survive and be extended.