### Introduction

### Methods

### Concepts and definitions

*S*is an ordered list of DNA alphabets.

*S*is denoted by 〈

*c*,

_{1}*c*...

_{2}*c*〉 where

_{l}*c*∈ Σ and |

_{i}*S*| denotes the length of sequence

*S*. A sequence with length

*n*is called an

*n*-sequence. DNA sequence database

*D*is a set of tuples 〈

*Sid*,

*S*〉 where

*Sid*is a sequence identifier and S is the corresponding sequence.

_{1}, a

_{2},..., a

_{n}〉 is called a contiguous sub-sequence of another sequence β = 〈b

_{1}, b

_{2}, ..., b

_{m}〉, and β is a contiguous super-sequence of α, denoted as α⊆β, if there exists integers 1 ≤

*j*≤

_{1}*j*≤ ... ≤

_{2}*j*≤ m and

_{n}*j*=

_{i+1}*j*+ 1 for 1 ≤

_{i}*i*≤

*n-1*such that a

_{1}= b

_{j1}, a

_{2}= b

_{j2}, ..., a

_{n}= b

_{jn}. We can also say that α is

*contained*by β. In our paper, we use the term "(sub)-sequence" to describe a "contiguous (sub)-sequence" in brief. A contiguous frequent sub-sequence

*X*is said to be maximal if none of its super-sequences

*Y*is frequent.

*D*and a minimum support threshold δ, the problem of maximal contiguous frequent sub-sequence mining is to find the complete set of maximal contiguous frequent patterns from that database. For example, let our running DNA sequence database be D in Table 1, and the minimum support threshold δ ≥ 2. Sequence 〈ATCGTGACT〉 is 9-sequence since its length is 9. Sequence 〈ATCG〉 is a contiguous frequent sub-sequence because it is contained by sequences 10, 20, and 30. S = 〈CGTGATT〉 is a frequent contiguous sub-sequence of length 7 since both sequences 40 and 50 contain it. Moreover, it is one of the maximal frequent contiguous sub-sequences, because there is no contiguous frequent super-sequence of 〈CGTGATT〉 with a minimum support of 2.

* Definition (Projection)*

* Definition (Projected database)*

* Lemma (Projected database)*

* Proof*

### Maximal contiguous frequent suffix tree algorithm

### An illustrative example

### MCFS algorithm on extra large DNA sequence database

_{1}and D

_{2}. Sequences 10, 20, and 30 are in D

_{1}, and 40 and 50 are in D

_{2}. Suffix ACT occurs once in D1, so it is not frequent. According to them [12, 13], we have to discard it for local support pruning. Another copy of ACT is in D

_{2}(ID-50), so we discard it from the partition as well. If we consider globally, then ACT will be a frequent pattern, if the minimum support threshold is 2. In this way, many frequent patterns can be lost by partitioning.

_{1}, D

_{2}, ..., D

_{n}are the partitions of the original database. F

_{i}is the contiguous frequent patterns, and IF

_{i}is the infrequent contiguous patterns from each partition. CFP is contiguous frequent patterns, and MCFP

_{i}is maximal contiguous frequent patterns. F

_{i}, IF

_{i}, CFP, and MCFP

_{i}are all stored on the disk.

### Results and Discussion

*Homo sapiens*GRCh37.64 DNA Chromosome Part 1, 2, 3) and 'Bacteria DNA sequence dataset' were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/nuccore/). The human genome database contains 112,000 sequences, with sequence length 60. The bacteria dataset consists of 20,000 sequences, with sequence length 1,040.

*Homo sapiens*GRCh37.64 DNA Chromosome Part 1, 2, and 3. Part 2 has 98,000 sequences and a length of 60, and part 3 has 105,000 sequences with sequence length of 60.

*Homo sapiens*GRCh37.64 DNA Chromosome Parts 1, 2, and 3. We assume that Parts 1, 2, and 3 are partitioned and stored on the disk. With various settings of minimum support threshold, we measured the run-time performance (Fig. 8).