### Introduction

*de novo* the major combinatorial and spatial patterns of marks [1, 2]. The 15-chromatin-state model of the ENCODE Project is publicly available through 9 Browser Extensible Data (BED) files [3]. Since large-scale epigenetic datasets such as ENCODE became publicly available, there has been growing interest in predicting the function of non-coding DNA regions directly from sequence by utilizing these large-scale ChromHMM annotations [4–7].

*dominant* chromatin state for each 200-bp unit. We then rebuilt the Markov chains by iteratively analyzing the *variability count* of the chromatin states of each 200-bp unit. By eliminating the highly variable 200-bp units, we finally analyzed, in our simulation studies, the active chromatin states that showed a strong Markov property.

### Methods

### Combining 9 BED files into a single file

*chrom* (name of the chromosome), *chromStart* (starting position of the feature in the chromosome), *chromEnd* (ending position of the feature in the chromosome), and *state* (15 chromatin states, numbered from 1 to 15).
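
As an illustration of how rows with these four columns could be merged, the sketch below tallies, for every 200-bp unit, how often each state occurs across the input files. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, the *state* column is assumed to hold a plain integer, and feature boundaries are assumed to fall on the 200-bp grid.

```python
from collections import Counter, defaultdict

UNIT = 200  # unit width in bp, as used throughout the text


def parse_bed_line(line):
    """Split one BED row into (chrom, chromStart, chromEnd, state).

    Assumption: the fourth column is an integer state label (1 to 15).
    """
    chrom, start, end, state = line.split()[:4]
    return chrom, int(start), int(end), int(state)


def combine_bed_rows(files):
    """Tally state occurrences per 200-bp unit across all input files.

    `files` is an iterable of iterables of BED lines, one inner iterable
    per epigenome. Returns {(chrom, unit_start): Counter({state: count})}.
    """
    counts = defaultdict(Counter)
    for rows in files:
        for line in rows:
            # Skip blank lines and BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end, state = parse_bed_line(line)
            # A feature may span several consecutive 200-bp units.
            for unit_start in range(start, end, UNIT):
                counts[(chrom, unit_start)][state] += 1
    return counts
```

Each inner iterable can be an open file handle, so the 9 BED files could be passed as a list of handles. The resulting per-unit counters are the kind of input a per-unit summary across epigenomes needs.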

### Filtering out highly variable 200-bp units

*variability count* of the chromatin states of a given 200-bp unit as the number of states with a non-zero occurrence count, to define and compare the observed consistency of each chromatin state at any given genomic position across all 9 epigenomes. For example, the chromatin state variability count of the chr1: 12,800–13,000 block in Fig. 2 would be four, as there are four states with non-zero counts (i.e., 7, 9, 10, and 11), whereas the chromatin state variability count of the chr1: 10,200–10,400 block in Fig. 2 would be one, as there is only one state with a non-zero count.
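
The variability count and the filtering step can be sketched in a few lines. This is only an illustration: the per-state occurrence counts below are invented (only the set of non-zero states for chr1: 12,800–13,000 comes from the text), the single state assumed for chr1: 10,200–10,400 is a placeholder, and the cutoff parameter `max_variability` is hypothetical because no threshold is stated here.

```python
def variability_count(state_counts):
    """Number of states whose occurrence count is non-zero."""
    return sum(1 for n in state_counts.values() if n > 0)


def filter_units(units, max_variability):
    """Keep only units whose variability count is at most `max_variability`.

    `max_variability` is a hypothetical cutoff chosen by the caller.
    """
    return {
        key: counts
        for key, counts in units.items()
        if variability_count(counts) <= max_variability
    }


# Invented occurrence counts across 9 epigenomes (illustration only).
units = {
    ("chr1", 12800): {7: 2, 9: 3, 10: 1, 11: 3},  # four non-zero states
    ("chr1", 10200): {15: 9},                     # one non-zero state (placeholder)
}

for key, counts in units.items():
    print(key, variability_count(counts))
```

With a cutoff of 1, `filter_units(units, 1)` would retain only the second unit, since the first has four non-zero states.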

### Building fifth-order Markov models

$P(X_{n-5}, X_{n-4}, X_{n-3}, X_{n-2}, X_{n-1})$ for 4,096 components as well as a matrix of transitional probabilities $P(X_n \mid X_{n-5}, X_{n-4}, X_{n-3}, X_{n-2}, X_{n-1})$ with a size of 4,096 × 4. These tables were used to build a global Markov chain classifier to explore and rank sub-optimal predictions of the chromatin states. Based on the nucleotide frequency profiles, given a random sequence $x_1, x_2, \cdots, x_{200}$ in the state of a cell line, we compared sequences $\pi_1, \pi_2, \cdots, \pi_{200}$ of chromatin states that maximized the following probability of the initial 15 Markov chain models, where $a_{\pi_i \pi_{i+1}}$ is a transition probability:

*variability count* of the chromatin states of a given 200-bp unit, and by eliminating the highly variable 200-bp units in training.
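
A table of conditional probabilities of this kind can be estimated by counting how often each nucleotide follows each context in the training units. The sketch below is one possible way to do that, not the authors' implementation: the function name is hypothetical, and the additive pseudocount is an assumption introduced here so that unseen contexts do not produce zero probabilities. The table has 4^order rows and 4 columns (1,024 × 4 for order 5, 4,096 × 4 for order 6), so the order is left as a parameter.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def transition_table(sequences, order=5, pseudocount=1.0):
    """Estimate P(X_n | previous `order` nucleotides) from training sequences.

    Returns {context: {nucleotide: probability}} with one row per possible
    context. `pseudocount` is an assumed smoothing constant, not a value
    taken from the text.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, nxt = seq[i - order:i], seq[i]
            # Ignore windows containing ambiguous bases such as N.
            if nxt in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][nxt] += 1

    table = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        row = counts[context]
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        table[context] = {b: (row[b] + pseudocount) / total for b in ALPHABET}
    return table
```

One such table would be built per chromatin state from the 200-bp units assigned to that state, after the highly variable units have been removed from training.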

*Promoter*, *Enhancer*, *Insulator*, *Transition*, *Repressed*, and *Inactive*. Our final transition tables for the *Promoter*, *Enhancer*, *Insulator*, *Transition*, and *Repressed* states (excluding inactive states) were built from 121,500, 701,636, 89,844, 4,023,295, and 155,411 200-bp units, respectively. As these Markov chains could be used as a Naive Bayes classifier, we scored the sequence of each 200-bp unit under each Markov model and took the state whose model gave the maximum probability as the prediction. We defined a correctly predicted unit as one in which the predicted result matched one of the dominant chromatin states in the same broad state.
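
The classification step can be illustrated as a maximum-likelihood choice among per-state Markov models. The sketch below assumes one transition table per broad state in the form `{context: {nucleotide: probability}}`, equal prior weight for every state, and scoring in log space to avoid numerical underflow over 200 positions; these choices and all names are assumptions made for illustration. The two toy first-order models are invented and do not reflect real promoter or enhancer statistics.

```python
import math


def log_likelihood(seq, table, order):
    """Sum of log P(x_i | previous `order` bases) over the sequence."""
    total = 0.0
    for i in range(order, len(seq)):
        total += math.log(table[seq[i - order:i]][seq[i]])
    return total


def rank_states(seq, models, order):
    """Rank states from most to least likely for `seq`.

    Returns a list of (state, log_likelihood) pairs, best first, so that
    sub-optimal predictions can be inspected as well as the top one.
    """
    scores = {
        state: log_likelihood(seq, table, order)
        for state, table in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented first-order toy models (illustration only).
gc_rich = {c: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1} for c in "ACGT"}
at_rich = {c: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4} for c in "ACGT"}
models = {"Promoter": gc_rich, "Enhancer": at_rich}

ranking = rank_states("GCGCGCGC", models, order=1)
print(ranking[0][0])
```

Returning the full ranking rather than only the top state keeps the second-best and later candidates available for comparison against the dominant states of a unit.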

### Results

*Promoter* states, 37.95% precision for *Transcribed* states, and 59.82% for *Enhancer* states. These percentages were obtained by dividing the number of 200-bp units whose dominant state was predicted correctly by the number of all testing units in the same broad group.
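
The ratio described above can be written out as a short calculation. The sketch follows the definition given in the text (correctly predicted units divided by all testing units in the same broad group); the function name and the toy input pairs are hypothetical, and no real results are reproduced.

```python
from collections import Counter


def per_group_precision(units):
    """Fraction of correctly predicted units within each broad group.

    `units` is an iterable of (true_group, predicted_group) pairs,
    one pair per testing unit.
    """
    total, correct = Counter(), Counter()
    for true_group, predicted_group in units:
        total[true_group] += 1
        if predicted_group == true_group:
            correct[true_group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy input (illustration only).
toy_units = [
    ("Promoter", "Promoter"),
    ("Promoter", "Enhancer"),
    ("Enhancer", "Enhancer"),
    ("Enhancer", "Enhancer"),
]
print(per_group_precision(toy_units))
```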

*Promoter* states showed a reasonable Markov property, the *Repressed* state did not seem to display the Markov property, and those units related to the *Enhancer* states (states 4, 5, 6, and 7) were the most tissue specific, whereas those related to the *Transcription* states (states 9, 10, and 11) were highly constitutive.

### Discussion

*dominant* state yet. Therefore, our study should be considered only from a computational perspective and is thus preliminary work. Still, it is important to note that we used only the DNA sequences contained in the epigenetic datasets in modeling the Markov chains. Our study showed that once a dominant state is assigned to each 200-bp unit, a generalizable Markov framework can be achieved. Based on this framework, we showed that some subsets of the active chromatin states possessed a strong Markov property. We are currently investigating the overall co-occurrence of the 200-bp chromatin states for ENCODE ChromHMM datasets together with Roadmap Genomics datasets [11].