The non-coding DNA in eukaryotic genomes encodes a language which programs chromatin accessibility, transcription factor binding, and various other activities. The objective of this short report was to determine the impact of primary DNA sequence on the epigenomic landscape across 200-base pair genomic units by integrating nine publicly available ChromHMM Browser Extensible Data files of the Encyclopedia of DNA Elements (ENCODE) project. The nucleotide frequency profiles of nine chromatin annotations with the units of 200 bp were analyzed and integrative Markov chains were built to detect the Markov properties of the DNA sequences in some of the active chromatin states of different ChromHMM regions. Our aim was to identify the possible relationship between DNA sequences and the newly built chromatin states based on the integrated ChromHMM datasets of different cells and tissue types.
In 2011, the Encyclopedia of DNA Elements (ENCODE) consortium released the ChromHMM chromatin state annotations for 9 consolidated epigenomes, where ChromHMM is software developed by ENCODE labs, to integrate multiple chromatin datasets of various histone modifications to discover
On the other hand, many researchers have shown that formal language theory is an appropriate tool in analyzing various biological sequences [
Our simulation studies showed that some of these chromatin states possessed strong Markov properties of DNA sequences, and could even be predicted by the naïve Bayesian classifier. However, our model could have been biased, as our n-gram analyses were conducted only on two of the cell lines.
Thus, as a follow-up to our preliminary study on ENCODE datasets [
A generalizable framework can be achieved through statistically-justifiable models. We downloaded BED files from ENCODE and combined all the annotations spread out through 9 different BED files, into a single integrated BED file. Based on the newly integrated BED file, we assigned a
When making 15-state ChromHMM BED files, the ENCODE consortium uses a core set of 9 chromatin markers [
The ENCODE consortium released a 15-state model BED file from an analysis of consolidated epigenomes, resulting in a total of 9 epigenomes for public download in ChromHMM BED files [
For example, the chromatin state of Gm12878 shown at the bottom of
We attempted to build comparative nucleotide frequency profiles to detect their Markov property. Thus, it became critical to devise a functional annotation framework that can be generalized to different cell types. To design good predictive models in building the Markov chain atlas of the human genome, we modified the original BED files by dissecting the ChromHMM blocks in each BED file into 200-bp units. When the size of a dissected unit near the ChromHMM boundary is less than 150-bp, we discarded the unit, whereas when the size of dissected unit was greater than 150-bp, we rounded it up to a 200-bp unit.
For example, the original Gm12878 block in
Dissecting the blocks uniformly made it possible to combine all the annotations spread out through 9 different BED files into a single integrated BED file [
After integrating the BED files, we defined the
We could then use variability count statistics and maximum likelihood decision rule to create an optimal classification Markov model, as uniform priors can be assumed if 200-bp units with stable chromatin state are used. In this way, the highly variable 200-bp units, with which different chromatin states were frequently switched to other states across different tissues and cell types, could be eliminated in the training Markov transition tables. Our rationale behind this was that, compared to the highly variable 200-bp units, the 200-bp units that are less frequently changed would show a strong Markov property.
When the human genome was dissected into a 200-bp unit, there were originally 14,148,124 units (see
This means that most of these 200-bp units have strong preferences for certain dominant chromatin state, where a dominant state of a 200-bp unit is the most frequently annotated chromatin state among the 15 chromatin states. This provided good heuristic insight for designing new Markov models for our study. Thus, it was possible to assign only one or two dominant chromatin states for most of the 200-bp units of the entire human genome.
After we assigned a dominant chromatin state for each 200-bp unit, frequency counts were used to build fifteen initial transition tables for the fifth order Markov models [
By trial and error, we rebuilt newer Markov chains by iteratively analyzing the
By this process, we found that some inactive chromatin states were highly constitutive and marked in most of the 9 epigenomes. For example, state 13 (Hetero_Chromatin state), which covered on average 70.48% of each reference epigenome, was excluded when considering the variability count of the chromatin states. We also excluded units in which a transcribed state showed both promoter and enhancer signatures. Mostly, we profiled each 200-bp with chromatin states and built new transition tables by training the 200-bp blocks with a chromatin variability of less than 2 (and containing at least one active state).
These fifteen chromatin states were then merged into six broad states:
As a means to proving Markov property, we directly investigated whether our sequence-based Markov chain models for each chromatin state have the discriminating power necessary to identify different chromatin states.
The samples were stratified according to chromosomes into strictly non-overlapping training and testing sets. A total of 6,334,977 200-bp units were trained, and 703,886 200-bp units were tested for prediction accuracy. At this time, reverse complements of sequences were not considered when building the Markov models, since the backward Markov chain could show similar properties for solutions.
By estimating the prediction accuracy of chromatin states, we infer that the
In this short report, we did not provide any interpretable biological meanings for our statistically defined
Chromatin states of 9 cell lines from chr21: 33,031,600 to chr21: 33,041,600, shown in University of California Santa Cruz (UCSC) genome browser (GRCh37/hg19): the 15 chromatin states shown in the 4th field are numbered and abbreviated as: 1_Active_Promoter, 2_Weak_Promoter, 3_Poised_Promoter, 4_Strong_Enhancer, 5_Strong_Enhancer, 6_Weak_Enhancer, 7_Weak_Enhancer, 8_Insulator, 9_Txn_Transition, 10_Txn_Elongation, 11_Weak_Txn, 12_Repressed, 13_Heterochrom/lo, 14_Repetitive/CNV, and 15_Repetitive/CNV. These are the probabilistic categories based solely on the nine chromatin marks [
Combining the 9 Browser Extensible Data (BED) files into an integrated single file: the annotations are contained in nine separate BED files: embryonic stem cells (H1hesc. bed), erythrocytic leukaemia cells (K562.bed), B-lymphoblastoid cells (Gm12878.bed), hepatocellular carcinoma cells (Hepg2.bed), umbilical vein endothelial cells (Huvec.bed), skeletal muscle myoblasts (Hsmm.bed), normal lung fibroblasts (Nhlf.bed), normal epidermal keratinocytes (Nhlf. bed), and mammary epithelial cells (Hmec.bed) [
Flowchart of building Markov chains by iteratively eliminating highly variable 200-bp units.
Statistical distribution of variability counts of each chromatin state of 200-bp units
Variability counts | Annotation frequencies |
---|---|
1 | 5,721,116 |
2 | 4,557,108 |
3 | 2,534,396 |
4 | 934,644 |
5 | 311,234 |
6 | 77,050 |
7 | 11,658 |
8 | 890 |
9 | 28 |
Prediction accuracy of newly built transition tables of six broad states by analyzing the variability of the chromatin states of 9 BED files
Broad chromatin states | Chromatin states | No. of training units | No. of testing units | Prediction accuracy (%) for unit variability ≤ 2 |
---|---|---|---|---|
Promoter state | 1_Active_Promoter | 66,513 | 7,390 | 59.42 |
2_Weak_Promoter | 41,279 | 4,587 | 37.74 | |
3_Poised_Promoter | 13,708 | 1,523 | 66.57 | |
| ||||
Enhancer state | 4_Strong_Enhancer | 53,192 | 5,910 | 60.49 |
5_Strong_Enhancer | 144,691 | 16,077 | 62.90 | |
6_Weak_Enhancer | 140,044 | 15,560 | 61.38 | |
7_Weak_Enhancer | 363,710 | 40,412 | 57.90 | |
| ||||
Insulator state | 8_Insulator | 898,44 | 9,983 | 27.37 |
| ||||
Transition state | 9_Txn_Transition | 40,417 | 4,491 | 26.84 |
10_Txn_Elongation | 552,758 | 61,418 | 35.31 | |
11_Weak_Txn | 3,430,120 | 381,124 | 38.51 | |
| ||||
Repressed state | 12_Repressed | 1,398,701 | 155,411 | 2.30 |
| ||||
Inactive state | 13_Heterochrom/lo | NA | NA | NA |
14_Repetitive/CNV | NA | NA | NA | |
15_Repetitive/CNV | NA | NA | NA |
BED, Browser Extensible Data; NA, not available.