### Introduction

### HMMs and Their Design Issues

$a_{ij}$ is a two-dimensional matrix of transition probabilities of moving from state $i$ to state $j$; and $e_i(x)$ is an $n \times m$ matrix of emission probabilities of generating symbol $x$ in state $i$. The key property of a Markov chain is that the probability of each symbol $x_i$ depends only on the value of the preceding symbol $x_{i-1}$ [i.e., $P(x_i \mid x_{i-1})$], not on the entire previous sequence [i.e., $P(x_i \mid x_{i-1}, \ldots, x_1)$].
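A consequence of this first-order property is that the joint probability of a whole sequence factorizes into one-step conditionals. As a brief illustration (the sequence length $L$ is notation introduced here for convenience, not taken from the original text):

```latex
P(x_1, \ldots, x_L) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1})
```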

$a_{ij}$ and emission probability $e_i(j)$, as in Fig. 2. The graph defines the topology of the model, while the emission and the transition probabilities define the parameters of the model. The given HMM tries to capture the statistical differences in the two hidden states of 'M' and 'U.' The transition probability represents the change of the methylation state in the underlying Markov chain. According to Fig. 2A, there is a 20% chance of moving from state 'M' to state 'U' ($a_{MU}$), an 80% chance of staying in state 'M' ($a_{MM}$), a 10% chance of moving from state 'U' to state 'M' ($a_{UM}$), and a 90% chance of staying in 'U' ($a_{UU}$). The probabilities of starting from M and U are 60% and 40%, respectively. M and U use different sets of emission probabilities to reflect the symbol statistics. In epigenetic studies, emission symbols are not nucleotide or amino acid sequences. Rather, emission symbols are usually defined as a value or even a vector of chromatin marks.
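The transition and start probabilities quoted above are enough to score a hidden-state path on their own. The following is a minimal sketch that uses only those numbers; the function names `path_probability` and `stationary_distribution` are illustrative choices, and emission probabilities are left out because their values appear only in Fig. 2.

```python
# Start and transition probabilities as quoted in the text for Fig. 2A.
START = {"M": 0.6, "U": 0.4}
TRANS = {
    "M": {"M": 0.8, "U": 0.2},
    "U": {"M": 0.1, "U": 0.9},
}


def path_probability(path):
    """Probability of a hidden-state path under the Markov chain alone.

    Emissions are ignored here; only the start and transition terms are used.
    """
    if not path:
        raise ValueError("path must contain at least one state")
    prob = START[path[0]]
    for prev, curr in zip(path, path[1:]):
        prob *= TRANS[prev][curr]
    return prob


def stationary_distribution():
    """Long-run state frequencies of the two-state chain, in closed form."""
    a_mu = TRANS["M"]["U"]
    a_um = TRANS["U"]["M"]
    total = a_mu + a_um
    return {"M": a_um / total, "U": a_mu / total}


if __name__ == "__main__":
    print(path_probability("MMUU"))
    print(stationary_distribution())
```

For example, the path M, M, U, U has probability 0.6 × 0.8 × 0.2 × 0.9 = 0.0864. In the long run the chain spends about one third of its time in 'M' and two thirds in 'U'.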

_{t}∈ {M, U}). The arrows in the diagram denote conditional dependencies. Unlike an ergodic HMM, the chain may not go backwards while traversing the trellis.

### Different HMM Designs for Identifying DNA Methylation Patterns

### Two-State HMMs to Differentiate Non-Enriched Genomic Regions from Enriched Ones

$N(\mu_i + 2\sigma_i, (1.5\sigma_i)^2)$ for the ChIP-enriched state and $N(\mu_i, \sigma_i^2)$ for the nonenriched state (where $\mu_i$ and $\sigma_i$ are the mean and standard deviation, respectively, of probe $i$). The parameters were based on previous results with the Affymetrix SNP arrays [20].
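To make the two Gaussian emission densities concrete, the sketch below compares them for a single probe measurement. It assumes that the probe's mean and standard deviation are already known. The function names `normal_logpdf` and `emission_log_odds` are illustrative and do not come from the cited work.

```python
import math


def normal_logpdf(x, mean, std):
    """Log density of N(mean, std**2) evaluated at x."""
    if std <= 0.0:
        raise ValueError("std must be positive")
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def emission_log_odds(x, mu_i, sigma_i):
    """Log of P(x | enriched) / P(x | nonenriched) for one probe.

    Enriched state:    N(mu_i + 2 * sigma_i, (1.5 * sigma_i)**2)
    Nonenriched state: N(mu_i, sigma_i**2)
    """
    enriched = normal_logpdf(x, mu_i + 2.0 * sigma_i, 1.5 * sigma_i)
    nonenriched = normal_logpdf(x, mu_i, sigma_i)
    return enriched - nonenriched
```

A positive log-odds value means the measurement is better explained by the enriched state; a value at the probe mean gives negative log-odds, and a value two standard deviations above it gives positive log-odds. In a full HMM these emission terms would be combined with the transition probabilities rather than thresholded on their own.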

### Three-State HMMs for ChIP Analysis

$\alpha_0$ if $1/\tau \le p_{1,i}/p_{2,i} \le \tau$; $\alpha_1$ if $p_{1,i}/p_{2,i} > \tau$; and $\alpha_2$ if $p_{1,i}/p_{2,i} < 1/\tau$. (For a region of $k$ bins, the notation $x_{1,i}$, $x_{2,i}$ was used for the ChIP fragment counts in $L_1$ and $L_2$, and the notation $p_{1,i}$, $p_{2,i}$ was used for the intensity in $L_1$ and $L_2$, respectively, at the $i$th bin in that region.) The initial state is fixed to take the value of $\alpha_0$, since the region is assumed to start from the genomic locations where the histone modification is depleted in both libraries. The emission probability $P(x_{1,i}, x_{2,i} \mid s_i)$ was calculated as in Fig. 4 by integrating $p_{1,i}$ and $p_{2,i}$ over all possible values constrained by $s_i$. The transition probability table was trained using the Baum-Welch algorithm [21].
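The three-state definition can be written as a small decision rule on the intensity ratio. The sketch below only illustrates that rule for given intensities; in the model itself the intensities are unobserved and are integrated out. The function name `classify_bin`, the string labels, and the requirement that τ be at least 1 are assumptions made for this example.

```python
def classify_bin(p1, p2, tau):
    """Return the state label for one bin from the intensity ratio p1 / p2.

    'a0': 1/tau <= p1/p2 <= tau  (no differential enrichment)
    'a1': p1/p2 > tau            (enriched in the first library)
    'a2': p1/p2 < 1/tau          (enriched in the second library)
    """
    if tau < 1.0:
        # Assumption: tau >= 1, so that the interval [1/tau, tau] is non-empty.
        raise ValueError("tau is assumed to be at least 1")
    if p1 <= 0.0 or p2 <= 0.0:
        # Assumption: intensities are strictly positive so the ratio is defined.
        raise ValueError("intensities are assumed to be positive")
    ratio = p1 / p2
    if ratio > tau:
        return "a1"
    if ratio < 1.0 / tau:
        return "a2"
    return "a0"
```

With τ = 2, a bin with intensities (3, 1) falls in the first differential state, (1, 3) in the second, and (1, 1) in the non-differential state.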

### Multiple-State and Multivariate HMMs for Analyzing Systematic State Dynamics of Human Cells

$X_1 = \{x_{1,1}, x_{1,2}, \ldots, x_{1,m}\}$, where $x_{i,j}$ is the fragment count at the $j$th bin in $L_i$ and $m$ is the number of bins. As input, it receives a list of aligned reads for each chromatin mark, which are automatically converted into presence or absence calls for each mark across the genome. Fig. 5B shows that the profile data were binarized separately at 200-base-pair resolution, based on a Poisson background model. The chromatin states were learned from the binarized data using a multivariate HMM. Fig. 5C shows a two-stage nested initialization procedure of hidden states. Ernst and Kellis [19] used an iterative learning expectation-maximization approach to infer state emission and transition parameters, with the best Bayesian Information Criterion (BIC). Fig. 5D shows that each 200-base-pair interval was then assigned to its most likely state under the model. The model assumes a fixed number of 51 distinct hidden states, including promoter-associated, transcription-associated, active intergenic, large-scale repressed, and repeat-associated states.
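The binarization step can be sketched as a per-bin test against a Poisson background. The text does not specify how the background rate or the significance cutoff is chosen, so both are assumptions here: the background mean is taken as the average count per bin, and the cutoff `p_threshold=1e-4` is an illustrative default. The function names are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as one minus the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for j in range(1, k):
        term *= lam / j  # P(X = j) obtained from P(X = j - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def binarize(counts, p_threshold=1e-4):
    """Presence (1) or absence (0) call for each bin of one chromatin mark.

    Assumption: the Poisson background mean is the average count per bin.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    lam = sum(counts) / len(counts)
    return [1 if poisson_sf(c, lam) < p_threshold else 0 for c in counts]
```

Computing the upper tail as one minus the cumulative sum loses precision for extremely small tail probabilities, which is acceptable for a cutoff of this size. Each mark would be binarized separately in this way, and the resulting vector of 0/1 calls per 200-base-pair bin would serve as the multivariate observation for the HMM.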