### Introduction

### Methods

### Haplotype block partition methods

*Haploview (CI, FGT, and SS)*

*D*′ [3,17]: (1) “strong LD” if the one-sided upper 95% confidence bound of *D*′ is >0.98 and the lower bound is >0.7, (2) “strong evidence for historical recombination” if the upper confidence bound of *D*′ is <0.9, and (3) “non-informative” otherwise. Pairs falling into category (1) or (2) are said to be informative. Once all marker pairs have been classified into the three categories, a region is defined as a haplotype block if the outer-most marker pair (the two markers at the starting and ending positions of the region) is in “strong LD” and the proportion of “strong LD” marker pairs among all informative marker pairs in the region is greater than 0.95. To partition a genomic region into an optimal set of haplotype blocks, the CI algorithm adopts a greedy approach: at each iteration, it finds the longest block in the remaining region by examining the proportion of “strong LD” marker pairs among all informative marker pairs located between each candidate outer-most marker pair. In this way, the CI algorithm adds only blocks that do not overlap with already selected blocks.
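The classification and greedy selection described above can be sketched as follows. This is a minimal illustration, not Haploview's implementation: the function names, the `labels` dictionary keyed by marker-index pairs, and the choice to measure block length in number of markers (rather than base pairs) are assumptions made for the example, and the confidence bounds of *D*′ are taken as given inputs.

```python
from itertools import combinations

STRONG_LD = "strong_ld"
RECOMBINATION = "recombination"
NON_INFORMATIVE = "non_informative"


def classify_pair(ci_lower, ci_upper):
    """Classify one marker pair from the confidence bounds of D'."""
    if ci_upper > 0.98 and ci_lower > 0.7:
        return STRONG_LD
    if ci_upper < 0.9:
        return RECOMBINATION
    return NON_INFORMATIVE


def is_block(labels, start, end, min_fraction=0.95):
    """Check the block definition for the region start..end (inclusive)."""
    # The outer-most pair itself must be in strong LD.
    if labels[(start, end)] != STRONG_LD:
        return False
    strong = informative = 0
    for pair in combinations(range(start, end + 1), 2):
        if labels[pair] == STRONG_LD:
            strong += 1
            informative += 1
        elif labels[pair] == RECOMBINATION:
            informative += 1
    # informative >= 1 here because the outer-most pair is strong LD.
    return strong / informative > min_fraction


def greedy_partition(labels, n_markers):
    """Take the longest qualifying regions first, skipping overlaps."""
    candidates = [
        (s, e)
        for s, e in combinations(range(n_markers), 2)
        if is_block(labels, s, e)
    ]
    # Assumption: "longest" is measured in marker indices, ties by position.
    candidates.sort(key=lambda se: (-(se[1] - se[0]), se[0]))
    chosen = []
    for s, e in candidates:
        if all(e < cs or s > ce for cs, ce in chosen):
            chosen.append((s, e))
    return sorted(chosen)


# Toy input: markers 0-2 and markers 3-4 form two groups in strong LD,
# and every pair spanning the two groups shows historical recombination.
groups = [0, 0, 0, 1, 1]
labels = {
    (i, j): classify_pair(0.8, 0.99) if groups[i] == groups[j]
    else classify_pair(0.1, 0.5)
    for i, j in combinations(range(5), 2)
}
print(greedy_partition(labels, 5))  # [(0, 2), (3, 4)]
```

Enumerating every candidate region is quadratic in the number of markers and each check is itself quadratic, so this sketch is only suitable for small regions; it is meant to make the definition concrete rather than to be efficient.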

(*i,j*)-element represents the strength of LD between the *i*^{th} and *j*^{th} markers, the spine of strong LD stretches along the edge of the triangular block pattern.

*MIG++ and S-MIG++*

### Experimental evaluation

### Results

r^{2} values for all pairs of SNPs within a block and the average r^{2} values for pairs in which the two SNPs belong to consecutive blocks, and the average size of haplotype blocks in terms of the number of SNPs in a block and the base-pair (bp) length of haplotype blocks. When we compared the haplotype block partition results based on 20% of the original SNP markers with the results based on all SNPs in the 1000G dataset, the total number of haplotype blocks was reduced to about 40% in the CI, PLINK-MIG++, and S-MIG++ results and to about 30% in the FGT and SS results. Excluding the singleton blocks from the comparison, the amount of reduction was about 40% for all methods. The average length of the haplotype blocks based on 20% of the original SNP markers in the 1000G dataset was about 1.8 times the length of the blocks produced using all SNPs for CI, PLINK-MIG++, and S-MIG++, and about 2.4 times for FGT and SS. As the sampling ratio of SNP markers changed, the average r^{2} within a block remained almost unchanged, but the average r^{2} across consecutive blocks increased with the sampling ratio. The average r^{2} values of S-MIG++ within a block and across consecutive blocks are slightly lower and slightly higher, respectively, than those of CI and PLINK-MIG++, and its haplotype block sizes differ more markedly from those of the two methods, even though S-MIG++ uses the same operational definition of an LD block. This shows that the fractional sampling adopted by S-MIG++ for computational efficiency also affects the LD block construction results.
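To make the two r^{2} summaries concrete, the sketch below computes the average r^{2} over SNP pairs within a block and over SNP pairs spanning two consecutive blocks. It is an illustration under stated assumptions, not the evaluation code used here: the function names, the representation of blocks as lists of SNP indices, and the precomputed symmetric matrix `r2` are all hypothetical.

```python
from itertools import combinations, product


def mean(values):
    """Arithmetic mean, or NaN when there are no pairs to average."""
    return sum(values) / len(values) if values else float("nan")


def mean_r2_within(r2, blocks):
    """Average r^2 over all SNP pairs that share a block."""
    return mean([
        r2[i][j]
        for block in blocks
        for i, j in combinations(block, 2)
    ])


def mean_r2_consecutive(r2, blocks):
    """Average r^2 over pairs with one SNP in each of two adjacent blocks."""
    return mean([
        r2[i][j]
        for left, right in zip(blocks, blocks[1:])
        for i, j in product(left, right)
    ])


# Toy data (made up for illustration): two blocks of two SNPs each.
r2 = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
]
blocks = [[0, 1], [2, 3]]

print(round(mean_r2_within(r2, blocks), 3))       # 0.85
print(round(mean_r2_consecutive(r2, blocks), 3))  # 0.15
```

Singleton blocks contribute no within-block pairs in this sketch, but they still contribute pairs to the consecutive-block average.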