### Introduction

*CCL3L1*, for example, are associated with lower risks of HIV infection and the progression to AIDS [11]; furthermore, in a genome-wide study, the level of population differentiation at this locus was found to be extraordinary compared to that of other CNVs, suggesting that natural selection may have influenced *CCL3L1* copy number in humans [3].

The *F*_{ST} measure is useful for identifying regions of the genome affected by natural selection. In a recently published review paper, Holsinger and Weir define Wright's *F*-statistics (*F*_{ST} in particular) [12] and describe method-of-moments estimates and how *F*_{ST} estimates should be interpreted [13]. Although the authors also mention the maximum likelihood Bayesian estimates of *F*_{ST}, their description is very limited in identifying genomic regions under selection and does not present an application to datasets. They simply compared locus-specific estimates of *F*_{ST} with its genome-wide distribution, and therefore, probabilities are not attached to those with a higher *F*_{ST}.

*F*_{ST} at that locus will be larger than at loci in which among-population differences are purely a result of genetic drift. Therefore, genome scans that compare single-locus estimates of *F*_{ST} with the genome-wide background might identify regions of the genome that have been subjected to diversifying selection [15].

*F*_{ST} estimation by maximum likelihood Bayesian method with the publicly available high-resolution CNV dataset generated by Conrad et al. [4]. We will also show that this method identifies several genomic regions showing signals of natural selection. Even though we focus mainly on the CNV data here, the detailed steps of analysis are analogous for other types of molecular marker data, such as microsatellites, SNPs, and amplified fragment-length polymorphisms.

### Method-of-moment (or ANOVA) estimates of *F*_{ST}

### *F*-statistics

*F*-statistics (*F*_{ST}, *F*_{IT}, and *F*_{IS}) as a tool for describing the partitioning of genetic diversity within and among populations that are directly related to the rates of evolutionary processes, such as migration, mutation, and drift. Specifically, *F*-statistics can be defined in many different ways: in terms of variances of allele frequencies, correlations between random gametes, and probabilities that two gametes chosen at random have different alleles. Depending on whether the reference is the subpopulation or the total population, *F*_{ST}, *F*_{IT}, and *F*_{IS} are defined, where subscript IS refers to 'individuals within subpopulations,' ST to 'subpopulations within the total population,' and IT to 'individuals within the total population.'

*F*-statistics are defined in terms of the variance components - that is, the total variation in the genetic data is broken down into three components: (a) between subpopulations within the total population (we sometimes say 'between populations'); (b) between individuals within subpopulations; and (c) between gametes within individuals.

*F*_{ST}, *F*_{IT}, and *F*_{IS} are defined as the expectations under the model of a/(a + b + c), (a + b)/(a + b + c), and b/(b + c), respectively, and estimated by the corresponding sample values [17, 18]. Here, it is perhaps pertinent to mention that when Weir and Cockerham [18] presented these definitions, they assumed a model consisting of an ancestral population from which subpopulations have descended in isolation under the same evolutionary processes. Thus, it is meaningful to have a single measure of population structure; that is, a global *F*_{ST}, which is an average over subpopulations. However, in identifying candidate loci under natural selection, evidence for locus-specific selection is of interest, and thus the estimators of locus-specific *F*_{ST} will be described in the next section.
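As a quick check on these definitions, the three ratios satisfy the familiar identity linking the statistics. The display below is a worked derivation added for illustration, using only the components a, b, and c listed above; it is not a formula quoted from [17, 18].

```latex
\begin{aligned}
1 - F_{ST} &= \frac{b + c}{a + b + c}, \qquad
1 - F_{IS} = \frac{c}{b + c}, \qquad
1 - F_{IT} = \frac{c}{a + b + c}, \\
(1 - F_{IS})(1 - F_{ST}) &= \frac{c}{b + c} \cdot \frac{b + c}{a + b + c}
  = \frac{c}{a + b + c} = 1 - F_{IT}.
\end{aligned}
```

Read this way, *F*_{IT} combines the within-subpopulation and among-subpopulation components rather than adding independent information.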

*F*_{ST} as a measure of the progress of the subpopulation towards the fixation of one allele of each locus in the absence of mutation and hence called a 'fixation index.' *F*_{ST} is also interpreted as a measure of shared ancestry within the subpopulations, relative to that in the population, and is thus called the 'coancestry coefficient' [19]. Therefore, if the value of *F*_{ST} is small, it means that the allele frequencies of the subpopulations are similar to one another; if it is large, it means that the allele frequencies of the subpopulations are different. On the other hand, *F*_{IS} or *F*_{IT} is defined as the correlation between two gametes that form a zygote relative to the subpopulation or population, and thus, *F*_{IS} (or *F*_{IT}) is called the 'inbreeding coefficient' [19].
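The small-versus-large interpretation can be illustrated with the variance-of-allele-frequencies definition mentioned earlier. The sketch below is a toy calculation rather than an estimator used in this paper: it assumes one biallelic locus, allele frequencies that are known exactly, and equally weighted subpopulations. The function name is hypothetical.

```python
def fst_from_frequencies(freqs):
    """Variance-based F_ST for one biallelic locus.

    freqs holds the frequency of the same allele in each subpopulation.
    Assumptions of this sketch: frequencies are known without sampling
    error and all subpopulations carry equal weight.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two subpopulations")
    p_bar = sum(freqs) / len(freqs)
    if p_bar in (0.0, 1.0):
        # Monomorphic in the total population: nothing to partition.
        return 0.0
    var_p = sum((p - p_bar) ** 2 for p in freqs) / len(freqs)
    return var_p / (p_bar * (1.0 - p_bar))


# Similar frequencies give a value near 0; divergent ones a large value;
# alternative fixation gives exactly 1.
similar = fst_from_frequencies([0.50, 0.52, 0.48])
divergent = fst_from_frequencies([0.05, 0.50, 0.95])
fixed = fst_from_frequencies([0.0, 1.0])
```

With real samples, the ANOVA or Bayesian estimators discussed in this paper would be used instead, because they account for sampling error.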

### Estimating *F*_{ST} by ANOVA methods

*F*-statistics proposed by Weir [17] and Weir and Cockerham [18] are based on an analysis of variance (ANOVA) of allele frequencies, equivalently called the method-of-moments estimates. The weighted ANOVA estimates of *F*_{ST}, *F*_{IT}, and *F*_{IS} may be expressed in terms of the mean sum of squares for gametes (*MSG*), individuals (*MSI*), and populations (we sometimes say 'between subpopulations') (*MSP*), where the mean squares are estimated by an ANOVA model. In estimating *F*_{ST} specifically for our analysis of CNV data, we need to consider unbalanced samples (i.e., populations of unequal size). However, as the formulas are cumbersome, we present here those for balanced samples. Formulas for unbalanced samples can be found in Appendix A of Rousset [20].

The *F*-statistics used here are defined in terms of *Q* values, which are probabilities of identity in state: *Q*_{1} among the genes (gametes) within individuals, *Q*_{2} among genes in different individuals within populations, and *Q*_{3} among the populations. The estimates are expressed in terms of observed frequencies of identical pairs of genes in the sample, where *n* is the sample size of each population. The single-locus estimator is then built from the mean squares *MSG*, *MSI*, and *MSP*.

*n*_{c} will be defined below. If one needs to obtain the multilocus estimator of *F*_{ST}, it is usual to compute the estimator as a sum of locus-specific numerators over a sum of locus-specific denominators (see Weir [17] and Weir and Cockerham [18]). This is the case when map information for SNPs is obtained for each gene, and a weighted-average *F*_{ST} from all SNPs is estimated for each gene [18]. For a set of *I* loci, the multilocus ANOVA estimators are computed in this way, with *n*_{c} = (*S*_{1} - *S*_{2}/*S*_{1})/(*n* - 1), where *S*_{1} is the total sample size, *S*_{2} is the sum of squared sample sizes of populations, and *n* in the denominator is the number of sampled populations [21]. For convenience, we denote the estimator by *F*_{ST}.
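For orientation, the balanced-sample relationships can be sketched as follows. This is the standard identity-in-state form as we understand it, stated as an assumption rather than as a quotation of [17, 18, 20]. Here *n* is the number of diploid individuals sampled per population, and the last line follows from the variance-component ratio a/(a + b + c) given earlier, with c = *MSG*, b = (*MSI* - *MSG*)/2, and a = (*MSP* - *MSI*)/(2*n*).

```latex
\begin{aligned}
F_{IS} &= \frac{Q_1 - Q_2}{1 - Q_2}, \qquad
F_{ST} = \frac{Q_2 - Q_3}{1 - Q_3}, \qquad
F_{IT} = \frac{Q_1 - Q_3}{1 - Q_3}, \\
1 - \hat{Q}_1 &= MSG, \qquad
\hat{Q}_1 - \hat{Q}_2 = \frac{MSI - MSG}{2}, \qquad
\hat{Q}_2 - \hat{Q}_3 = \frac{MSP - MSI}{2n}, \\
\hat{F}_{ST} &= \frac{\hat{Q}_2 - \hat{Q}_3}{1 - \hat{Q}_3}
  = \frac{MSP - MSI}{MSP + (n - 1)\,MSI + n\,MSG}.
\end{aligned}
```

For unbalanced samples, the usual convention, again assumed here, is to replace *n* by *n*_{c} in the last expression.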

*F*_{ST}. To distinguish them from the method-of-moments estimates of Weir [17] and Weir and Cockerham [18], we will call the estimates from GENEPOP 'ANOVA estimates.'

*F*_{ST} can be negative when levels of differentiation are close to zero and/or sample sizes are small, indicating no population differentiation at these loci [18]. One can assign a value of zero to negative *F*_{ST} estimates.
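The steps in this section can be sketched in a few lines. This is a minimal illustration and not the GENEPOP implementation: it takes the mean squares as given, assumes the single-locus form (*MSP* - *MSI*)/(*MSP* + (*n*_{c} - 1)*MSI* + *n*_{c}*MSG*), and assumes one *n*_{c} shared by all loci (no missing data). All function names and input values are hypothetical.

```python
def n_c(sample_sizes):
    """n_c = (S1 - S2/S1) / (r - 1), where r is the number of populations."""
    r = len(sample_sizes)
    if r < 2:
        raise ValueError("need at least two populations")
    s1 = sum(sample_sizes)                  # total sample size
    s2 = sum(n * n for n in sample_sizes)   # sum of squared sample sizes
    return (s1 - s2 / s1) / (r - 1)


def fst_terms(msp, msi, msg, nc):
    """Numerator and denominator of the single-locus ANOVA estimator."""
    return msp - msi, msp + (nc - 1.0) * msi + nc * msg


def single_locus_fst(msp, msi, msg, nc, clip=True):
    """Single-locus estimate; negative values are set to zero if clip is True."""
    num, den = fst_terms(msp, msi, msg, nc)
    estimate = num / den
    return max(estimate, 0.0) if clip else estimate


def multilocus_fst(mean_squares, nc):
    """Sum of locus-specific numerators over sum of locus-specific denominators."""
    terms = [fst_terms(msp, msi, msg, nc) for msp, msi, msg in mean_squares]
    return sum(t[0] for t in terms) / sum(t[1] for t in terms)


# Hypothetical mean squares (MSP, MSI, MSG) for two loci, three populations.
nc = n_c([30, 30, 30])   # balanced case: n_c equals the common sample size
loci = [(2.0, 0.5, 0.4), (0.4, 0.5, 0.45)]
per_locus = [single_locus_fst(*ms, nc) for ms in loci]
combined = multilocus_fst(loci, nc)
```

The second locus has a slightly negative raw estimate, so it is reported as zero. The multilocus value still uses the raw numerators and denominators.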

### Identifying CNVs under Selection Using a Bayesian Method

### Identifying selection

*F*_{ST} is of great interest, in view of our increased knowledge of the relationships among human populations from genome-wide patterns of variation. The logic for identifying selection is straightforward. The pattern of genetic differentiation at a neutral locus is completely determined by the demographic history, migration rates among the populations, and the mutation rates at the loci. It is reasonable to assume that all autosomal loci have experienced the same demographic history and migration rates among the populations, and the observed population structure can be largely explained by random drift at neutral loci. However, as the individuals from different populations often vary genetically at a few key sites in their genome, loci showing unusually large amounts of differentiation may indicate regions of the genome that have been subject to positive selection, whereas loci showing unusually small amounts of differentiation may indicate regions of the genome that have been subject to stabilizing (balancing) selection [15, 22-24]. Thus, the outlier method makes it possible to detect divergences in some loci of the genome due to selection.
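The outlier logic can be sketched as a simple rank-based scan of locus-specific estimates against the genome-wide distribution. This is only an illustration of the empirical approach, with a hypothetical function name, made-up locus labels, and an arbitrary 5% tail. It attaches no probability to any locus, which is the limitation the Bayesian method addresses.

```python
import math


def tail_loci(fst_by_locus, tail_fraction=0.05):
    """Return loci in the lower and upper tails of the empirical F_ST distribution.

    fst_by_locus maps a locus label to its F_ST estimate. The tail size is
    an arbitrary choice of this sketch, not a recommended threshold.
    """
    ranked = sorted(fst_by_locus, key=fst_by_locus.get)
    k = max(1, math.ceil(tail_fraction * len(ranked)))
    return {
        "low": ranked[:k],     # candidates for balancing selection
        "high": ranked[-k:],   # candidates for positive/diversifying selection
    }


# Hypothetical estimates: 19 background loci and one strongly differentiated locus.
estimates = {f"cnv{i}": 0.02 + 0.002 * i for i in range(19)}
estimates["cnv_hi"] = 0.60
candidates = tail_loci(estimates, tail_fraction=0.05)
```

A fixed tail fraction always flags some loci, even when no locus is under selection.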

### Bayesian method

*p* and variance *p*(1 - *p*)/(1 + θ) so that *F*_{ST} = 1/(1 + θ). Then, they used a likelihood-based approach to estimate population- and locus-specific *F*_{ST}.
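The stated mean and variance can be checked numerically. The sketch below assumes the beta distribution is parameterized with shape parameters θ*p* and θ(1 - *p*), which is the two-allele case of a Dirichlet with parameters θ**p**. The function names and the example values are hypothetical.

```python
def beta_shapes(p, theta):
    """Shape parameters assumed in this sketch: (theta * p, theta * (1 - p))."""
    return theta * p, theta * (1.0 - p)


def beta_mean_variance(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    total = a + b
    return a / total, a * b / (total ** 2 * (total + 1.0))


def fst_from_theta(theta):
    """F_ST = 1 / (1 + theta)."""
    return 1.0 / (1.0 + theta)


p, theta = 0.3, 4.0
mean, variance = beta_mean_variance(*beta_shapes(p, theta))
fst = fst_from_theta(theta)

# The variance equals p(1 - p)/(1 + theta), i.e. F_ST * p * (1 - p).
expected_variance = p * (1.0 - p) / (1.0 + theta)
```

A large θ (small *F*_{ST}) keeps subpopulation frequencies close to *p*, while a small θ lets them drift far from it.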

*F*_{ST}: in the case of the migration-drift equilibrium model, *F*_{ST} measures how divergent each subpopulation is from the total population, while in the case of the fission model, it measures the degree of genetic differentiation between each descendant population and the ancestral population.

*J* subpopulations and a set of *I* loci. Let *K*_{i} be the number of alleles at the *i*th locus. The extent of differentiation between subpopulation *j* and the ancestral population at locus *i* is measured by *F*_{ST}^{ij}. Let **p**_{i} = {*p*_{ik}} denote the allele frequencies of the ancestral population at locus *i*, where *p*_{ik} is the frequency of allele *k* at locus *i* (Σ_{k} *p*_{ik} = 1). We use **p** = {**p**_{i}} to denote the entire set of allele frequencies of the ancestral population, with a corresponding vector for the current allele frequencies at locus *i* for subpopulation *j*. Under the model and the definitions above, the current allele frequencies at locus *i* in subpopulation *j* follow a Dirichlet distribution with parameters θ_{ij}**p**_{i}, a Bayesian prior distribution, where θ_{ij} = 1/*F*_{ST}^{ij} - 1. The extent of differentiation at locus *i* between subpopulation *j* and the ancestral population is measured by *F*_{ST}^{ij} and is the result of its demographic history. The full prior distribution across loci and populations is given by the product of these Dirichlet distributions over all loci and subpopulations.

*F*_{ST}^{ij} is decomposed into a population-specific component, β_{j}, shared by all loci and a locus-specific component, α_{i}, shared by all populations [22]:

log(*F*_{ST}^{ij}/(1 - *F*_{ST}^{ij})) = α_{i} + β_{j}

Instead of estimating *I*·*J* coefficients (as in the method of moments or ANOVA), we only have to estimate the *I* parameters α_{i} and the *J* parameters β_{j}. With the estimates of α_{i} and β_{j}, *F*_{ST}^{ij} (and equivalently θ_{ij} = exp(-(α_{i} + β_{j}))) can be estimated.

It is our objective to detect outlier loci as selection candidates. Departure from neutrality at a given locus is assumed when α_{i} is significantly different from 0 at that locus. A positive value of α_{i} suggests diversifying selection, whereas negative values suggest balancing selection. The posterior probability that a locus is subject to selection, *P*(α_{i} ≠ 0), is estimated directly from the Markov Chain Monte Carlo method (MCMC) (see below).
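The relationship between the locus and population effects and *F*_{ST}^{ij} can be made concrete with a short sketch. It uses only the two identities stated in the text, θ_{ij} = exp(-(α_{i} + β_{j})) and *F*_{ST} = 1/(1 + θ). The effect values are hypothetical and are not estimates from the CNV data.

```python
import math


def theta_ij(alpha_i, beta_j):
    """theta_ij = exp(-(alpha_i + beta_j))."""
    return math.exp(-(alpha_i + beta_j))


def fst_ij(alpha_i, beta_j):
    """F_ST^{ij} = 1 / (1 + theta_ij), the inverse logit of alpha_i + beta_j."""
    return 1.0 / (1.0 + theta_ij(alpha_i, beta_j))


# Hypothetical effects: two neutral loci (alpha = 0) and one with alpha > 0.
alphas = [0.0, 0.0, 1.5]   # I = 3 locus effects
betas = [-2.0, -1.0]       # J = 2 population effects

# Only I + J parameters are needed to fill the I x J table of F_ST values.
fst_table = [[fst_ij(a, b) for b in betas] for a in alphas]
```

In this toy table the locus with α_{i} = 1.5 has a higher *F*_{ST}^{ij} than the neutral loci in every population, which is the pattern interpreted as diversifying selection.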

### Likelihood for allele counts

*n*_{ij}. We use *a*_{ijk} to denote the number of alleles *k* observed at locus *i* in the sample from subpopulation *j*. Thus, *n*_{ij} = Σ_{k} *a*_{ijk}. The full dataset can be presented as a matrix **N** = {**a**_{ij}}, where **a**_{ij} = {*a*_{ij1}, *a*_{ij2}, ..., *a*_{ijK_i}} is the allele count at locus *i* for subpopulation *j*. The observed allele counts, **a**_{ij}, can be considered as sampled from the true allele frequencies and therefore can be described by the multinomial distribution.

The multinomial-Dirichlet distribution is used as the likelihood for **a**_{ij}, the allele frequency counts at a locus within a subpopulation, and the reasons for its adoption are explained in Foll and Gaggiotti [26].

As **p**_{i} = {*p*_{ik}} are unknown, they are estimated by introducing a noninformative Dirichlet prior, **p**_{i} ~ Dir(1, …, 1), into the Bayesian model [27].

Prior distributions are also specified for the population effects β_{j} and for the locus effects α_{i}. Their means and variances are chosen to improve convergence and for each *F*_{ST}^{ij} to have non-negligible density over almost the whole interval from 0 to 1.
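The sampling model in this section can be sketched as two log-likelihood functions. The first is the ordinary multinomial for allele counts given the subpopulation frequencies. The second is the multinomial-Dirichlet form obtained by integrating those frequencies out under a Dirichlet with parameters θ_{ij}**p**_{i}. This is a textbook form written under that assumption, not code taken from any published software, and the function and argument names are hypothetical.

```python
import math


def multinomial_loglik(counts, freqs):
    """log P(counts | freqs) under a multinomial sampling model."""
    n = sum(counts)
    out = math.lgamma(n + 1) - sum(math.lgamma(a + 1) for a in counts)
    for a, p in zip(counts, freqs):
        if a > 0:
            out += a * math.log(p)   # requires p > 0 wherever a > 0
    return out


def dirichlet_multinomial_loglik(counts, ancestral_freqs, theta):
    """log P(counts | ancestral_freqs, theta) with the subpopulation
    frequencies integrated out under Dirichlet(theta * ancestral_freqs).
    Requires every ancestral frequency to be strictly positive."""
    n = sum(counts)
    out = math.lgamma(n + 1) - sum(math.lgamma(a + 1) for a in counts)
    out += math.lgamma(theta) - math.lgamma(n + theta)
    for a, p in zip(counts, ancestral_freqs):
        out += math.lgamma(a + theta * p) - math.lgamma(theta * p)
    return out


# Hypothetical counts for one locus in one subpopulation (K_i = 3 alleles).
a_ij = [12, 5, 3]
p_i = [0.5, 0.3, 0.2]
loglik = dirichlet_multinomial_loglik(a_ij, p_i, theta=9.0)
```

As θ grows (*F*_{ST}^{ij} approaches 0), the second function converges to the multinomial evaluated at the ancestral frequencies. Smaller θ spreads probability over count vectors further from those frequencies.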

### Identifying CNVs subject to selection

α_{i} in the logistic regression. Thus, two models in the logistic regression, one that includes both effects of α_{i} (i.e., α_{i} ≠ 0) and β_{j} (selection model, M2) and another one that does not include the effect of selection (i.e., α_{i} = 0) and only includes the effect of β_{j} (neutral model, M1), are considered.

**N** = {**a**_{ij}}. We use a reversible jump MCMC algorithm [28] to estimate the posterior probability of each one of these models, and this is done separately for each locus *i*. We can then have the posterior probability that a locus is subject to selection - that is, *P*(α_{i} ≠ 0). This probability is estimated directly from the output of the MCMC by simply counting the number of times α_{i} is included in the model (see Foll and Gaggiotti [27] for a more detailed explanation).

The Bayes factor (BF) is *P*(**N** | M2)/*P*(**N** | M1), where **N** = {**a**_{ij}} is a dataset and *P* indicates probability. This BF provides a degree of evidence in favor of one model versus another. In the context of multiple testing - that is, testing a large number of loci simultaneously - we also need to incorporate our skepticism about the chance that each locus is under selection. This is done by setting the prior odds for the neutral model, *P*(M1)/*P*(M2). We make selection decisions by using the posterior odds (PO), defined as PO = *P*(M2 | **N**)/*P*(M1 | **N**) = BF × *P*(M2)/*P*(M1). PO are simply the ratio of posterior probabilities and indicate how much more likely the model with selection (M2) is compared to the neutral model (M1) (BayeScan software). Values of log_{10}(PO) of 0.5-1.0, 1.0-1.5, 1.5-2.0, and 2-∞ are, respectively, considered as substantial, strong, very strong, and decisive evidence for selection [14].
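The decision rule can be sketched as follows. The code assumes that the evidence thresholds apply to log_{10}(PO), as in Jeffreys' scale, and that the posterior probability of M2 is the fraction of retained MCMC iterations in which α_{i} was in the model. The function names, the label for values below 0.5, and the numbers are hypothetical.

```python
import math


def posterior_odds(bayes_factor, prior_odds_neutral):
    """PO = BF * P(M2)/P(M1), where prior_odds_neutral = P(M1)/P(M2)."""
    return bayes_factor / prior_odds_neutral


def posterior_odds_from_mcmc(n_included, n_iterations):
    """PO = P(M2 | N) / P(M1 | N), with P(M2 | N) estimated by counting."""
    p_selection = n_included / n_iterations
    if p_selection >= 1.0:
        return math.inf   # alpha_i was included in every retained iteration
    return p_selection / (1.0 - p_selection)


def evidence_label(po):
    """Map log10(PO) to the evidence categories listed in the text."""
    log_po = math.log10(po) if po > 0 else -math.inf
    if log_po >= 2.0:
        return "decisive"
    if log_po >= 1.5:
        return "very strong"
    if log_po >= 1.0:
        return "strong"
    if log_po >= 0.5:
        return "substantial"
    return "below substantial"


# A Bayes factor of 1000 with prior odds of 10 in favour of neutrality.
po_a = posterior_odds(1000.0, 10.0)
# alpha_i included in 4500 of 5000 retained iterations.
po_b = posterior_odds_from_mcmc(4500, 5000)
labels = (evidence_label(po_a), evidence_label(po_b))
```

Raising the prior odds for the neutral model lowers PO for every locus, which is how skepticism about selection enters when many loci are tested at once.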

### Empirical Distribution of *F*_{ST} Based on CNV

### Study populations

### Empirical distribution of *F*_{ST}

*F*_{ST} was 0.0853 ± 0.1195 (median, 0.0360; range, -0.0071 to 0.8994) for the ANOVA method and 0.2459 ± 0.0115 (median, 0.2462; range, 0.2193 to 0.6166) for the Bayesian method. The empirical distributions of *F*_{ST} are shown in Fig. 1. As can be seen from the figure, low *F*_{ST} values are prevalent, and a large variability persisted in the ANOVA method, while the *F*_{ST} distribution was sharply concentrated in the Bayesian method. The comparison of empirical distributions of the ANOVA and Bayesian methods demonstrates that Bayesian estimates perform better than ANOVA in identifying CNVs affected by natural selection.

### Bayesian outlier detections

### Outliers detected by BayeScan

### Conclusion

*F*_{ST} and also on sampling variances of *F*_{ST}. Very large variances that are associated with single locus moment estimates of *F*_{ST} preclude the use of these estimates to detect selection in spite of the fact that sampling variances will decrease with the number of alleles at a locus and with the numbers of populations sampled. In this respect, the availability of locus- and population-specific Bayesian estimates of *F*_{ST} provides a set of tools for identifying genomic regions or populations with unusual evolutionary histories. The most important benefits of Bayesian estimates and the associated selection method are that they allow probability statements to be made about *F*_{ST} and can be extended to explore the relationship with demographic or environmental covariates in the model [26]. Furthermore, likelihood-based Bayesian methods have the flexibility to accommodate missing data. However, implementations of Bayesian methods may be computationally demanding.