### Introduction

^{-7} for 500,000 (500 K) SNPs, p = 5 × 10^{-8} for 1 million (M) SNPs) [16, 19]. Therefore, estimating a sufficient sample size to achieve adequate statistical power is critical in the design stage of genetic association studies [20-24].

_{0}) while the alternative hypothesis (H_{A}) is true. It is affected by many factors; for instance, a larger sample size is required to achieve sufficient statistical power. An effective sample size can be defined as the minimum number of samples that achieves adequate statistical power (e.g., 80% power). Even when a researcher has collected a large number of samples, not all of them may need to be analyzed to detect evidence for association. A large sample size improves the ability to predict disease; however, genotyping more than the effective sample size is not cost-effective [25]. If researchers do not estimate sample size and statistical power at the research design stage, time and resources are wasted on sample collection. On the other hand, a sample size too small to detect true evidence for an association increases the false negative rate and reduces the reliability of a study. False negative rates are also increased by factors that cause systematic biases, and such biases reduce statistical power [26]. A statistical power of 80% is widely used to avoid false negative associations and to determine a cost-effective sample size in large-scale association studies [7, 22, 23]. However, many researchers tend to overlook the importance of statistical power and sample size calculations.
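The link between power and the effective sample size can be illustrated with a minimal sketch. It uses a generic two-sided one-sample z-test rather than a genetic association test, and the standardized effect size of 0.2 is a hypothetical value chosen only for illustration; the function names are likewise assumptions, not part of the original study.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test.

    effect_size is a hypothetical standardized effect (mean shift / SD).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(effect_size) * n ** 0.5
    # Probability that the test statistic lands in either rejection region
    # when the alternative hypothesis is true.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def minimum_sample_size(effect_size: float, target_power: float = 0.80,
                        alpha: float = 0.05, n_max: int = 1_000_000) -> int:
    """Smallest n whose power reaches target_power (simple linear search)."""
    for n in range(1, n_max + 1):
        if z_test_power(effect_size, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max samples")


n_required = minimum_sample_size(effect_size=0.2)
print(n_required, round(z_test_power(0.2, n_required), 3))
```

Samples beyond `n_required` raise power further but are not needed to reach the 80% target, which is the sense in which the effective sample size is a minimum.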

### Methods

_{het}) were 1.3, 1.5, 2, and 2.5. The power and sample sizes were calculated under different ranges of factors, such as MAF of 5%, 10%, 20%, and 30%; LD of 0.4, 0.6, 0.8, and 1; disease prevalence of 0.01%, 0.1%, 5%, and 10%; and case-to-control ratio of 1:1, 1:2, 1:3, and 1:4. We assumed Hardy-Weinberg equilibrium at the disease-susceptible allele.
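As a small illustration of the Hardy-Weinberg assumption, the sketch below computes the expected genotype frequencies for the four MAF values listed above. The genotype labels (`AA`, `Aa`, `aa`, with `a` as the minor allele) and the function name are illustrative choices, not notation from the study.

```python
def hwe_genotype_frequencies(maf: float) -> dict:
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    maf is the minor allele frequency; 'a' denotes the minor allele.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must lie between 0 and 0.5")
    major = 1.0 - maf
    return {
        "AA": major ** 2,        # homozygous for the major allele
        "Aa": 2 * maf * major,   # heterozygous
        "aa": maf ** 2,          # homozygous for the minor allele
    }


for maf in (0.05, 0.10, 0.20, 0.30):
    freqs = hwe_genotype_frequencies(maf)
    summary = ", ".join(f"{g}={f:.4f}" for g, f in freqs.items())
    print(f"MAF {maf:.2f}: {summary}")
```

At a MAF of 5%, minor-allele homozygotes are expected in only 0.25% of the population, which is one reason low-frequency variants need larger samples.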

^{-7} for 500 K SNP markers, and p = 5 × 10^{-8} for 1 M SNP markers). We restricted the range of sample sizes to 100 to 2,000 cases, because the power is too low when the sample size is below 100 cases (or trios), and the cost of collecting samples becomes unrealistic when the sample size is above 2,000 [7, 22].
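The genome-wide thresholds are consistent with a Bonferroni correction of a 5% family-wise type I error rate. The sketch below makes that arithmetic explicit; the family-wise α of 0.05 is an assumption here, chosen because it reproduces the 5 × 10^{-8} threshold quoted for 1 M SNP markers.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


FAMILY_WISE_ALPHA = 0.05  # assumed family-wise type I error rate

for n_snps in (500_000, 1_000_000):
    threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, n_snps)
    print(f"{n_snps:>9,} SNPs: p < {threshold:.0e}")
```

The correction treats every SNP as an independent test, so it is conservative when markers are in LD with one another.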

### Results

_{het}, MAF, disease prevalence, LD, and case-to-control ratio by allowing a 5% type I error rate are shown in Fig. 1. As shown in Fig. 1A, a smaller sample size was required to test allelic association for a single SNP with a larger MAF at the same disease risk (OR) under the assumptions of 5% disease prevalence, 5% α, complete LD, and a 1:1 case-to-control ratio. The minimum number of cases decreased from 1,974 for a SNP with a MAF of 5% to 545 for a SNP with a MAF of 30% under the same assumptions. A high-risk allele with a high OR likewise requires a smaller sample size to be detected under the same assumptions. Whereas an allele with an OR of 1.3 requires 1,974 cases and 1,974 controls to be detected at significance in a case-control study, a SNP with an OR of 2.5 can be detected in a study of 134 cases and 134 controls under the assumption of a MAF of 5%, disease prevalence of 5%, type I error rate of 5%, and D' of 1 (Fig. 1A).

Higher prevalence and higher LD were associated with increased statistical power: for instance, as the LD increased from 0.4 to 0.6, 0.8, and 1, the statistical power of a study of 1,000 cases and 1,000 controls increased from 26.5% to 49.2%, 72.8%, and 88.4%, respectively, under the assumption of an OR of 1.3, 5% MAF, 5% prevalence, and a 5% α level (Fig. 1B and 1C).

In addition, a 1:4 case-to-control ratio, which is the gold standard ratio for the numbers of cases and controls to be collected in a case-control study, gave the most effective sample size to achieve 80% statistical power. In many clinical settings, researchers are able to obtain more data from affected individuals than from healthy individuals. On the other hand, there are more healthy participants than participants with a disease in a population-based study. Therefore, the minimum numbers of cases and controls required to achieve 80% statistical power depend on the study design. For a SNP with an allelic OR of 2 and 5% MAF, 127 cases and 508 controls are required with a 1:4 case-to-control ratio, whereas 248 cases and 248 controls are required with a 1:1 ratio to achieve 80% statistical power under the assumption of 5% prevalence, complete LD, and a 5% α level.
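The qualitative trends above (fewer cases needed at higher MAF, at higher OR, and with more controls per case) can be reproduced with a simplified sketch. It is not the calculation used in the study: it assumes a two-proportion normal approximation to the allelic test, treats the OR as an allelic odds ratio, takes the control allele frequency to be the population MAF (ignoring disease prevalence), and assumes the causal variant itself is genotyped. It will therefore not match the case numbers in Fig. 1 exactly, and all function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def case_allele_frequency(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def allelic_test_power(n_cases: int, controls_per_case: float,
                       control_freq: float, odds_ratio: float,
                       alpha: float = 0.05) -> float:
    """Approximate power of a two-sided allelic test (two-proportion z-test)."""
    p_ctrl = control_freq
    p_case = case_allele_frequency(p_ctrl, odds_ratio)
    case_alleles = 2 * n_cases                      # two alleles per person
    ctrl_alleles = 2 * n_cases * controls_per_case
    std_err = (p_case * (1 - p_case) / case_alleles
               + p_ctrl * (1 - p_ctrl) / ctrl_alleles) ** 0.5
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(p_case - p_ctrl) / std_err
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def minimum_cases(controls_per_case: float, control_freq: float,
                  odds_ratio: float, target_power: float = 0.80,
                  alpha: float = 0.05, max_cases: int = 100_000) -> int:
    """Smallest number of cases reaching target_power (linear search)."""
    for n_cases in range(2, max_cases + 1):
        power = allelic_test_power(n_cases, controls_per_case,
                                   control_freq, odds_ratio, alpha)
        if power >= target_power:
            return n_cases
    raise ValueError("target power not reached within max_cases")


for maf in (0.05, 0.30):
    print(f"OR 1.3, MAF {maf:.2f}, 1:1 ->", minimum_cases(1, maf, 1.3), "cases")
for ratio in (1, 4):
    print(f"OR 2.0, MAF 0.05, 1:{ratio} ->", minimum_cases(ratio, 0.05, 2.0), "cases")
```

Because the number of controls is tied to the number of cases through `controls_per_case`, the 1:4 design needs fewer cases but more participants overall, which is the trade-off the study design has to weigh.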

_{het} of 2 and 5% MAF in a case-control study, whereas 282 case-parent trios (846 individuals) were required under the assumption of 5% disease prevalence and complete LD by allowing a 5% α level. However, the sample sizes required in both study designs increase tremendously in a GWAS. Under the same assumptions, the number of cases required increased from 248 for a single-SNP analysis to 1,206 and 1,255 for analyses of 500 K SNPs and 1 M SNPs, respectively, based on the p-value thresholds calculated using a strict Bonferroni correction for multiple hypothesis comparisons. For the same number of subjects, statistical power was higher for the case-control design than for the case-parent trio design (Fig. 2).
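The size of this increase can be checked with a rough sketch. Under a normal approximation (an assumption made here, not necessarily the authors' method), the sample size needed for a fixed effect scales with the square of the sum of the two-sided critical value and the normal quantile for the target power. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sample_size_factor(alpha: float, power: float = 0.80) -> float:
    """(z_{1-alpha/2} + z_{power})^2, to which the required n is proportional."""
    z_alpha = -STD_NORMAL.inv_cdf(alpha / 2)  # lower tail avoids rounding near 1
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) ** 2


def inflation(alpha_gwas: float, alpha_single: float = 0.05,
              power: float = 0.80) -> float:
    """Fold-increase in sample size when moving to a stricter threshold."""
    return sample_size_factor(alpha_gwas, power) / sample_size_factor(alpha_single, power)


for label, alpha in (("500 K SNPs", 1e-7), ("1 M SNPs", 5e-8)):
    print(f"{label}: about {inflation(alpha):.2f}x the single-SNP sample size")
```

The resulting factors are roughly 4.8 and 5.0, broadly in line with the reported increase from 248 cases to 1,206 and 1,255 cases.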