Prediction of Quantitative Traits Using Common Genetic Variants: Application to Body Mass Index

Article information

Genomics Inform. 2016;14(4):149-159

Publication date (electronic) : 2016 December 30

doi : https://doi.org/10.5808/GI.2016.14.4.149

Sunghwan Bae¹,², Sungkyoung Choi¹,², Sung Min Kim², Taesung Park¹,²,³

¹Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, Korea.

²Bioinformatics and Biostatistics Lab, Seoul National University, Seoul 08826, Korea.

³Department of Statistics, Seoul National University, Seoul 08826, Korea.

Corresponding author: Tel: +82-2-880-8924, Fax: +82-2-883-6144, tspark@stats.snu.ac.kr

Received 2016 November 21; Revised 2016 December 06; Accepted 2016 December 06.

Abstract

With the success of genome-wide association studies (GWASs), many candidate loci for complex human diseases have been reported in the GWAS catalog. Recently, many disease prediction models based on penalized regression or statistical learning methods have been proposed using candidate causal variants from significant single-nucleotide polymorphisms of GWASs. However, there have been only a few systematic studies comparing existing methods. In this study, we first constructed risk prediction models, such as stepwise linear regression (SLR), least absolute shrinkage and selection operator (LASSO), and Elastic-Net (EN), using a GWAS chip and the GWAS catalog. We then compared the prediction accuracy by calculating the mean square error (MSE) for body mass index on data from the Korea Association Resource (KARE). Our results show that SLR provides a smaller MSE value than the other methods, while the numbers of selected variables in each model were similar.

Keywords: body mass index; clinical prediction rule; genome-wide association study; penalized regression models; variable selection

Introduction

With the development of genotyping technologies, many disease-related genetic variants have been identified by genome-wide association studies (GWASs). Diagnosis and disease risk prediction based on these genetic variants have improved accordingly [1]. Direct-to-consumer genetic companies, such as 23andMe (http://www.23andme.com/) and Pathway Genomics (https://www.pathway.com/), provide personal genome information services. For example, the BRCA1 and BRCA2 genes play important roles in breast cancer diagnosis and clinical treatment [2, 3]. While several disease prediction studies have been conducted using disease-related genetic variants, disease risk prediction has some limitations. First, constructing a disease risk prediction model is difficult because the number of genetic variants is typically larger than the number of individuals (the “large p small n” problem). Second, the effect sizes of genetic variants for most complex human diseases are small, and missing heritability exists [4]. Moreover, correlation among single-nucleotide polymorphisms (SNPs) due to linkage disequilibrium (LD) causes some loss of statistical power to identify significant associations [5]. Multicollinearity due to high LD among SNPs also inflates the variance of coefficient estimates. To address these issues, various statistical approaches have recently been proposed.

Initially, a gene score (GS) was computed using statistical models for disease risk prediction [6, 7, 8]. These risk prediction models were built from GSs by summing the marginal effects of the disease-associated genetic variants. Several studies have shown that GS is useful for risk prediction [9]. However, the accuracy of the risk prediction is poor when joint effects exist between multiple genetic variants [10, 11].
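As a minimal sketch of the gene score idea, the snippet below sums per-SNP marginal effect sizes weighted by each individual's minor-allele counts. The SNP names, effect sizes, and genotypes are hypothetical placeholders and are not taken from any of the cited studies.

```python
def gene_score(genotypes, effects):
    """Sum marginal effects weighted by minor-allele count (0, 1, or 2)."""
    return sum(effects[snp] * count
               for snp, count in genotypes.items()
               if snp in effects)

# Hypothetical marginal effects, e.g., from separate single-SNP regressions.
effects = {"snp_a": 0.30, "snp_b": -0.10, "snp_c": 0.05}

# Hypothetical minor-allele counts for one individual.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = gene_score(person, effects)
print(round(score, 2))  # 0.5
```

Because each effect enters the sum independently, a score of this kind cannot capture joint effects between variants, which is the limitation noted above.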

Building a risk prediction model using multiple SNPs is an effective way to improve disease risk prediction. Multiple logistic regression (MLR) is one of the typical traditional approaches. Several studies have shown the usefulness of an MLR-based approach for creating disease risk prediction models [12, 13, 14]. However, the parameter estimation of MLR becomes unstable, and the predictive power of the risk prediction model decreases if there is high LD among SNPs.

To address the “large p small n” problem, many penalized regression approaches, such as ridge [15, 16, 17], least absolute shrinkage and selection operator (LASSO) [18], and Elastic-Net (EN) [19], have been proposed. For high-dimensional data, these penalized approaches have several advantages in variable selection, as well as in prediction, over non-penalized approaches. For example, several researchers showed that using a large number of SNPs with penalized regression approaches improves the accuracy of Crohn's disease and bipolar disorder risk prediction [20, 21].

It is important to build a risk prediction model that pertains to discrete variables, such as disease diagnosis. It is also important to make predictions based on continuous variables, such as human health-related outcomes. When using medicines to treat diseases, we can use genetic information to calculate the dosage, in addition to basic physical information, such as height and weight. For example, there is a prediction model for warfarin responsiveness that was made with multivariate linear regression [22]. We can apply such a model directly to disease treatment.

In this study, we focus on the prediction of quantitative traits using common genetic variants. We systematically compared the performance of prediction models using real data from the Korea Association Resource (KARE). We first selected the prediction variables using statistical methods, namely stepwise linear regression (SLR), LASSO, and EN. We then constructed prediction models using the same three methods. Finally, we compared the predictive accuracy by calculating the mean square error (MSE) value for predicting body mass index (BMI). Overall, our results show that LASSO and SLR provide the smallest MSE value among the compared methods.

Methods

Data

The KARE project, which began in 2007, is a community-based cohort from the Anseong and Ansan regions. After excluding SNPs that met any of the quality control exclusion criteria (Hardy-Weinberg equilibrium p < 10⁻⁶, genotype call rate < 95%, or minor allele frequency < 0.01), 352,228 SNPs were used for analysis. In addition, after eliminating 401 samples with call rates less than 96%, 11 contaminated samples, 41 gender-inconsistent samples, 101 samples with serious concomitant illness, 608 cryptically related samples, and 4 samples with missing phenotypes, 8,838 participants were analyzed [23]. Table 1 summarizes the demographic information. In addition, Fig. 1 shows box plots of BMI for the given demographic variables.
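The SNP quality control step can be sketched as a simple filter. The thresholds are those stated in the text; the per-SNP summary statistics are hypothetical, and treating a value exactly at a threshold as passing is an assumption of this sketch.

```python
# Thresholds from the text; SNPs below any of them are excluded.
HWE_P_MIN = 1e-6       # Hardy-Weinberg equilibrium p-value
CALL_RATE_MIN = 0.95   # genotype call rate
MAF_MIN = 0.01         # minor allele frequency

def passes_snp_qc(hwe_p, call_rate, maf):
    """Return True if a SNP is not excluded by any of the three criteria."""
    return hwe_p >= HWE_P_MIN and call_rate >= CALL_RATE_MIN and maf >= MAF_MIN

# Hypothetical per-SNP summary statistics, for illustration only.
snps = [
    {"id": "snp_1", "hwe_p": 0.40, "call_rate": 0.99, "maf": 0.20},
    {"id": "snp_2", "hwe_p": 1e-9, "call_rate": 0.99, "maf": 0.20},   # fails HWE
    {"id": "snp_3", "hwe_p": 0.40, "call_rate": 0.90, "maf": 0.20},   # fails call rate
    {"id": "snp_4", "hwe_p": 0.40, "call_rate": 0.99, "maf": 0.005},  # fails MAF
]

kept = [s["id"] for s in snps
        if passes_snp_qc(s["hwe_p"], s["call_rate"], s["maf"])]
print(kept)  # ['snp_1']
```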

Table 1.

Demographic variables for KARE cohort

Fig. 1

Box plots of body mass index (BMI) for the given demographic variables.

Statistical analysis

We selected SNPs from the KARE data by single-SNP analysis and collected BMI-associated SNPs from the GWAS catalog [24]. We then built quantitative prediction models in two steps: first, we selected variables using SLR, LASSO, and EN; second, we built quantitative prediction models using the same three methods.

SNP sets

First, we collected the SNPs registered in the GWAS catalog for BMI for three different populations (overall, Asian-only, and Korean-only). Second, SNPs were selected by single-SNP analysis using linear regression with adjustments for sex, age, and area, and were chosen based on their p-values. We considered the following seven SNP sets:

(1) ASIAN-100 (GWAS catalog [Asia] + single-SNP analysis, number of SNPs = 100)

(2) KOREAN-100 (GWAS catalog [Korea] + single-SNP analysis, number of SNPs = 100)

(3) ALL-200 (GWAS catalog [All] + single-SNP analysis, number of SNPs = 200)

(4) ASIAN-200 (GWAS catalog [Asia] + single-SNP analysis, number of SNPs = 200)

(5) KOREAN-200 (GWAS catalog [Korea] + single-SNP analysis, number of SNPs = 200)

(6) GWAS-ALL (GWAS catalog [All], number of SNPs = 136)

(7) GWAS-ASIAN (GWAS catalog [Asia], number of SNPs = 16)
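One way to read the combined sets (for example, ASIAN-100 consists of 16 catalog SNPs plus 84 SNPs from single-SNP analysis) is sketched below. It assumes that catalog SNPs are taken first and that the remaining slots are filled by ascending p-value, skipping SNPs already in the set; the text does not state how overlaps were handled, and the identifiers and p-values here are hypothetical.

```python
def build_snp_set(catalog_snps, pvalues, total):
    """Catalog SNPs first, then top single-SNP hits by p-value up to `total`."""
    chosen = list(dict.fromkeys(catalog_snps))  # drop duplicates, keep order
    seen = set(chosen)
    for snp in sorted(pvalues, key=pvalues.get):  # ascending p-value
        if len(chosen) >= total:
            break
        if snp not in seen:
            chosen.append(snp)
            seen.add(snp)
    return chosen[:total]

# Hypothetical catalog list and single-SNP p-values.
catalog = ["cat_1", "cat_2"]
pvalues = {"snp_a": 0.01, "snp_b": 0.001, "cat_1": 0.5, "snp_d": 0.03}

snp_set = build_snp_set(catalog, pvalues, total=4)
print(snp_set)  # ['cat_1', 'cat_2', 'snp_b', 'snp_a']
```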

Step 1: Variable selection

In the KARE data, out of 8,838 individuals, we randomly selected 1,767 as the test set and used the remaining 7,071 participants as the training set. We then selected SNPs with SLR, LASSO, and EN using 5-fold cross-validation (CV) of the training set.

The SLR model is one of the most widely used models. Let y_i be the quantitative phenotype for subject i = 1, …, n; x_ij be the value of SNP j = 1, …, p for subject i, coded as 0, 1, or 2 according to the number of minor alleles; and ε_i be the error term for subject i. The SLR model is
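The split sizes above can be reproduced with a short sketch. The random seed and the interleaved assignment of training individuals to folds are assumptions made for illustration; the text does not describe how the folds were formed.

```python
import random

def split_and_fold(n_total=8838, n_test=1767, k=5, seed=0):
    """Randomly hold out a test set, then split the rest into k CV folds."""
    ids = list(range(n_total))
    random.Random(seed).shuffle(ids)          # seed is an arbitrary choice
    test, train = ids[:n_test], ids[n_test:]
    folds = [train[i::k] for i in range(k)]   # near-equal fold sizes
    return train, test, folds

train, test, folds = split_and_fold()
print(len(train), len(test), [len(f) for f in folds])
# 7071 1767 [1415, 1414, 1414, 1414, 1414]
```

In each CV round, one fold is set aside for validation and the other four are used for fitting, so that every training individual is used for validation exactly once.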

y_i = β₀ + β₁x_i1 + ... + β_px_ip + γ₁sex_i + γ₂age_i + γ₃area_i + ε_i,

where β₀ is the intercept and β_j is the effect size of SNP j. γ₁, γ₂, and γ₃ are the effects of the covariates sex, age, and area, respectively, and sex_i, age_i, and area_i are their values for the i-th individual. Variable selection was performed by an MSE-based stepwise procedure, using the R package “MASS” [25].
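To illustrate how a stepwise procedure builds a model, the following is a simplified forward-only selection that adds one SNP at a time while the Akaike information criterion (AIC) improves. It is not the procedure used in the study: the AIC criterion, the forward-only search, the omission of the covariates, and the toy data are all assumptions of this sketch.

```python
import math

def ols_rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(bj * xj for bj, xj in zip(b, X[i]))) ** 2
               for i in range(n))

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant; k = number of coefficients."""
    return n * math.log(rss / n) + 2 * k

def forward_stepwise(candidates, y):
    """Add the SNP that lowers AIC the most; stop when no SNP lowers it."""
    n, selected = len(y), []
    best = aic(ols_rss([], y), n, 1)
    while True:
        trials = [(aic(ols_rss([candidates[s] for s in selected + [name]], y),
                       n, len(selected) + 2), name)
                  for name in candidates if name not in selected]
        if not trials or min(trials)[0] >= best:
            return selected
        best, name = min(trials)
        selected.append(name)

# Hypothetical minor-allele counts and BMI-like outcomes (roughly 20 + 2 * snp_1).
candidates = {"snp_1": [0, 1, 2, 0, 1, 2, 0, 1],
              "snp_2": [1, 0, 1, 0, 2, 0, 1, 1]}
y = [20.1, 21.9, 24.2, 19.8, 22.1, 23.9, 20.0, 22.0]
selected = forward_stepwise(candidates, y)
print(selected[0])  # snp_1 enters first because it explains most of the variance
```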

The LASSO and EN estimates of β were obtained by minimizing

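$$\sum_{i=1}^{n}\left(y_i-\beta_0-\beta_1 x_{i1}-\dots-\beta_p x_{ip}-\gamma_1\,\mathrm{sex}_i-\gamma_2\,\mathrm{age}_i-\gamma_3\,\mathrm{area}_i\right)^2+\lambda_1\sum_{j=1}^{p}|\beta_j|$$

and

$$\sum_{i=1}^{n}\left(y_i-\beta_0-\beta_1 x_{i1}-\dots-\beta_p x_{ip}-\gamma_1\,\mathrm{sex}_i-\gamma_2\,\mathrm{age}_i-\gamma_3\,\mathrm{area}_i\right)^2+\lambda_1\sum_{j=1}^{p}|\beta_j|+\lambda_2\sum_{j=1}^{p}\beta_j^2,$$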

respectively. The tuning parameters λ₁ and λ₂ were chosen by CV. The penalized methods were fitted using the R package “glmnet” [26].

Then, we defined five groups according to the number of CV folds in which each SNP was selected.
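The two penalized criteria can be minimized by coordinate descent with soft-thresholding. The sketch below follows the criteria as written here (residual sum of squares plus λ₁ and λ₂ penalties) and omits the unpenalized covariates for brevity. It is an illustration only: “glmnet” uses a differently scaled parametrization, and the toy data, iteration count, and function names are assumptions.

```python
def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net_cd(X, y, lam1, lam2=0.0, n_iter=200):
    """Minimize sum_i (y_i - b0 - sum_j b_j x_ij)^2
    + lam1 * sum_j |b_j| + lam2 * sum_j b_j^2 (LASSO when lam2 = 0)."""
    n, p = len(X), len(X[0])
    beta, b0 = [0.0] * p, sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = ssq = 0.0
            for i in range(n):
                # Fit without SNP j, so rho uses the partial residual.
                fit = b0 + sum(beta[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit)
                ssq += X[i][j] ** 2
            denom = ssq + lam2
            beta[j] = soft_threshold(rho, lam1 / 2.0) / denom if denom > 0 else 0.0
        # The intercept is not penalized: set it to the mean residual.
        b0 = sum(y[i] - sum(beta[k] * X[i][k] for k in range(p))
                 for i in range(n)) / n
    return b0, beta

# Toy data with one SNP, where y = 1 + x exactly.
X = [[0], [1], [2], [1]]
y = [1.0, 2.0, 3.0, 2.0]

b0_ols, beta_ols = elastic_net_cd(X, y, lam1=0.0)      # slope close to 1.0
b0_lasso, beta_lasso = elastic_net_cd(X, y, lam1=2.0)  # slope shrunk to about 0.5
b0_big, beta_big = elastic_net_cd(X, y, lam1=10.0)     # slope set exactly to 0
```

The last call shows the variable selection property of the λ₁ penalty: once the penalty is large enough, the coefficient is set exactly to zero and the SNP drops out of the model.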

(1) Group 1 (SNPs selected in at least one of the five CV folds)

(2) Group 2 (SNPs selected in at least two of the five CV folds)

(3) Group 3 (SNPs selected in at least three of the five CV folds)

(4) Group 4 (SNPs selected in at least four of the five CV folds)

(5) Group 5 (SNPs selected in all five CV folds)
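The five groups can be derived by counting, for each SNP, the number of CV folds in which it was selected. The sketch below uses hypothetical selections; Group 1 is then the union and Group 5 the intersection of the per-fold selections.

```python
from collections import Counter

def cv_groups(selected_per_fold):
    """Group g holds the SNPs selected in at least g of the CV folds."""
    counts = Counter(snp for fold in selected_per_fold for snp in set(fold))
    k = len(selected_per_fold)
    return {g: sorted(s for s, c in counts.items() if c >= g)
            for g in range(1, k + 1)}

# Hypothetical SNPs selected in each of the five folds.
selected_per_fold = [
    ["rs1", "rs2", "rs3"],
    ["rs1", "rs2"],
    ["rs1", "rs3"],
    ["rs1", "rs2"],
    ["rs1"],
]

groups = cv_groups(selected_per_fold)
for g, snps in groups.items():
    print(g, snps)
# 1 ['rs1', 'rs2', 'rs3']
# 2 ['rs1', 'rs2', 'rs3']
# 3 ['rs1', 'rs2']
# 4 ['rs1']
# 5 ['rs1']
```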

Step 2: Quantitative prediction

To compare the three methods in both variable selection and quantitative prediction, we built the quantitative prediction models with the same methods that were used in the variable selection step. Each prediction model was built from the 7,071 training individuals via 5-fold CV. To compare the performance of the quantitative prediction models, we calculated the MSE of each model on the test set (n = 1,767).
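For reference, the MSE is the average squared difference between observed and predicted values over the test individuals. A minimal sketch with hypothetical BMI values:

```python
def mse(y_true, y_pred):
    """Mean square error between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Hypothetical observed and predicted BMI values for two test individuals.
observed = [24.0, 26.0]
predicted = [25.0, 24.0]
print(mse(observed, predicted))  # 2.5
```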

Results

To create the SNP sets associated with BMI, single-SNP analysis was performed by linear regression with adjustments for sex, age, and area. As shown in Supplementary Fig. 1, we found one significant SNP (rs17178527) after Bonferroni correction (threshold p = 1.45 × 10⁻⁷). rs17178527 of LOC729076 has been reported as a BMI-associated SNP in previous GWASs [23, 27]. In addition, Supplementary Table 1 shows the results of the single-SNP analysis with p-values less than 5.00 × 10⁻⁵. The SNPs that were reported to be associated with BMI in the GWAS catalog are summarized in Supplementary Table 2. The seven SNP sets are summarized in Table 2.
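The Bonferroni threshold can be checked with a line of arithmetic. This sketch assumes a family-wise significance level of 0.05, which the text does not state, and uses the number of SNPs given in the caption of Supplementary Fig. 1.

```python
n_tests = 344_893  # number of SNPs given in the Supplementary Fig. 1 caption
alpha = 0.05       # assumed family-wise significance level (not stated in the text)

threshold = alpha / n_tests
print(f"{threshold:.2e}")  # 1.45e-07
```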

Table 2.

List of the SNP sets

Step 1: Variable selection

Variable selection in each SNP set was performed via 5-fold CV of the training set. Fig. 2 shows the number of selected SNPs shared across the CV folds for each variable selection method. In addition, Table 3 provides more detailed information. Overall, SLR selected fewer SNPs than LASSO and EN. All SNPs were selected when EN was used in ASIAN-100, ASIAN-200, and KOREAN-200.

Fig. 2

Venn diagrams showing the selected SNPs shared across the 5-fold CV for each variable selection method. CV, cross-validation; SNP, single-nucleotide polymorphism; ASIAN-100, genome-wide association study (GWAS) catalog (Asia) + single-SNP analysis; KOREAN-100, GWAS catalog (Korea) + single-SNP analysis; ALL-200, GWAS catalog (All) + single-SNP analysis; ASIAN-200, GWAS catalog (Asia) + single-SNP analysis; KOREAN-200, GWAS catalog (Korea) + single-SNP analysis; GWAS-ALL, GWAS catalog (All); GWAS-ASIAN, GWAS catalog (Asia); SLR, stepwise linear regression; LASSO, least absolute shrinkage and selection operator; EN, Elastic-Net.

Table 3.

The number of overlapping SNPs selected by 5-fold CV for each variable selection method

Step 2: Quantitative prediction

We built quantitative prediction models based on SLR, LASSO, and EN using the entire training dataset. Then, the MSE was calculated by applying the quantitative prediction models to the test dataset. Table 4 and Fig. 3 show the performance of each quantitative prediction model in the test dataset. The model using only covariates yielded an MSE value of 10.24. As can be seen from Fig. 3, the prediction models built from Group 5 yielded the smallest MSE. Fig. 4 compares the numbers of SNPs and the MSEs of the prediction models built using SLR.

Table 4.

MSE values from test dataset

Fig. 3

MSE values for each SNP set. The x-axis shows the group, that is, the minimum number of CV folds in which a selected variable appeared. Groups 1 and 5 correspond to models built from the union and the intersection, respectively, of the variables selected across the CV folds. The gray, orange, and blue bars indicate SLR, LASSO, and EN, respectively, and the black line marks the MSE value of 10.24 from the prediction model using only covariates. MSE, mean square error; CV, cross-validation; SNP, single-nucleotide polymorphism; ASIAN-100, genome-wide association study (GWAS) catalog (Asia) + single-SNP analysis; KOREAN-100, GWAS catalog (Korea) + single-SNP analysis; ALL-200, GWAS catalog (All) + single-SNP analysis; ASIAN-200, GWAS catalog (Asia) + single-SNP analysis; KOREAN-200, GWAS catalog (Korea) + single-SNP analysis; GWAS-ALL, GWAS catalog (All); GWAS-ASIAN, GWAS catalog (Asia); SLR, stepwise linear regression; LASSO, least absolute shrinkage and selection operator; EN, Elastic-Net.

Fig. 4

Comparison of the results for variables selected by the different methods, with the prediction model built using SLR. MSE, mean square error; SNP, single-nucleotide polymorphism; ASIAN-100, genome-wide association study (GWAS) catalog (Asia) + single-SNP analysis; KOREAN-100, GWAS catalog (Korea) + single-SNP analysis; ALL-200, GWAS catalog (All) + single-SNP analysis; ASIAN-200, GWAS catalog (Asia) + single-SNP analysis; KOREAN-200, GWAS catalog (Korea) + single-SNP analysis; GWAS-ALL, GWAS catalog (All); GWAS-ASIAN, GWAS catalog (Asia); SLR, stepwise linear regression; LASSO, least absolute shrinkage and selection operator; EN, Elastic-Net.

Among all sets, selecting variables with LASSO and building the model with SLR (LASSO-SLR) gave the smallest MSE value of 9.64 in ASIAN-100, with 51 SNPs. Among these 51 SNPs, 28 SNPs were mapped to genes (Table 5). Some genes, such as FTO, GP2, AKAP6, ANKS1B, ADCY3, and ADCY8, have been reported to be associated with BMI [28, 29, 30, 31, 32, 33].
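To put the best MSE in context, the relative reduction against the covariate-only model can be computed from the two values reported above. This is simple arithmetic on the reported numbers, not an additional result of the study.

```python
mse_covariates_only = 10.24  # model using only sex, age, and area
mse_best = 9.64              # LASSO-SLR model in ASIAN-100

relative_reduction = (mse_covariates_only - mse_best) / mse_covariates_only
print(f"{relative_reduction:.1%}")  # 5.9%
```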

Table 5.

Development of LASSO-SLR prediction model with one set from ASIAN-100 for predicting BMI

Discussion

In this study, we used statistical methods (SLR, LASSO, and EN) to select variables and build quantitative prediction models. We then compared the performance of the quantitative prediction models for each SNP set (ASIAN-100, KOREAN-100, ALL-200, ASIAN-200, KOREAN-200, GWAS-ALL, and GWAS-ASIAN). The prediction models using both the GWAS catalog and KARE data performed better than those using only SNPs reported in the GWAS catalog. The smallest MSE value, 9.64, was obtained when variables were selected using LASSO in ASIAN-100 and the prediction model was built using SLR; this model contained 51 SNPs. The model with the fewest SNPs was obtained by selecting variables using SLR from ALL-200 and building the model using SLR; it contained 38 SNPs and had an MSE value of 9.84.

We developed the quantitative prediction models through 5-fold CV. Across groups 1 to 5, the MSE values were small when the model was built from SNPs that were selected in all CV folds (group 5). When the other groups were used, however, the MSE value was generally larger than that of the model using only the covariates. Therefore, restricting the model to SNPs selected in every CV fold made the quantitative prediction model more accurate. In the variable selection, SLR performed better than the other methods: SLR selected fewer SNPs than the other methods in all SNP sets while providing smaller MSEs. It seems that LASSO and EN tended to select SNPs with little contribution to BMI. For further research, we plan to perform simulation studies and a real-data analysis with other continuous traits.

There are many ways to extend the analysis of quantitative prediction studies. First, by applying recently developed methods, such as bootstrapping methods [34, 35], we will continue to explore new ways to develop prediction models. Second, the incorporation of rare variants can improve the performance of a quantitative prediction model. Advanced sequencing technology has made it possible to investigate the role of common and rare variants in complex disease risk prediction. Third, we can use biological information when choosing the variables. In addition to single-SNP analysis, gene or pathway information can be used to find useful SNPs [36], and an SNP set can then be assembled by adding SNPs from the pathways related to the disease of interest.

Acknowledgments

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI15C2165), and the Bio-Synergy Research Project (2013M3A9C4078158) of the Ministry of Science, ICT and Future Planning through the National Research Foundation. The GWAS chip data were supported by bioresources from the National Biobank of Korea, the Centers for Disease Control and Prevention, Republic of Korea (4845-301, 4851-302 and -307).

References

1. Kooperberg C, LeBlanc M, Obenchain V. Risk prediction using genome-wide association studies. Genet Epidemiol 2010;34:643–652. 20842684.

2. Futreal PA, Liu Q, Shattuck-Eidens D, Cochran C, Harshman K, Tavtigian S, et al. BRCA1 mutations in primary breast and ovarian carcinomas. Science 1994;266:120–122. 7939630.

3. Lancaster JM, Wooster R, Mangion J, Phelan CM, Cochran C, Gumbs C, et al. BRCA2 mutations in primary breast and ovarian cancers. Nat Genet 1996;13:238–240. 8640235.

4. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature 2009;461:747–753. 19812666.

5. Wang WY, Barratt BJ, Clayton DG, Todd JA. Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 2005;6:109–118. 15716907.

6. International Schizophrenia Consortium, Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 2009;460:748–752. 19571811.

7. Machiela MJ, Chen CY, Chen C, Chanock SJ, Hunter DJ, Kraft P. Evaluation of polygenic risk scores for predicting breast and prostate cancer risk. Genet Epidemiol 2011;35:506–514. 21618606.

8. Evans DM, Visscher PM, Wray NR. Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk. Hum Mol Genet 2009;18:3525–3531. 19553258.

9. Janssens AC, van Duijn CM. Genome-based prediction of common diseases: advances and prospects. Hum Mol Genet 2008;17:R166–R173. 18852206.

10. Weedon MN, McCarthy MI, Hitman G, Walker M, Groves CJ, Zeggini E, et al. Combining information from common type 2 diabetes risk polymorphisms improves disease prediction. PLoS Med 2006;3:e374. 17020404.

11. van der Net JB, Janssens AC, Sijbrands EJ, Steyerberg EW. Value of genetic profiling for the prediction of coronary heart disease. Am Heart J 2009;158:105–110. 19540399.

12. Lindström S, Schumacher FR, Cox D, Travis RC, Albanes D, Allen NE, et al. Common genetic variants in prostate cancer risk prediction: results from the NCI Breast and Prostate Cancer Cohort Consortium (BPC3). Cancer Epidemiol Biomarkers Prev 2012;21:437–444. 22237985.

13. Jostins L, Barrett JC. Genetic risk prediction in complex disease. Hum Mol Genet 2011;20:R182–R188. 21873261.

14. Wacholder S, Hartge P, Prentice R, Garcia-Closas M, Feigelson HS, Diver WR, et al. Performance of common genetic variants in breast-cancer risk models. N Engl J Med 2010;362:986–993. 20237344.

15. Hoerl AE. Ridge regression. Biometrics 1970;26:603.

16. Hoerl AE, Kennard RW. Ridge regression: applications to nonorthogonal problems. Technometrics 1970;12:69–82.

17. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970;12:55–67.

18. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol 1996;58:267–288.

19. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 2005;67:301–320.

20. Wei Z, Wang W, Bradfield J, Li J, Cardinale C, Frackelton E, et al. Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease. Am J Hum Genet 2013;92:1008–1012. 23731541.

21. Austin E, Pan W, Shen X. Penalized regression and risk prediction in genome-wide association studies. Stat Anal Data Min 2013;6

22. Cha PC, Mushiroda T, Takahashi A, Kubo M, Minami S, Kamatani N, et al. Genome-wide association study identifies genetic determinants of warfarin responsiveness for Japanese. Hum Mol Genet 2010;19:4735–4744. 20833655.

23. Cho YS, Go MJ, Kim YJ, Heo JY, Oh JH, Ban HJ, et al. A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nat Genet 2009;41:527–534. 19396169.

24. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–D1006. 24316577.

25. Ripley B, Venables B, Bates DM, Hornik K, Gebhardt A, Firth D, et al. Package ‘MASS’. CRAN Repository; 2013. Accessed 2016 Dec 1. Available from: http://cran.r-project.org/web/packages/MASS/MASS.pdf.

26. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010;33:1–22. 20808728.

27. Kim J, Namkung J, Lee S, Park T. Application of structural equation models to genome-wide association analysis. Genomics Inform 2010;8:150–158.

28. Wang KS, Liu X, Owusu D, Pan Y, Xie C. Polymorphisms in the ANKS1B gene are associated with cancer, obesity and type 2 diabetes. AIMS Genet 2015;2:192–203.

29. Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, Lindgren CM, et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 2007;316:889–894. 17434869.

30. Wen W, Cho YS, Zheng W, Dorajoo R, Kato N, Qi L, et al. Meta-analysis identifies common variants associated with body mass index in east Asians. Nat Genet 2012;44:307–311. 22344219.

31. Manning AK, Hivert MF, Scott RA, Grimsby JL, Bouatia-Naji N, Chen H, et al. A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nat Genet 2012;44:659–669. 22581228.

32. Sung YJ, Pérusse L, Sarzynski MA, Fornage M, Sidney S, Sternfeld B, et al. Genome-wide association studies suggest sex-specific loci associated with abdominal and visceral fat. Int J Obes (Lond) 2016;40:662–674. 26480920.

33. Stergiakouli E, Gaillard R, Tavaré JM, Balthasar N, Loos RJ, Taal HR, et al. Genome-wide association study of height-adjusted BMI in childhood identifies functional variant in ADCY3. Obesity (Silver Spring) 2014;22:2252–2259. 25044758.

34. Hall P, Lee ER, Park BU. Bootstrap-based penalty choice for the lasso, achieving oracle performance. Stat Sin 2009;19:449–471.

35. Chatterjee A, Lahiri SN. Bootstrapping Lasso estimators. J Am Stat Assoc 2011;106:608–625.

36. Eleftherohorinou H, Wright V, Hoggart C, Hartikainen AL, Jarvelin MR, Balding D, et al. Pathway analysis of GWAS provides new insights into genetic susceptibility to 3 inflammatory diseases. PLoS One 2009;4:e8068. 19956648.

Supplementary materials

Supplementary data, including two tables and one figure, can be found with this article online at http://www.genominfo.org/src/sm/gni-14-149-s001.pdf.

Supplementary Table 1

Genetic information associated with BMI after adjusting for sex, age, and area (p < 5.00 × 10⁻⁵)

gni-14-149-s001.pdf

Supplementary Table 2

List of SNPs reported by the GWAS catalog

gni-14-149-s002.pdf

Supplementary Fig. 1

Results of single-SNP analysis for 344,893 SNPs in KARE data. (A) QQ plot of p-values derived from linear regression with adjustments for sex, age, and area. The 95% confidence interval around the null hypothesis is displayed as the shaded portion. (B) Manhattan plot of the single-SNP analysis. The red dotted line marks the threshold for genome-wide significance (p = 1.45 × 10⁻⁷). SNP, single-nucleotide polymorphism; KARE, Korea Association Resource; QQ, quantile-quantile.

gni-14-149-s003.pdf

Open access: this article is distributed under the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/).

Table 1.

Demographic variables for KARE cohort

Variable	Total
No. of samples	8,838
Sex (male [%]/female [%])	4,179 (47.3)/4,659 (52.7)
Area (Anseong/Ansan)	4,201/4,637
Age (mean ± SD, yr)	52.22 ± 8.92
BMI (mean ± SD, kg/m²)	24.60 ± 3.12

Table 2.

List of the SNP sets

SNP-set	Description	No. of SNPs (GWAS catalog)	No. of SNPs (KARE)	No. of total SNPs
ASIAN-100	GWAS catalog + KARE	16	84	100
KOREAN-100	GWAS catalog + KARE	1	99	100
ALL-200	GWAS catalog + KARE	136	64	200
ASIAN-200	GWAS catalog + KARE	16	184	200
KOREAN-200	GWAS catalog + KARE	1	199	200
GWAS-ALL	Only reported SNPs in GWAS catalog	136	-	136
GWAS-ASIAN	Only reported SNPs in GWAS catalog	16	-	16

Table 3.

The number of overlapping SNPs selected by 5-fold CV for each variable selection method

SNP-set	Variable selection method	Group 1	Group 2	Group 3	Group 4	Group 5
ASIAN-100	SLR	76	66	61	50	40
	LASSO	86	77	71	66	58
	EN	100	100	100	100	100
KOREAN-100	SLR	82	69	62	55	44
	LASSO	87	77	72	63	57
	EN	100	100	100	100	82
ALL-200	SLR	113	81	67	58	38
	LASSO	174	143	119	99	69
	EN	185	164	134	105	71
ASIAN-200	SLR	156	126	115	100	79
	LASSO	171	154	141	127	110
	EN	200	200	200	200	200
KOREAN-200	SLR	162	128	115	102	79
	LASSO	166	146	136	123	106
	EN	200	200	200	200	200
GWAS-ALL	SLR	67	44	33	25	11
	LASSO	82	58	45	32	20
	EN	85	60	45	35	2
GWAS-ASIAN	SLR	9	8	8	7	4
	LASSO	16	14	14	12	9
	EN	16	14	14	11	9

Table 4.

MSE values from test dataset

SNP-set	Variable selection method	Group	MSE by prediction method: covariates only	SLR	LASSO	EN
ASIAN-100	LASSO	1	10.24	12.81	12.06	11.15
		2	10.24	13.08	12.35	13.02
		3	10.24	13.08	10.77	10.97
		4	10.24	12.81	11.72	11.40
		5	10.24	9.64	9.66	9.69
	EN	1	10.24	9.64	9.70	9.87
		2	10.24	9.64	9.70	9.87
		3	10.24	9.64	9.70	9.87
		4	10.24	9.64	9.70	9.87
		5	10.24	9.64	9.70	9.87
	SLR	1	10.24	19.99	18.05	12.52
		2	10.24	24.72	16.24	19.23
		3	10.24	16.94	16.39	14.67
		4	10.24	15.30	14.44	11.61
		5	10.24	9.75	9.76	9.77
KOREAN-100	LASSO	1	10.24	12.50	13.02	12.40
		2	10.24	12.04	13.99	11.51
		3	10.24	12.47	12.66	11.72
		4	10.24	10.37	14.60	13.55
		5	10.24	9.69	9.70	9.72
	EN	1	10.24	17.78	9.73	13.41
		2	10.24	17.78	9.73	13.41
		3	10.24	17.78	9.73	13.41
		4	10.24	17.78	9.73	13.41
		5	10.24	9.66	9.71	9.77
	SLR	1	10.24	20.47	13.75	12.42
		2	10.24	20.47	13.25	12.28
		3	10.24	18.25	17.19	15.87
		4	10.24	17.60	14.99	11.11
		5	10.24	9.76	9.76	9.77
ALL-200	LASSO	1	10.24	14.84	11.75	12.59
		2	10.24	15.55	12.79	13.20
		3	10.24	15.60	15.48	12.98
		4	10.24	12.86	13.85	12.24
		5	10.24	9.86	9.91	9.92
	EN	1	10.24	15.02	11.59	12.06
		2	10.24	16.10	12.73	12.64
		3	10.24	11.81	13.89	12.86
		4	10.24	13.75	12.57	11.80
		5	10.24	9.87	9.91	9.93
	SLR	1	10.24	16.03	20.81	13.01
		2	10.24	16.14	17.97	18.12
		3	10.24	20.11	18.24	18.42
		4	10.24	20.00	17.80	18.01
		5	10.24	9.84	9.85	9.86
ASIAN-200	LASSO	1	10.24	23.87	16.36	12.95
		2	10.24	13.13	19.14	13.72
		3	10.24	16.07	17.55	17.90
		4	10.24	15.46	14.29	12.47
		5	10.24	9.67	9.73	9.74
	EN	1	10.24	9.80	9.87	10.21
		2	10.24	9.80	9.87	10.21
		3	10.24	9.80	9.87	10.21
		4	10.24	9.80	9.87	10.21
		5	10.24	9.80	9.87	10.21
	SLR	1	10.24	30.37	15.40	16.63
		2	10.24	24.02	21.30	12.51
		3	10.24	23.32	26.35	25.17
		4	10.24	21.84	17.56	18.25
		5	10.24	9.87	9.86	9.87
KOREAN-200	LASSO	1	10.24	23.22	23.37	16.22
		2	10.24	13.33	17.80	13.93
		3	10.24	16.86	15.26	15.54
		4	10.24	18.91	14.15	12.88
		5	10.24	9.71	9.78	9.78
	EN	1	10.24	9.82	9.86	10.23
		2	10.24	9.82	9.86	10.23
		3	10.24	9.82	9.86	10.23
		4	10.24	9.82	9.86	10.23
		5	10.24	9.82	9.86	10.23
	SLR	1	10.24	38.31	18.12	13.61
		2	10.24	37.07	18.31	15.04
		3	10.24	29.48	18.61	16.80
		4	10.24	18.60	16.16	15.84
		5	10.24	9.93	9.93	9.92
GWAS-ALL	LASSO	1	10.24	10.79	10.90	10.84
		2	10.24	10.99	11.23	10.84
		3	10.24	10.99	10.88	10.77
		4	10.24	10.52	10.62	10.56
		5	10.24	10.27	10.27	10.27
	EN	1	10.24	10.84	10.67	10.83
		2	10.24	10.92	11.00	10.66
		3	10.24	10.92	11.24	11.02
		4	10.24	11.00	10.94	10.86
		5	10.24	10.26	10.26	10.26
	SLR	1	10.24	12.21	10.67	10.62
		2	10.24	11.95	11.69	10.74
		3	10.24	11.50	10.92	10.41
		4	10.24	11.36	11.20	10.63
		5	10.24	10.26	10.26	10.25
GWAS-ASIAN	LASSO	1	10.24	10.12	10.23	10.23
		2	10.24	10.12	10.23	10.43
		3	10.24	10.12	10.23	10.43
		4	10.24	10.12	10.42	10.43
		5	10.24	10.12	10.13	10.13
	EN	1	10.24	10.12	10.22	10.35
		2	10.24	10.12	10.35	10.36
		3	10.24	10.12	10.35	10.36
		4	10.24	10.12	10.32	10.33
		5	10.24	10.12	10.13	10.13
	SLR	1	10.24	10.35	10.37	10.36
		2	10.24	10.35	10.35	10.34
		3	10.24	10.35	10.35	10.34
		4	10.24	10.18	10.18	10.18
		5	10.24	10.17	10.17	10.17

MSE, mean square error; SNP, single nucleotide polymorphism; SLR, stepwise linear regression; LASSO, least absolute shrinkage and selection operator; EN, Elastic-Net; ASIAN-100, GWAS catalog (Asia) + single-SNP analysis; KOREAN-100, GWAS catalog (Korea) + single-SNP analysis; ALL-200, GWAS catalog (All) + single-SNP analysis; ASIAN-200, GWAS catalog (Asia) + single-SNP analysis; KOREAN-200, GWAS catalog (Korea) + single-SNP analysis; GWAS-ALL, GWAS catalog (All); GWAS-ASIAN, GWAS catalog (Asia).

Table 5.

Development of LASSO-SLR prediction model with one set from ASIAN-100 for predicting BMI

SNP	β	Region	Gene	SNP	β	Region	Gene
rs17411146	−0.41	Upstream	-	rs11984203	0.19	Intron	NUP205
rs4121165	−0.16	Intron	FAM73A	rs2726602	−0.22	Downstream	TOX
rs12142366	0.27	Intron	ELTD1	rs2721109	−0.18	Upstream	-
rs17130257	−0.26	Downstream	-	rs16904384	0.71	Intron	ADCY8
rs4081366	0.16	Downstream	-	rs10961819	0.18	Upstream	-
rs527248	0.21	Downstream	-	rs4287251	0.64	Intron	-
rs1281296	−0.32	Downstream	ZNF648	rs11000212	0.28	Intron	ASCC1
rs12092943	0.19	Intron	PIK3C2B	rs11193517	−0.26	Downstream	-
rs6545814	0.12	Intron	ADCY3	rs11030104	−0.11	Intron	-
rs12615642	0.11	Intron	-	rs652722	−0.13	Intron	-
rs10207849	0.18	Upstream	-	rs7108746	−0.20	Intron	-
rs11893160	−0.29	Intron	FHL2	rs7107562	0.25	Downstream	-
rs7424822	0.28	Intron	THSD7B	rs402590	0.50	Intron	ANO2
rs9839685	0.52	Intron	ATP2B2	rs4272863	−0.30	Intron	AMN1
rs1399903	0.20	Downstream	-	rs17092358	0.16	Downstream	-
rs4626221	−0.22	Intron	-	rs2373011	0.09	Intron	ANKS1B
rs1491332	−0.21	Downstream	-	rs12229654	−0.15	Upstream	CUX2
rs10056782	0.18	Intron	PPP2R2B	rs2296189	−0.21	CDS	FLT1
rs6893893	−0.20	Intron	ATP10B	rs7995818	−0.11	Downstream	-
rs792965	−0.43	Intron	ERGIC1	rs9569190	0.33	Downstream	-
rs3857596	0.21	Downstream	-	rs10483416	0.22	Intron	AKAP6
rs1342644	0.16	Intron	PEX7	rs12597579	−0.11	Downstream	GP2
rs17178527	−0.28	-	-	rs9939609	0.31	Intron	FTO
rs4509217	0.61	Intron	HECW1	rs633265	0.22	Upstream	-
rs9987062	0.28	Downstream	C7orf66	rs4802919	0.16	Upstream	ZNF480
rs2188187	−0.29	Intron	GRM8

LASSO, least absolute shrinkage and selection operator; SLR, stepwise linear regression; ASIAN-100, GWAS catalog (Asia) + single-SNP analysis; BMI, body mass index; SNP, single-nucleotide polymorphism.