The success of genome-wide association studies (GWASs) has enabled us to improve risk assessment and provide novel genetic variants for diagnosis, prevention, and treatment. However, most variants discovered by GWASs have been reported to have very small effect sizes on complex human diseases, which has been a big hurdle in building risk prediction models. Recently, many statistical approaches based on penalized regression have been developed to solve the “large p and small n” problem. In this report, we evaluated the performance of several statistical methods for predicting a binary trait: stepwise logistic regression (SLR), least absolute shrinkage and selection operator (LASSO), and Elastic-Net (EN). We first built a prediction model by combining variable selection and prediction methods for type 2 diabetes using Affymetrix Genome-Wide Human SNP Array 5.0 from the Korean Association Resource project. We assessed the risk prediction performance using area under the receiver operating characteristic curve (AUC) for the internal and external validation datasets. In the internal validation, SLR-LASSO and SLR-EN tended to yield more accurate predictions than other combinations. During the external validation, the SLR-SLR and SLR-EN combinations achieved the highest AUC of 0.726. We propose these combinations as a potentially powerful risk prediction model for type 2 diabetes.
Genome-wide association studies (GWASs) have successfully identified susceptibility variants associated with human diseases. However, most susceptibility variants have small effect sizes and explain only a small proportion of heritability [
Traditional approaches for disease risk prediction have been based on gene scores (GSs) [
For complex diseases, a more accurate and reliable prediction model is required. Multiple logistic regression (MLR) is a classification method that utilizes combined information across multiple genetic variants. Several studies have shown that the MLR-based approach is useful in building a disease risk prediction model [
As an alternative to MLR, data mining approaches have been used widely to improve risk prediction performance. In particular, support vector machine (SVM) [
As mentioned before, the “large p, small n” problem hinders estimation of the joint effect of multiple genetic variants. To overcome this, various penalized regression approaches have been proposed, such as ridge [
In this study, we investigated the effect of variable selection on the performance of prediction methods. Specifically, we considered the following methods for both variable selection and prediction: stepwise logistic regression (SLR), LASSO, and EN. We compared them by applying them to type 2 diabetes (T2D) GWAS chip data. We constructed the prediction models by combining variable selection and prediction methods using the Korean Association Resource (KARE) GWAS training dataset (3,180 individuals) and then evaluated the risk prediction models through both internal validation (805 individuals in the KARE testing dataset) and external validation (4,723 individuals in an external replication dataset). The external replication dataset was drawn from two cohorts: the Health2 study (1,816 individuals) and the Health Examinee (HEXA) study (3,696 individuals). In both the internal and external validation datasets, we measured the discriminative accuracy of the prediction models using AUC.
The KARE project was initiated in 2007 to undertake a large-scale GWAS with 10,038 participants from two community-based cohorts (i.e., the rural Anseong and urban Ansan cohorts). Among the participants, 10,004 samples were genotyped using the Affymetrix Genome-Wide Human SNP Array 5.0 (Affymetrix, Santa Clara, CA, USA). After sample and SNP quality control, a total of 8,842 individuals were selected from the Anseong (2,374 men and 2,263 women) and Ansan (1,809 men and 2,396 women) cohorts [
In this study, a total of 3,985 samples were selected from among the 8,842 individuals using T2D diagnostic criteria [
We combined two Korean GWASs, the Health2 study (n = 1,816) and the HEXA study (n = 3,696). The Health2 study consists of community-based cohorts from 5 rural areas (i.e., Wonju, Pyeongchang, Gangneung, Geumsan, and Naju), and the HEXA study is a cohort from 14 urban areas. These samples were genotyped using the Affymetrix Genome-Wide Human SNP Array 6.0. The Health2 and HEXA cohorts have been described in previous studies [
Our investigation was based on the analysis of an external replication dataset of 4,723 samples (1,112 T2D subjects, 3,611 normal subjects) for T2D [
For the joint identification of disease susceptibility variants among a large number of SNPs, we extracted SNPs having a strong correlation with T2D via logistic regression for single-variant analysis and collected the list of reported SNPs from a GWAS catalog [
Because the composition of the SNP set appears to affect risk prediction performance, we used two data sources (i.e., the GWAS catalog and the KARE cohort). First, we collected the SNPs,
(1) ALL (SNPs only reported in the GWAS catalog)
(2) ASIAN (SNPs only reported in the GWAS catalog with an Asian population)
(3) KARE (only top-ranked SNPs in the KARE cohort)
(4) ALL + KARE (combined SNPs in the GWAS catalog and KARE cohort)
(5) ASIAN + KARE (combined SNPs in the GWAS catalog with an Asian population and the KARE cohort)
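The combined sets can be illustrated with a minimal sketch. It assumes that a combined set starts from the catalog SNPs and is then filled with top-ranked KARE SNPs until it reaches 100 variants, which matches the counts of 65 + 35 and 25 + 75. The function name `build_snp_set`, the placeholder SNP identifiers, and the duplicate-skipping rule are hypothetical and are not taken from the study.

```python
def build_snp_set(catalog_snps, kare_ranked_snps, total=100):
    """Combine catalog SNPs with top-ranked KARE SNPs up to `total` variants.

    Assumption: catalog SNPs come first, and KARE SNPs (already sorted by
    single-variant p-value) fill the remaining slots, skipping duplicates.
    """
    combined = list(dict.fromkeys(catalog_snps))  # keep order, drop repeats
    seen = set(combined)
    for snp in kare_ranked_snps:
        if len(combined) >= total:
            break
        if snp not in seen:
            combined.append(snp)
            seen.add(snp)
    return combined


# Hypothetical identifiers: 65 catalog SNPs and 200 ranked KARE SNPs,
# one of which (rank 4) is already in the catalog.
catalog = [f"rs_cat{i}" for i in range(65)]
kare_ranked = [f"rs_kare{i}" for i in range(200)]
kare_ranked[3] = "rs_cat0"

all_plus_kare = build_snp_set(catalog, kare_ranked, total=100)
print(len(all_plus_kare))
```

With these inputs the combined set holds the 65 catalog SNPs followed by the 35 best-ranked KARE SNPs that are not already present.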
In the KARE dataset, we separated 3,985 individuals (1,042 T2D subjects, 2,943 normal subjects) into a training set of 3,180 individuals (830 T2D subjects, 2,350 normal subjects) and a test set of 805 individuals (212 T2D subjects, 593 normal subjects) (see
The phenotype
For variable selection, the following SLR was conducted.
The penalized method solves the following:
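For reference, a standard form of the penalized logistic regression objective is sketched below. This is an assumed textbook parametrization (the one used by common elastic-net software), not necessarily the authors' exact notation. The symbols $y_i$ (binary phenotype), $x_i$ (predictor vector), $\beta_0$, $\beta$, $\lambda$, and $\alpha$ are introduced here for illustration.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
\left\{
  -\frac{1}{n}\sum_{i=1}^{n}
  \left[ y_i\left(\beta_0 + x_i^{\top}\beta\right)
  - \log\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right) \right]
  + \lambda \left[ \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right]
\right\}
```

Under this parametrization, $\alpha = 1$ gives the LASSO penalty, $\alpha = 0$ gives the ridge penalty, and $0 < \alpha < 1$ gives EN, with $\lambda \ge 0$ controlling the overall amount of shrinkage.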
Then, we defined five sets as follows:
(1) One set: set of SNPs that have a non-zero coefficient at least one time among the 5-fold CV
(2) Two set: set of SNPs that have a non-zero coefficient at least two times among the 5-fold CV
(3) Three set: set of SNPs that have a non-zero coefficient at least three times among the 5-fold CV
(4) Four set: set of SNPs that have a non-zero coefficient at least four times among the 5-fold CV
(5) Five set: set of SNPs that have a non-zero coefficient in all five folds of the 5-fold CV,
where one set ⊇ two set ⊇ three set ⊇ four set ⊇ five set.
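The construction of these nested sets can be sketched as follows. The function name `frequency_sets` and the toy SNP labels are hypothetical; the input is assumed to be, for each CV fold, the collection of SNPs that received a non-zero coefficient in that fold.

```python
from collections import Counter


def frequency_sets(selected_per_fold):
    """Map k -> set of SNPs selected in at least k folds, for k = 1..n_folds."""
    n_folds = len(selected_per_fold)
    # Count in how many folds each SNP was selected (once per fold at most).
    counts = Counter(snp for fold in selected_per_fold for snp in set(fold))
    return {
        k: {snp for snp, c in counts.items() if c >= k}
        for k in range(1, n_folds + 1)
    }


# Toy example with five folds: rsA is selected in 5 folds, rsB in 3, rsC in 2.
folds = [
    {"rsA", "rsB", "rsC"},
    {"rsA", "rsB"},
    {"rsA", "rsC"},
    {"rsA", "rsB"},
    {"rsA"},
]
sets = frequency_sets(folds)
for k in sorted(sets):
    print(k, sorted(sets[k]))
```

In the toy example, one set and two set coincide, which shows why the nesting is not necessarily strict.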
For construction of the risk prediction model, we considered 9 combinations of variable selection and prediction methods (i.e., SLR-SLR, SLR-LASSO, SLR-EN, LASSO-SLR, LASSO-LASSO, LASSO-EN, EN-SLR, EN-LASSO, and EN-EN). For each combination, we constructed prediction models using the entire KARE training dataset (n = 3,180).
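As a small illustration, the nine combinations are simply all ordered pairs of a variable selection method and a prediction method. The snippet below enumerates them; the variable names are chosen here for illustration only.

```python
from itertools import product

METHODS = ("SLR", "LASSO", "EN")

# First element: variable selection method; second: prediction method.
combinations = [f"{select}-{predict}" for select, predict in product(METHODS, repeat=2)]
print(combinations)
```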
To assess risk prediction performance, we validated the prediction models both internally and externally, using the KARE testing dataset (n = 805) as the internal validation dataset and an external replication dataset (n = 4,723) as the external validation dataset. In both datasets, we used the area under the receiver operating characteristic (ROC) curve (AUC), which is widely used to evaluate risk prediction performance [
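To make the metric concrete, the AUC can be computed as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as one half. The sketch below uses this pairwise definition in plain Python; the function name `auc` and the toy data are illustrative and do not represent the software used in the study.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of case (1) and control (0) scores."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0   # case ranked above control
            elif c == d:
                wins += 0.5   # tie counts as half
    return wins / (len(cases) * len(controls))


# Toy data: two cases and two controls with predicted risks.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

The pairwise loop is quadratic in the sample size, which is acceptable for datasets of a few thousand individuals; a rank-based formula gives the same value more efficiently.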
The association of T2D was analyzed using logistic regression with adjustments for age, sex, area, and BMI as covariates. As shown in
In each SNP set, the variable selection methods were applied to 5-fold CV on the training set.
We fitted the prediction models with SLR, LASSO, and EN using all individuals in the KARE training dataset. Then, we applied the prediction models to the KARE testing dataset and an external replication dataset. The prediction models were built based on Affymetrix 5.0, but the external replication dataset was generated by Affymetrix 6.0. Nearly 90% of the SNPs in the KARE dataset were also present in the external replication dataset. Thus, we did not include untyped SNPs in the evaluation of prediction models using the replication dataset. Among the five SNP sets,
In this study, we compared the performance of risk prediction models combining variable selection and prediction methods. Also, the effect of five SNP sets (i.e., ALL, ASIAN, KARE, ALL + KARE, and ASIAN + KARE) on risk prediction performance was investigated. Overall, we confirmed that prediction models incorporating both demographic variables and genetic variables were more accurate than prediction models using only demographic variables. According to our results, the best combinations were SLR-LASSO and SLR-EN with three set from the KARE SNP set in the KARE testing dataset, whereas the SLR-SLR and SLR-EN combination with one set from the ALL + KARE SNP set outperformed all other combinations in an external replication dataset.
The analysis of risk prediction studies can be extended in several ways. First, the performance of a risk prediction model can be improved by incorporating rare variants. Advances in sequencing technology make it possible to investigate the role of common and rare variants in risk prediction of complex diseases. Wei and Lu [
This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI15C2165), and the Bio-Synergy Research Project (2013M3A9C4078158) of the Ministry of Science, ICT and Future Planning through the National Research Foundation. The GWAS chip data were supported by bioresources from the National Biobank of Korea, the Centers for Disease Control and Prevention, Republic of Korea (4845-301, 4851-302 and -307).
Supplementary data including two tables and one figure can be found with this article online
Genetic loci associated with T2D after adjusting for age, sex, area, and BMI (p < 5.00 × 10⁻⁵)
List of SNPs reported by a GWAS catalog
Results of single-variant analysis for 344,893 SNPs in the KARE cohort. (A) QQ plot of p-values derived from logistic regression with adjustments for age, sex, area, and BMI. The shaded region represents the 95% confidence interval around the null hypothesis. (B) Manhattan plot of association p-values for T2D. The x-axis indicates the physical position of each genetic variant, and the y-axis plots −log₁₀(p-value). The genome-wide significance level is set to 1.45 × 10⁻⁷ and is plotted as a red dotted line. SNP, single-nucleotide polymorphism; KARE, Korean Association Resource; QQ, quantile-quantile; BMI, body mass index; T2D, type 2 diabetes.
Group | Total individuals | T2D group | Normal group |
---|---|---|---|
KARE cohort | |||
No. of subjects | 8,842 | 1,042 | 2,943 |
Sex (male/female) | 4,183/4,659 | 539/503 | 1,355/1,588 |
Age | 52.2 ± 8.9 | 56.4 ± 8.6 | 51.1 ± 8.6 |
BMI | 24.6 ± 3.1 | 25.5 ± 3.3 | 24.1 ± 2.9 |
Area (Anseong/Ansan) | 4,205/4,637 | 531/511 | 1,669/1,274 |
Health2 cohort | |||
No. of subjects | 1,816 | 794 | 770 |
Sex (male/female) | 859/957 | 370/424 | 367/403 |
Age | 60.7 ± 6.6 | 58.5 ± 7.2 | 63.6 ± 4.2 |
BMI | 24.7 ± 3.3 | 25.3 ± 3.2 | 23.9 ± 3.2 |
HEXA cohort | |||
No. of subjects | 3,696 | 318 | 2,841 |
Sex (male/female) | 1,647/2,049 | 203/115 | 1,120/1,721 |
Age | 53.2 ± 8.3 | 58.6 ± 8.0 | 52.2 ± 8.1 |
BMI | 24.0 ± 2.9 | 24.8 ± 2.9 | 23.7 ± 2.8 |
KARE, Korean Association Resource; HEXA, Health Examinee; T2D, type 2 diabetes; BMI, body mass index.
SNP set | Description | GWAS catalog population | No. of GWAS catalog SNPs | No. of KARE SNPs | No. of total variants |
---|---|---|---|---|---|
ALL | Only reported SNPs in GWAS catalog | All populations | 65 | - | 65 |
ASIAN | Only reported SNPs in GWAS catalog | Asian population | 25 | - | 25 |
KARE | Only top SNPs in KARE cohort | - | - | 100 | 100 |
ALL + KARE | GWAS catalog + KARE | All populations | 65 | 35 | 100 |
ASIAN + KARE | GWAS catalog + KARE | Asian population | 25 | 75 | 100 |
SNP, single nucleotide polymorphism; GWAS, genome-wide association study; KARE, Korean Association Resource; ALL, SNPs only reported in the GWAS catalog; ASIAN, SNPs only reported in the GWAS catalog with an Asian population; ALL + KARE, combined SNPs in the GWAS catalog and KARE cohort; ASIAN + KARE, combined SNPs in the GWAS catalog with an Asian population and the KARE cohort.
SNP set | Variable selection method | One set | Two set | Three set | Four set | Five set |
---|---|---|---|---|---|---|
ALL | SLR | 3 | 3 | 1 | - | - |
ALL | LASSO | 6 | 4 | 3 | 2 | 1 |
ALL | EN | 6 | 6 | 5 | 3 | 2 |
ASIAN | SLR | 3 | 3 | 1 | - | - |
ASIAN | LASSO | 6 | 4 | 3 | 2 | 1 |
ASIAN | EN | 6 | 6 | 6 | 2 | 1 |
KARE | SLR | 80 | 59 | 55 | 47 | 37 |
KARE | LASSO | 82 | 75 | 63 | 56 | 48 |
KARE | EN | 100 | 100 | 77 | 68 | 54 |
ALL + KARE | SLR | 72 | 55 | 44 | 39 | 29 |
ALL + KARE | LASSO | 84 | 74 | 61 | 52 | 37 |
ALL + KARE | EN | 100 | 85 | 73 | 59 | 44 |
ASIAN + KARE | SLR | 78 | 59 | 50 | 42 | 31 |
ASIAN + KARE | LASSO | 83 | 76 | 70 | 62 | 50 |
ASIAN + KARE | EN | 83 | 76 | 70 | 62 | 50 |
SNP, single nucleotide polymorphism; CV, cross-validation; ALL, SNPs only reported in the genome-wide association study (GWAS) catalog; SLR, stepwise logistic regression; LASSO, least absolute shrinkage and selection operator; EN, Elastic-Net; ASIAN, SNPs only reported in the GWAS catalog with an Asian population; KARE, Korean Association Resource; ALL + KARE, combined SNPs in the GWAS catalog and KARE cohort; ASIAN + KARE, combined SNPs in the GWAS catalog with an Asian population and the KARE cohort.
SNP set | Variable selection method | Set (No. of SNPs) | Prediction method | No. of SNPs | AUC (KARE testing dataset) | AUC (External replication dataset) |
---|---|---|---|---|---|---|
Only demographic variables | - | - | - | - | 0.715 | 0.561 |
ALL | EN | Two set (6) | EN | 6 | 0.724 | 0.604 |
ASIAN | SLR | Two set (3) | LASSO | 3 | 0.722 | 0.601 |
KARE | SLR | Three set (55) | LASSO | 53 | 0.816 | 0.590 |
KARE | SLR | Three set (55) | EN | 53 | 0.816 | 0.590 |
KARE | SLR | Five set (37) | LASSO | 37 | 0.801 | 0.620 |
KARE | SLR | Five set (37) | EN | 37 | 0.801 | 0.620 |
ALL + KARE | EN | One set (100) | SLR | 51 | 0.774 | 0.726 |
ALL + KARE | SLR | One set (72) | SLR | 51 | 0.772 | 0.726 |
ASIAN + KARE | EN | One set (83) | LASSO | 71 | 0.809 | 0.599 |
ASIAN + KARE | EN | One set (83) | EN | 71 | 0.809 | 0.599 |
ASIAN + KARE | SLR | Three set (50) | SLR | 49 | 0.800 | 0.694 |
KARE, Korean Association Resource; SNP, single nucleotide polymorphism; AUC, area under the receiver operating characteristic curve; ALL, SNPs only reported in the genome-wide association study (GWAS) catalog; EN, Elastic-Net; ASIAN, SNPs only reported in the GWAS catalog with an Asian population; SLR, stepwise logistic regression; ALL + KARE, combined SNPs in the GWAS catalog and KARE cohort; ASIAN + KARE, combined SNPs in the GWAS catalog with an Asian population and the KARE cohort.
Variable | Coefficient | Region | Gene | Variable | Coefficient | Region | Gene |
---|---|---|---|---|---|---|---|
rs2236208 | 14.23 | Intron | - | rs3773506 | 0.25 | UTR-3 | |
rs2236207 | −13.88 | UTR-5 | | rs515071 | −0.24 | Intron | - |
rs2700396 | 13.78 | Intron | | rs10115450 | −0.23 | Intron | |
rs13094803 | −13.56 | Intron | | rs2106294 | −0.23 | Intron | - |
rs9939609 | 2.14 | Intron | | rs6813195 | −0.21 | Down | - |
rs9460546 | 2.10 | Intron | | rs1525739 | −0.21 | Down | |
rs8050136 | −2.07 | Intron | | rs360481 | −0.20 | Intron | - |
rs10946398 | −2.02 | Intron | | rs8181588 | −0.20 | Intron | |
rs11065756 | −1.16 | Intron | | rs623323 | 0.20 | Down | |
rs10849915 | 0.88 | Intron | | rs5015480 | 0.19 | Down | - |
rs2074356 | −0.75 | Intron | | rs10906115 | −0.19 | Up | |
rs11066280 | 0.58 | Down | | rs3132524 | −0.19 | Intron | |
rs6439472 | 0.43 | Up | - | rs5215 | 0.18 | Missense | |
rs11086668 | 0.37 | Intron | | rs1048886 | 0.17 | Missense | |
rs6665139 | −0.34 | Down | - | rs679992 | −0.17 | Intron | |
rs10258075 | 0.34 | Intron | | rs17797882 | −0.16 | Down | |
rs3796439 | −0.33 | Intron | | rs6930576 | 0.16 | Intron | |
rs2444728 | 0.32 | Down | - | rs7403531 | 0.15 | Intron | |
rs16841450 | 0.31 | Intron | | rs1436955 | −0.14 | Down | |
rs6128654 | 0.31 | Intron | | rs1495377 | 0.12 | Intron | |
rs9465871 | −0.30 | Intron | | rs1861612 | 0.11 | Intron | |
rs1801282 | −0.30 | Intron | - | rs831571 | −0.11 | Up | - |
rs2383208 | −0.28 | Down | - | rs17045328 | 0.11 | Intron | |
rs773506 | −0.26 | Down | | SEX | 0.42 | - | - |
rs6882351 | −0.26 | Up | - | AREA | −1.05 | - | - |
rs470089 | 0.25 | Intron | | AGE | 0.11 | - | - |
rs7163430 | −0.25 | Up | | BMI | 0.20 | - | - |
rs4402960 | 0.25 | Intron | | | | | |
SLR, stepwise logistic regression; ALL + KARE, combined SNPs in the genome-wide association study catalog and Korean Association Resource cohort; T2D, type 2 diabetes.