^{1}

^{2}

^{1}

^{2}

^{*}

The achievements of genome-wide association studies have suggested ways to predict diseases, such as type 2 diabetes (T2D), using single-nucleotide polymorphisms (SNPs). Most T2D risk prediction models have used SNPs in combination with demographic variables. However, it is difficult to evaluate the pure additive contribution of genetic variants to classically used demographic models. Since prediction models include some heritable traits, such as body mass index, the contribution of SNPs using unmatched case-control samples may be underestimated. In this article, we propose a method that uses propensity score matching to avoid underestimation by matching case and control samples, thereby determining the pure additive contribution of SNPs. To illustrate the proposed propensity score matching method, we used SNP data from the Korea Association Resources project and reported SNPs from the genome-wide association study catalog. We selected various SNP sets via stepwise logistic regression (SLR), least absolute shrinkage and selection operator (LASSO), and the elastic-net (EN) algorithm. Using these SNP sets, we made predictions using SLR, LASSO, and EN as logistic regression modeling techniques. The accuracy of the predictions was compared in terms of area under the receiver operating characteristic curve (AUC). The contribution of SNPs to T2D was evaluated by the difference in the AUC between models using only demographic variables and models that included the SNPs. The largest difference among our models showed that the AUC of the model using genetic variants with demographic variables could be 0.107 higher than that of the corresponding model using only demographic variables.

Genome-wide association studies (GWASs) have identified many disease-related genetic variants, including numerous single-nucleotide polymorphisms (SNPs). Kooperberg et al. [

Some problems need to be considered when predicting disease risk according to genetic variants, and technologies are available that can help to solve these problems. First, the construction of prediction models suffers from the ‘large p, small n’ problem. That is, the number of genetic variants is much larger than the number of samples, which induces the curse of dimensionality [

Heritability is estimated as the ratio of variance caused by genetic factors to the total phenotypic variance [

For an illustrative example of our approach, we selected T2D as a trait of interest. T2D results from the interactions between environmental factors and genetic factors. Many studies have sought to predict T2D through genetic variants [

Some previous studies on T2D have been conducted using data from the Korea Association Resources (KARE) project [

In this study, we built prediction models for T2D following the methods proposed by Bae and colleagues [

We created three different SNP sets using combinations of variants from the GWAS catalog and statistically significant variants in Koreans [

The KARE project began in 2007 with Ansung and Ansan regional cohorts representative of the general Korean population. The Affymetrix Genome-Wide Human SNP array 5.0 (Affymetrix Inc., Santa Clara, CA, USA) was used to analyze the genotype data from 10,038 participants. After quality control with a Hardy-Weinberg equilibrium p-value < 10^{-6} and genotype call rates less than 95%, and with the exclusion of SNPs with a minor allele frequency < 0.05, a total of 305,799 autosomal SNPs were utilized in this analysis. After eliminating participants with samples having low call rates (less than 96%), contaminated samples, gender inconsistency, serious concomitant illness, and cryptic relatedness, 8,842 samples (4,183 males and 4,659 females) were included in the analysis. Since our study focused on T2D, we selected only T2D patients and controls by excluding 3,863 samples using the T2D diagnostic criteria summarized in

SNPs were selected by two different approaches: from a single-SNP analysis and from the GWAS catalog [

PSM is a statistical matching technique that attempts to estimate the effectiveness of treatments, policies, or other interventions by taking covariates into account [

The caliper is defined by the maximum propensity score difference within the matched pair. Three methods of matching individuals with similar propensity scores are presented based on the concept of the caliper in the R package

Since it was necessary to minimize the loss of data due to the non-matched sample and the homogenization of covariates between controls and cases, we manipulated the caliper (from 0 to 1) by increments of 0.01. We checked the p-values using the paired t-test and the Wilcoxon test to evaluate the homogeneity of the cases’ and controls’ propensity scores at each caliper increment and for each method of choosing the caliper. For each caliper, we conducted 100 experiments. To ensure demographic homogeneity of the case and control group, we only considered calipers for which the p-values of both the paired t-test and the Wilcoxon test were larger than 0.05.

As the GWAS catalog is based on populations of worldwide ancestry, while the KARE dataset is drawn from the Korean population, we carefully constructed three different SNP sets, which we denoted as KARE, GWAS + KARE, and CATAGENE. First, the KARE set consisted of the SNPs chosen by the p-values from a single-SNP analysis with adjustments for sex, age, and BMI. Second, the GWAS + KARE set was a combination of SNPs from the GWAS catalog (May 22, 2019) related to T2D and SNPs from the KARE data analysis. Third, the CATAGENE set was assembled through the steps detailed below. We first selected the genes in the GWAS catalog, and then extracted all SNPs in those genes from the KARE data. After performing a single-SNP analysis, we assembled the CATAGENE set based on the p-values. The SNPs were selected by the p-values of the univariate logistic regression for each SNP. The top 200, 500, and 1,000 SNPs were chosen based on these p-values for the prediction model.

We used only genotyped variants when choosing the candidate SNPs and constructing the prediction models. Therefore, non-genotyped variants were not included in our data, even if they were in the GWAS catalog. We found 132 SNPs in the GWAS catalog [

At first, we randomly selected two-thirds of the samples for the training set, and the remaining third was used for the test set.

The penalized SLR model used the following formula:

In this formula, _{i}_{ij}

The LASSO and EN estimates of β were obtained by minimizing the following formula.

Values of the parameter

The following five groups were then defined:

(1)Group 1: SNPs that appeared at least once in the five-fold CV.

(2)Group 2: SNPs that appeared at least twice in the five-fold CV.

(3)Group 3: SNPs that appeared at least three times in the five-fold CV.

(4)Group 4: SNPs that appeared at least four times in the five-fold CV.

(5)Group 5: SNPs that appeared in every time in the five-fold CV.

These groups represent the sets of candidate SNPs selected by SLR, LASSO and EN, which were used to construct the prediction model.

To make prediction models, we used the same prediction methods (logistic SLR, EN, and LASSO) that were used for variable selection. More specifically, for LASSO, we selected the λ value to be

As described above, we conducted 100 experiments for each caliper. First, we selected the largest caliper for which the maximum value of the experiment’s p-value was > 0.05.

The average sample sizes for nine combinations obtained using three matching methods (‘largest’, ‘smallest’, and ‘random’) and three criteria for the experiment’s p-value (minimum value, maximum value, or first-quartile value >0.05) are shown in

In this study, we used multiple statistical methods (SLR, LASSO, and EN) to select variables and various SNP sets to build prediction models of T2D. Then, we compared the AUCs of the models for each SNP set. The AUCs of the models with both SNPs and demographic covariates were close to those of the models with only covariates. This result suggests that age, sex, and BMI may be good predictors of T2D in our data.

Moreover, to estimate the pure additive contribution of SNPs in our data, we applied PSM to regulate the effects of these demographic variables. When constructing models using PSM, the AUCs of models with both SNPs and covariates were higher than those of models with only covariates. For each SNP set using PSM, we constructed the best models, which had AUC values that were on average 0.051 higher than those of the corresponding models with only demographic variables. In addition, the AUC results suggest that that the prediction of T2D may be improved by up to 0.1 by adding certain SNPs.

The largest improvement obtained by adding SNPs (delta = 0.1070) was found for the model with group 1 of the GWAS + KARE-psmmax 1000 set using the LASSO-LASSO method (variable selection and prediction model).

Some further studies are desirable to extend our study. First, there are multiple ways to match controls with cases. For example, Euclidian distance seems to be a promising way of matching cases and controls [

Conceptualization: TP. Data curation: CP, NJ. Funding acquisition: TP. Methodology: CP, NJ, TP. Writing - original draft: CP. Writing - review & editing: TP.

No potential conflict of interest relevant to this article was reported.

This research was supported by the Bio-Synergy Research Project (2013M3A9C4078158) of the Ministry of Science, ICT and Future Planning through the National Research Foundation and by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI16C2037).

Principle component analysis plot. Demographic variables (sex, age, body mass index) discriminate the type 2 diabetes (T2D) cases from controls. Trait 0 (red), control; Trait 1 (blue), T2D.

Propensity score matching results (matching method = “smallest”). Green boxes represent the p-values of the Wilcoxon test. Blue boxes mean the p-values of the paired t-test. The solid green lines represent the number of matched samples with the caliper as the x-axis. The red line means p = 0.05. The p-values are represented by a log scale.

Propensity score matching results (matching method = “random”). Green boxes represent the p-values of the Wilcoxon test. Blue boxes mean the p-values of the paired t-test. The solid green lines represent the number of matched samples with the caliper as the x-axis. The red line means p = 0.05. The p-values are represented by a log scale.

Propensity score matching results (matching method = “Largest”). Green boxes represent the p-values of the Wilcoxon test. Blue boxes mean the p-values of the paired t-test. The solid green lines represent the number of matched samples with the caliper as the x-axis. The red line means p = 0.05. The p-values are represented by a log scale.

Compare age distribution between before propensity score matching (PSM) and after PSM.

Compare body mass index (BMI) distribution between before propensity score matching (PSM) and after PSM.

Graph of best area under the receiver operating characteristic curve results with caliper = 0.19. KARE, Korea Association Resources; GWAS, genome-wide association study

Graph of best area under the receiver operating characteristic curve results with caliper = 0.21. KARE, Korea Association Resources; GWAS, genome-wide association study

Type 2 diabetes (T2D) diagnostic criteria

T2D group | Normal subjects | |
---|---|---|

Fasting plasma glucose (mg/dL) | ≥ 126 | ≤ 100 |

Glycated hemoglobin (%) | ≥ 6.5 | < 5.7 |

2-Hour postprandial blood glucose (mg/dL) | ≥200 | ≤140 |

History of diabetes | Treatment for T2D | No history of diabetes |

Age of disease onset ≥ 40 y |

Differences between type 2 diabetes cases and controls

Variable | Case | Control | Total |
---|---|---|---|

No. of samples | 1288 | 3687 | 4975 |

Sex (male/female) | 671/617 | 1,679/2,008 | 2,350/2,625 |

Age, mean ± SD (y) | 55.92 ± 8.79 | 49.88 ± 8.31 | 51.44 ± 8.85 |

BMI, mean ± SD (kg/m^{2}) |
25.54 ± 3.27 | 24.09 ± 2.90 | 24.47 ± 3.06 |

SD, standard deviation; BMI, body mass index.

List of SNP sets

SNP sets | Caliper method | No. of total SNPs |
---|---|---|

KARE-200 | - | 200 |

GWAS + KARE-200 | - | 200 |

CATAGENE-200 | - | 200 |

KARE-500 | - | 500 |

GWAS + KARE-500 | - | 500 |

CATAGENE-500 | - | 500 |

KARE-1000 | - | 1,000 |

GWAS + KARE-1000 | - | 1,000 |

CATAGENE-1000 | - | 1,000 |

KARE-psmmax200 | Maximum | 200 |

GWAS + KARE-psmmax200 | Maximum | 200 |

CATAGENE-psmmax200 | Maximum | 200 |

KARE-psmmin200 | Minimum | 200 |

GWAS + KARE-psmmin200 | Minimum | 200 |

CATAGENE-psmmin200 | Minimum | 200 |

KARE-psmmax500 | Maximum | 500 |

GWAS + KARE-psmmax500 | Maximum | 500 |

CATAGENE-psmmax500 | Maximum | 500 |

KARE-psmmin500 | Minimum | 500 |

GWAS + KARE-psmmin500 | Minimum | 500 |

CATAGENE-psmmin500 | Minimum | 500 |

KARE-psmmax1000 | Maximum | 1,000 |

GWAS + KARE-psmmax1000 | Maximum | 1,000 |

CATAGENE-psmmax1000 | Maximum | 1,000 |

KARE-psmmin1000 | Minimum | 1,000 |

GWAS + KARE-psmmin1000 | Minimum | 1,000 |

CATAGENE-psmmin1000 | Minimum | 1,000 |

SNP, single-nucleotide polymorphism; KARE, Korea Association Resources; GWAS, genome-wide association study.

Data description

Training set (cases) | Test set (cases) | |
---|---|---|

Original data | 3,316 (858) | 1,659 (430) |

PSM data)^{a} |
1,626 (813) | 812 (406) |

PSM data^{b} |
1,634 (817) | 816 (408) |

Propensity score matching (PSM) data: dataset using the ‘largest’ maximum method with a caliper of 0.19.

PSM data: dataset using the ‘largest’ minimum method with a caliper of 0.21.

Average sample number when the maximum value of the experiment’s p-values was >0.05

Matching method | Average selected sample number | Caliper |
---|---|---|

Largest | 2,506 | 0.19 |

Smallest | 2,408 | 0.62 |

Random | 2,450 | 0.19 |

Average sample number when the first-quartile value of the experiment’s p-values was >0.05

Matching method | Average selected sample number | Caliper |
---|---|---|

Largest | 2,512 | 0.21 |

Smallest | 2,439 | 0.75 |

Random | 2,453 | 0.21 |

Average sample number when the minimum value of the experiment’s p-values was >0.05

Matching method | Average selected sample number | Caliper |
---|---|---|

Largest | 2,512 | 0.21 |

Smallest | 2,458 | 0.82 |

Random | 2,455 | 0.22 |

Best results in each SNP set

SNP set | Method^{a} (group) |
Covariates | SNPs + covariates | Delta |
---|---|---|---|---|

KARE-200 | EN-LASSO (5) | 0.7479 | 0.7451 | -0.0029 |

GWAS + KARE-200 | EN-SLR (3) | 0.7479 | 0.7479 | 0 |

CATAGENE-200 | SLR-SLR (4) | 0.7479 | 0.7479 | 0 |

KARE-500 | EN-SLR (5) | 0.7479 | 0.7479 | 0 |

GWAS + KARE-500 | EN-SLR (5) | 0.7479 | 0.7479 | 0 |

CATAGENE-500 | EN-SLR (4) | 0.7479 | 0.7479 | 0 |

KARE-1000 | EN-SLR (5) | 0.7479 | 0.7479 | 0 |

GWAS + KARE-1000 | EN-SLR (4) | 0.7479 | 0.7479 | 0 |

CATAGENE-1000 | SLR-LASSO (4) | 0.7479 | 0.7479 | 0 |

KARE-psmmax200 | LASSO-SLR (1) | 0.5379 | 0.5585 | 0.0206 |

GWAS + KARE-psmmax200 | SLR-SLR (1) | 0.5379 | 0.5964 | 0.0585 |

CATAGENE-psmmax200 | EN-LASSO (5) | 0.5379 | 0.538 | 0.0001 |

KARE-psmmax500 | LASSO-LASSO (5) | 0.5379 | 0.5604 | 0.0225 |

GWAS + KARE-psmmax500 | EN-EN (2) | 0.5379 | 0.5645 | 0.0265 |

CATAGENE-psmmax500 | EN-EN (3) | 0.5379 | 0.5792 | 0.0413 |

KARE-psmmax1000 | EN-EN (2) | 0.5379 | 0.5461 | 0.0082 |

GWAS + KARE-psmmax1000 | LASSO-LASSO (1) | 0.5379 | 0.6449 | 0.107 |

CATAGENE-psmmax1000 | LASSO-EN (3) | 0.5379 | 0.562 | 0.0241 |

KARE-psmmin200 | LASSO-EN (3) | 0.4808 | 0.5458 | 0.065 |

GWAS + kare-psmmin200 | SLR-SLR (2) | 0.4808 | 0.5783 | 0.0975 |

CATAGENE-psmmin200 | EN-EN (3) | 0.4808 | 0.5505 | 0.0698 |

KARE-psmmin500 | EN-LASSO (2) | 0.4808 | 0.5222 | 0.0314 |

GWAS + kare-psmmin500 | SLR-SLR (1) | 0.4808 | 0.5507 | 0.0699 |

CATAGENE-psmmin500 | EN-EN (2) | 0.4808 | 0.5584 | 0.0777 |

KARE-psmmin1000 | LASSO-LASSO (3) | 0.4808 | 0.5244 | 0.0437 |

GWAS + kare-psmmin1000 | EN-EN (3) | 0.4808 | 0.5374 | 0.0566 |

CATAGENE-psmmin1000 | EN-LASSO (2) | 0.4808 | 0.5604 | 0.0696 |

SNP, single-nucleotide polymorphism; KARE, Korea Association Resources; EN, elastic-net; LASSO, least absolute shrinkage and selection operator; GWAS, genome-wide association study; SLR, stepwise logistic regression.

Method: variable selection-prediction model.

SNPs and gene locations in the GWAS + KARE psmmax top1000 LASSO-LASSO model

SNP | Gene | SNP | Gene |
---|---|---|---|

rs4275659 | rs5215 | ||

rs2838820 | rs8181588 | ||

rs515071 | rs163177 | ||

rs919115 | rs4731420 | ||

rs1048886 | rs4607103 | ||

rs12924439 | rs6445525 | ||

rs9460546 | rs8032675 | ||

rs7767391 | rs3761980 | ||

rs2328549 | rs254271 | ||

rs10870527 | rs7403531 | ||

rs12075929 | rs7593730 | ||

rs17045328 | rs10030238 | ||

rs17072023 | rs11855644 | ||

rs2845573 | rs12440511 | ||

rs1799884 | rs560792 | ||

rs780094 | rs9552911 | ||

rs1470579 | rs8192675 | ||

rs864745 | rs2548724 | ||

rs4275659 | rs10933537 |

SNP, single-nucleotide polymorphism.