Over the past decade, the detection of gene-gene interactions has become increasingly popular in the field of genome-wide association studies (GWASs). The goal of a GWAS is to identify genetic susceptibility to complex diseases by assaying and analyzing hundreds of thousands of single-nucleotide polymorphisms (SNPs). However, tests for gene-gene interactions are computationally demanding and methodologically challenging. Recently, a simple but powerful method, named “BOolean Operation-based Screening and Testing” (BOOST), was proposed for genome-wide gene-gene interaction analyses. BOOST was designed with a Boolean representation of genotype data and is approximately equivalent to the log-linear model. It is extremely fast, and genome-wide gene-gene interaction analyses can be completed within a few hours. However, BOOST cannot adjust for covariate effects, and its type-1 error is not correctly controlled. Thus, we considered a two-step approach for gene-gene interaction analyses. First, we screened gene-gene interactions with BOOST; second, we applied logistic regression with covariate adjustments to the selected pairs. We applied the two-step approach to type 2 diabetes (T2D) in the Korea Association Resource (KARE) cohort and identified some promising pairs of SNPs associated with T2D.

The concept of epistasis, generally defined as interactions among different genes, was first introduced in 1909 by William Bateson to describe the latent effect of one locus over another locus. A quantitative definition of the interaction was proposed in 1918 by R.A. Fisher as a statistical deviation from the additive effects of two loci on a phenotype. This definition enabled interaction analyses by testing whether products of multiple genotypes are statistically associated with phenotypes. Further definitions of gene-gene interaction have been proposed, but some are still not clearly understood. Statistical gene-gene interaction has often been confused with biological gene-gene interaction. In particular, inference on a biological mechanism is complicated by the lack of direct correspondence between statistical and biological interactions [

Methods for detecting gene-gene interactions have attracted much attention in genome-wide association studies (GWASs). In addition to logistic regression analysis, new methods, such as comparing linkage disequilibrium (LD) between case and control groups, have recently been proposed [

In this paper, we considered the BOOST method, proposed by Wan et al. [

Here, we propose an efficient strategy that combines the BOOST screening stage with logistic regression. Logistic regression generally shows good statistical power for a wide spectrum of epistasis. BOOST is a computationally efficient screening method, with which a genome-wide search can be completed within a few hours. A follow-up stage of logistic regression with covariates would improve the statistical power of the analysis. In this paper, we first review the BOOST method and then apply the proposed two-stage approach to type 2 diabetes (T2D) in a Korean population. This genome-wide analysis of gene-gene interactions with BOOST was completed within 42 hours, and we identified several pairs of SNPs associated with T2D.

T2D is the most common form of diabetes, and unlike people with type 1 diabetes, T2D patients make insulin. However, either their pancreas does not make enough insulin or the body cannot use the insulin well enough. The prevalence of T2D has increased rapidly in recent years. The prevalence of T2D in Korea was estimated to be 7.3% (in people over 20 years of age) in 2005, and the rate of patients with T2D is expected to increase dramatically from 7.08% in 2010 to about 10.85% by 2030 [

The KARE project started in 2007 and recruited 10,000 participants aged 49 to 60 years from Ansung and Ansan, in Gyeonggi Province, South Korea. About 500,000 autosomal SNPs were genotyped with the Affymetrix Genome-Wide Human SNP array 5.0 [

We excluded SNPs if the p-values for Hardy-Weinberg equilibrium were less than $10^{-5}$, the genotype calling rates were less than 95%, or the minor allele frequencies were less than 0.05. We also eliminated subjects with gender inconsistencies, those whose identity by state was more than 0.8, and those whose calling rates were less than 95%. As a result, we analyzed 8,773 participants (4,117 males and 4,656 females) with 304,245 SNPs.
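The call-rate and minor-allele-frequency filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genotypes are assumed to be coded as 0, 1, or 2 copies of one allele with `None` for a missing call, the list is assumed to be non-empty, and the function names are hypothetical rather than taken from any KARE pipeline.

```python
from typing import Optional, Sequence, Tuple


def call_rate_and_maf(genotypes: Sequence[Optional[int]]) -> Tuple[float, float]:
    """Return (call rate, minor allele frequency) for one SNP.

    Assumption: genotypes are 0/1/2 allele counts, None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if not called:
        return call_rate, 0.0
    # Frequency of the counted allele; the minor allele is the rarer of the two.
    allele_freq = sum(called) / (2 * len(called))
    return call_rate, min(allele_freq, 1.0 - allele_freq)


def passes_snp_filters(genotypes: Sequence[Optional[int]],
                       min_call_rate: float = 0.95,
                       min_maf: float = 0.05) -> bool:
    """Keep a SNP only if its call rate and MAF reach the thresholds in the text."""
    call_rate, maf = call_rate_and_maf(genotypes)
    return call_rate >= min_call_rate and maf >= min_maf
```

The thresholds are passed as parameters so that the same sketch can be reused with other quality-control settings.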

An individual was coded as a T2D patient if he or she satisfied any of the World Health Organization (WHO) diagnostic criteria for diabetes: fasting plasma glucose (glu0) ≥ 126 mg/dL, plasma glucose 2 h after an oral glucose load (glu120) ≥ 200 mg/dL, or glycated hemoglobin (HbA1c) ≥ 6.5%. A total of 1,169 subjects were classified as cases, and the remaining individuals were considered controls.
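The case definition can be written as a small rule. The sketch below is illustrative only: the function name `is_t2d_case` is hypothetical, and treating a missing measurement as not meeting its criterion is an assumption, not something stated for the KARE data.

```python
from typing import Optional

# Thresholds quoted in the text.
FASTING_GLUCOSE_MG_DL = 126.0  # glu0
GLUCOSE_2H_MG_DL = 200.0       # glu120
HBA1C_PERCENT = 6.5            # HbA1c


def is_t2d_case(glu0: Optional[float],
                glu120: Optional[float],
                hba1c: Optional[float]) -> bool:
    """Return True if any available measurement reaches its threshold.

    Assumption: a missing value (None) counts as not meeting that criterion.
    """
    checks = (
        (glu0, FASTING_GLUCOSE_MG_DL),
        (glu120, GLUCOSE_2H_MG_DL),
        (hba1c, HBA1C_PERCENT),
    )
    return any(value is not None and value >= cutoff for value, cutoff in checks)
```

For example, `is_t2d_case(130.0, 150.0, 5.5)` is a case because the fasting glucose criterion alone is met.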

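We assume that there are $L$ SNPs and $n$ samples. Let $X_l$, where $l = 1, \ldots, L$, denote the genotype at the $l$-th SNP, coded as 0, 1, or 2, and let $Y$ denote the disease status (1 for cases and 0 for controls).

The logistic regression model with only main effects for two SNPs $X_p$ and $X_q$, $M_M$, is

$$\log \frac{P(Y = 1 \mid X_p = i, X_q = j)}{P(Y = 0 \mid X_p = i, X_q = j)} = \beta_0 + \beta_i^{X_p} + \beta_j^{X_q}.$$

We denote its log-likelihood value by $L_M$. The full model with both main effects and interaction terms, $M_F$, is

$$\log \frac{P(Y = 1 \mid X_p = i, X_q = j)}{P(Y = 0 \mid X_p = i, X_q = j)} = \beta_0 + \beta_i^{X_p} + \beta_j^{X_q} + \beta_{ij}^{X_p X_q},$$

and we denote its log-likelihood value by $L_F$. The interaction effect is measured by the difference $L_F - L_M$, evaluated at the maximum likelihood estimates.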

However, the difference in the log-likelihood needs very intensive computation for hundreds of billions of pairs of SNPs. Alternatively, there exists a one-to-one correspondence between a logistic regression model and a log-linear model in categorical data analysis [

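On the basis of the equivalence between the log-linear model and its corresponding logistic regression model, BOOST constructs its test statistic from the homogeneous association model $M_H$ and the saturated model $M_S$, whose log-likelihood values are denoted by $L_H$ and $L_S$. For a pair of SNPs $X_p$ and $X_q$, let $n_{ijk}$ be the number of samples with $X_p = i$, $X_q = j$, and $Y = k$, and let $\hat{\pi}_{ijk} = n_{ijk}/n$ be the corresponding observed proportion.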

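If we let $\hat{p}_{ijk}$ be the joint distribution of $(X_p, X_q, Y)$ estimated under $M_H$ and $\hat{\pi}_{ijk} = n_{ijk}/n$ be the observed proportion of samples with $X_p = i$, $X_q = j$, and $Y = k$, i.e., the distribution estimated under $M_S$, the maximized log-likelihoods are, up to a common additive constant,

$$L_S = \sum_{i,j,k} n_{ijk} \log \hat{\pi}_{ijk}, \qquad L_H = \sum_{i,j,k} n_{ijk} \log \hat{p}_{ijk}.$$

The interaction effects based on the likelihood ratio test can be calculated by the following form:

$$L_S - L_H = \sum_{i,j,k} n_{ijk} \log \frac{\hat{\pi}_{ijk}}{\hat{p}_{ijk}} = n \sum_{i,j,k} \hat{\pi}_{ijk} \log \frac{\hat{\pi}_{ijk}}{\hat{p}_{ijk}}.$$

If we let $D_{\mathrm{KL}}(\hat{\pi} \,\|\, \hat{p}) = \sum_{i,j,k} \hat{\pi}_{ijk} \log (\hat{\pi}_{ijk} / \hat{p}_{ijk})$ denote the Kullback-Leibler divergence between $\hat{\pi}$ and $\hat{p}$, then

$$L_S - L_H = n \, D_{\mathrm{KL}}(\hat{\pi} \,\|\, \hat{p}).$$

This provides another interpretation of interactions, in that the difference of the two log-likelihoods is proportional to the Kullback-Leibler divergence between the joint distribution obtained under the saturated model $M_S$ and that obtained under the homogeneous association model $M_H$.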

In particular, there is no closed-form solution for the homogeneous association model $M_H$, so its fitted distribution has to be obtained iteratively.
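To make the cost of the exact fit concrete, the following is a minimal sketch that fits the homogeneous association model by iterative proportional fitting (IPF) and returns twice the log-likelihood difference between the saturated and fitted models. IPF is one standard way to fit a log-linear model without the three-way term; it is used here for illustration and is not claimed to be the procedure implemented in BOOST. The function names, the fixed number of iterations, and the requirement that every two-way margin of the table is positive are assumptions.

```python
import math
from typing import List

# table[i][j][k]: genotype of SNP p, genotype of SNP q, disease status
Table = List[List[List[float]]]


def fit_homogeneous_association(counts: Table, n_iter: int = 200) -> Table:
    """Fit the no-three-way-interaction model by iterative proportional fitting.

    Each cycle rescales the fitted table to match the three two-way margins
    of `counts`. Assumption: every two-way margin of `counts` is positive.
    """
    n_i, n_j, n_k = len(counts), len(counts[0]), len(counts[0][0])
    fit = [[[1.0] * n_k for _ in range(n_j)] for _ in range(n_i)]
    for _ in range(n_iter):
        for i in range(n_i):  # match the (i, j) margin
            for j in range(n_j):
                ratio = sum(counts[i][j]) / sum(fit[i][j])
                for k in range(n_k):
                    fit[i][j][k] *= ratio
        for i in range(n_i):  # match the (i, k) margin
            for k in range(n_k):
                obs = sum(counts[i][j][k] for j in range(n_j))
                cur = sum(fit[i][j][k] for j in range(n_j))
                for j in range(n_j):
                    fit[i][j][k] *= obs / cur
        for j in range(n_j):  # match the (j, k) margin
            for k in range(n_k):
                obs = sum(counts[i][j][k] for i in range(n_i))
                cur = sum(fit[i][j][k] for i in range(n_i))
                for i in range(n_i):
                    fit[i][j][k] *= obs / cur
    return fit


def lrt_interaction_statistic(counts: Table) -> float:
    """Return 2 * sum(n_ijk * log(n_ijk / fitted_ijk)) over non-empty cells."""
    fit = fit_homogeneous_association(counts)
    stat = 0.0
    for plane_c, plane_f in zip(counts, fit):
        for row_c, row_f in zip(plane_c, plane_f):
            for n, m in zip(row_c, row_f):
                if n > 0:
                    stat += 2.0 * n * math.log(n / m)
    return stat
```

Every call runs many passes over the table, which is affordable for one pair of SNPs but becomes the bottleneck when repeated for billions of pairs.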

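To address this issue, BOOST approximates the joint distribution under $M_H$ by using the Kirkwood superposition approximation (KSA) [

The KSA of the joint distribution is

$$\hat{p}^{\mathrm{KSA}}_{ijk} = \frac{1}{\eta} \, \frac{\hat{\pi}_{ij\cdot} \, \hat{\pi}_{i \cdot k} \, \hat{\pi}_{\cdot jk}}{\hat{\pi}_{i \cdot \cdot} \, \hat{\pi}_{\cdot j \cdot} \, \hat{\pi}_{\cdot \cdot k}}, \qquad \eta = \sum_{i,j,k} \frac{\hat{\pi}_{ij\cdot} \, \hat{\pi}_{i \cdot k} \, \hat{\pi}_{\cdot jk}}{\hat{\pi}_{i \cdot \cdot} \, \hat{\pi}_{\cdot j \cdot} \, \hat{\pi}_{\cdot \cdot k}},$$

where $\hat{\pi}_{ijk}$ is the observed proportion of samples with $X_p = i$, $X_q = j$, and $Y = k$, a dot denotes summation over the corresponding index, and $\eta$ is a normalization term. Because $\hat{p}^{\mathrm{KSA}}$ contains no three-way interaction term, its log-likelihood $L_{KSA}$ cannot exceed $L_H$, so that $L_S - L_{KSA}$ is an upper bound of $L_S - L_H$.

Therefore, $L_{KSA}$ can be utilized to approximate $L_H$ [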

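In the screening stage of BOOST, all pairwise interactions are evaluated by using KSA. If $2(L_S - L_{KSA})$, twice the difference between the log-likelihood of the saturated model and that of the KSA, exceeds a prespecified threshold $\tau$, the pair of SNPs is passed on to the testing stage; otherwise, it is filtered out.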
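The screening rule can be sketched as follows. The code assumes the usual form of the Kirkwood superposition approximation, in which the product of the three two-way margins is divided by the product of the three one-way margins and then normalized, and it takes the screening statistic to be $2n$ times the Kullback-Leibler divergence between the observed proportions and the KSA distribution. The function names are hypothetical, and all one-way margins of the table are assumed to be positive.

```python
import math
from typing import List

# table[i][j][k]: genotype of SNP p, genotype of SNP q, disease status
Table = List[List[List[float]]]


def ksa_screening_statistic(counts: Table) -> float:
    """Return 2 * n * KL(observed proportions || KSA distribution)."""
    n_i, n_j, n_k = len(counts), len(counts[0]), len(counts[0][0])
    n = sum(sum(sum(row) for row in plane) for plane in counts)
    pi = [[[counts[i][j][k] / n for k in range(n_k)] for j in range(n_j)]
          for i in range(n_i)]
    # Two-way margins of the observed proportions.
    pi_ij = [[sum(pi[i][j]) for j in range(n_j)] for i in range(n_i)]
    pi_ik = [[sum(pi[i][j][k] for j in range(n_j)) for k in range(n_k)]
             for i in range(n_i)]
    pi_jk = [[sum(pi[i][j][k] for i in range(n_i)) for k in range(n_k)]
             for j in range(n_j)]
    # One-way margins (assumed positive).
    pi_i = [sum(pi_ij[i]) for i in range(n_i)]
    pi_j = [sum(pi_ij[i][j] for i in range(n_i)) for j in range(n_j)]
    pi_k = [sum(pi_ik[i][k] for i in range(n_i)) for k in range(n_k)]
    # Unnormalized KSA and its normalization term eta.
    raw = [[[pi_ij[i][j] * pi_ik[i][k] * pi_jk[j][k]
             / (pi_i[i] * pi_j[j] * pi_k[k])
             for k in range(n_k)] for j in range(n_j)] for i in range(n_i)]
    eta = sum(sum(sum(row) for row in plane) for plane in raw)
    stat = 0.0
    for i in range(n_i):
        for j in range(n_j):
            for k in range(n_k):
                p = pi[i][j][k]
                if p > 0:
                    stat += 2.0 * n * p * math.log(p * eta / raw[i][j][k])
    return stat


def passes_screening(counts: Table, tau: float = 30.0) -> bool:
    """Keep the SNP pair for the testing stage if the statistic reaches tau."""
    return ksa_screening_statistic(counts) >= tau
```

Unlike an iterative fit, this statistic needs only the margins of the contingency table, so it can be computed in a single pass per SNP pair.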

We first applied the BOOST approach to select promising pairs of SNPs with an a priori chosen threshold τ. The value of τ can be selected according to the available computing resources, and we set τ = 30 for our analyses. Because BOOST cannot adjust for the effects of covariates, we then applied logistic regression to the SNP pairs selected by BOOST, with adjustments for sex, age, body mass index (BMI), and the top 10 principal component (PC) scores. The logistic regression analysis was performed with the glm function in R. To calculate the p-value of the interaction term, we compared the two fitted models, with and without the interaction term, by using the anova function in R.
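The model comparison in the second stage amounts to a likelihood ratio test. As a hedged illustration, the sketch below converts the log-likelihoods of the two fitted models into a p-value. It assumes that both SNPs enter the models as three-level genotype factors, so that the interaction contributes four parameters and the statistic is referred to a chi-square distribution with 4 degrees of freedom; this coding is an assumption, although the statistics and p-values reported for the top SNP pairs appear consistent with it. The function names are hypothetical, and the closed-form tail probability used here holds only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x: float, df: int = 4) -> float:
    """Upper-tail probability of a chi-square variable with even df.

    Closed form: exp(-x/2) * sum_{m=0}^{df/2 - 1} (x/2)^m / m!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for m in range(1, df // 2):
        term *= half / m
        total += term
    return math.exp(-half) * total


def interaction_p_value(loglik_full: float, loglik_main: float, df: int = 4) -> float:
    """Likelihood ratio test p-value for the interaction term."""
    statistic = 2.0 * (loglik_full - loglik_main)
    return chi2_sf_even_df(statistic, df)
```

Under the same assumption, `chi2_sf_even_df(30.0)` gives the tail probability that corresponds to the screening threshold τ = 30.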

We have carried out an interaction analysis on T2D in the KARE cohort on the genome-wide scale; 8,773 subjects with 304,245 SNPs were considered for detecting gene-gene interactions. Missing genotypes were imputed with Impute2 [

We applied the proposed two-stage approach to identify genome-wide significant gene-gene interactions. The analyses were completed within 42 h on a desktop with an Intel Core i3-4130 CPU (3.40 GHz). A total of 46,282,357,890 pairwise interactions were evaluated, and the Bonferroni-adjusted genome-wide significance level at 0.05 is 1.08e-12. With BOOST, 229,965 promising pairs of SNPs were selected, and these pairs were then analyzed with logistic regression.
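The number of tests and the Bonferroni threshold quoted above follow directly from the number of SNPs. A short check, with a hypothetical function name:

```python
from typing import Tuple


def bonferroni_pairwise_threshold(n_snps: int, alpha: float = 0.05) -> Tuple[int, float]:
    """Return (number of SNP pairs, Bonferroni-adjusted significance level)."""
    n_pairs = n_snps * (n_snps - 1) // 2  # all unordered pairs of SNPs
    return n_pairs, alpha / n_pairs


n_pairs, threshold = bonferroni_pairwise_threshold(304245)
print(n_pairs)             # 46282357890
print(f"{threshold:.2e}")  # 1.08e-12
```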

To perform adjustments for the population substructure between individuals, we used the EIGENSTRAT [

Most significant results for interaction analyses with KARE datasets are listed in

The analysis of gene-gene interactions on a genome-wide scale is computationally very intensive, and many computational and statistical approaches have recently been proposed to reduce the computational burden. We found that BOOST is computationally highly efficient and can quickly filter out non-significant interaction pairs.
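Much of this efficiency is usually attributed to the Boolean representation of genotype data. The following minimal sketch illustrates the idea and is not the BOOST implementation: each SNP is stored as one bit vector per genotype, and each cell of the genotype-by-genotype-by-status contingency table is obtained with bitwise AND operations followed by a bit count. Python integers are used as bit vectors here; the function names, the 0/1/2 genotype coding without missing values, and the 0/1 status coding are assumptions.

```python
from typing import List, Sequence


def to_bitmasks(values: Sequence[int], n_levels: int) -> List[int]:
    """Encode a variable as one bitmask per level.

    Bit s of masks[v] is set when sample s takes the value v.
    """
    masks = [0] * n_levels
    for sample, value in enumerate(values):
        masks[value] |= 1 << sample
    return masks


def contingency_table(snp_p: Sequence[int],
                      snp_q: Sequence[int],
                      status: Sequence[int]) -> List[List[List[int]]]:
    """Count samples in each (genotype p, genotype q, status) cell with AND + popcount."""
    mp = to_bitmasks(snp_p, 3)
    mq = to_bitmasks(snp_q, 3)
    my = to_bitmasks(status, 2)
    return [[[bin(mp[i] & mq[j] & my[k]).count("1") for k in range(2)]
             for j in range(3)] for i in range(3)]
```

Because the bitmasks of each SNP are built once and reused for every pair, the per-pair work reduces to a handful of word-level operations.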

In this study, we proposed an efficient strategy to identify interactions in genome-wide SNP data. We first utilized the screening stage of BOOST to filter out non-significant pairs and then used logistic regression with several covariates, such as age, sex, BMI, and PC scores.

In the real data analysis, we used the KARE cohort dataset to detect gene-gene interactions for T2D. The smallest p-value (3.11E-11) among the interaction pairs in the KARE data was found for rs1402142 and rs8012584. Because the Bonferroni-adjusted genome-wide significance level is 1.08e-12, this SNP pair is not genome-wide significant. This lack of significance is partially attributable to the insufficient sample size. With advances in genotyping and sequencing technology, genotyping costs will become much lower; therefore, in the near future, sufficiently large samples will become available for gene-gene interaction analyses, which may lead us to a better understanding of human diseases.

Data for this study was provided with biospecimens from National Biobank of Korea, the Centers for Disease Control and Prevention, Republic of Korea (4845-301, 4845-302 and 307), and this work was supported by Research Resettlement Fund for the new faculty of Seoul National University.

SNP 1 | Gene 1 | CHR 1 | Position 1 | Minor allele 1 | Major allele 1 | SNP 2 | Gene 2 | CHR 2 | Position 2 | Minor allele 2 | Major allele 2 | Interaction statistic (logistic regression) | p-value |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rs1402142 | - | 4 | 64970948 | C | A | rs8012584 | - | 14 | 38826243 | A | G | 55.09085092 | 3.11E-11 |
rs1402142 | - | 4 | 64970948 | C | A | rs2183235 | - | 14 | 38828344 | G | C | 54.914105 | 3.39E-11 |
rs1402142 | - | 4 | 64970948 | C | A | rs980010 | - | 14 | 38822190 | A | G | 54.89919559 | 3.41E-11 |
rs1402142 | - | 4 | 64970948 | C | A | rs1958459 | - | 14 | 38811157 | G | A | 54.78768732 | 3.60E-11 |
rs7652843 | - | 3 | 194554885 | A | C | rs224110 | - | 10 | 64551577 | A | T | 54.43510545 | 4.27E-11 |
rs1402142 | - | 4 | 64970948 | C | A | rs7145965 | - | 14 | 38804433 | G | A | 53.5692555 | 6.48E-11 |
rs1402142 | - | 4 | 64970948 | C | A | rs1475516 | - | 14 | 38799740 | G | A | 53.17278157 | 7.84E-11 |
rs1463367 | - | 4 | 48968037 | T | C | rs10899912 | - | 10 | 44296893 | G | A | 51.89018414 | 1.45E-10 |
rs872234 | | 6 | 38289804 | C | T | rs10816769 | | 9 | 111857440 | C | G | 51.60050802 | 1.67E-10 |
rs1864433 | - | 2 | 38007984 | T | A | rs1110144 | | 7 | 148001291 | A | G | 51.3563867 | 1.88E-10 |

SNP, single-nucleotide polymorphism; KARE, Korea Association Resource; CHR, chromosome.

SNP | Gene | Associated genes |
---|---|---|
rs1402142 | - | |
rs1463367 | - | |
rs872234 | | |
rs1864433 | - | |
rs8012584 | - | |
rs2183235 | - | |
rs980010 | - | |
rs1958459 | - | |
rs224110 | - | |
rs7145965 | - | |
rs1475516 | - | |
rs10899912 | - | |
rs10816769 | | |
rs1110144 | | |

SNP, single-nucleotide polymorphism.