^{1}

^{2}

^{2}

^{*}

^{†}

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at:

Mild cognitive impairment (MCI) is a clinical syndrome characterized by the onset and evolution of cognitive impairments, often considered a transitional stage to Alzheimer’s disease (AD). The genetic traits of MCI patients who experience a rapid progression to AD can enhance early diagnosis capabilities and facilitate drug discovery for AD. While a genome-wide association study (GWAS) is a standard tool for identifying single nucleotide polymorphisms (SNPs) related to a disease, it fails to detect SNPs with small effect sizes due to stringent control for multiple testing. Additionally, the method does not consider the group structures of SNPs, such as genes or linkage disequilibrium blocks, which can provide valuable insights into the genetic architecture. To address the limitations, we propose a Bayesian bi-level variable selection method that detects SNPs associated with time of conversion from MCI to AD. Our approach integrates group inclusion indicators into an accelerated failure time model to identify important SNP groups. Additionally, we employ data augmentation techniques to impute censored time values using a predictive posterior. We adapt Dirichlet-Laplace shrinkage priors to incorporate the group structure for SNP-level variable selection. In the simulation study, our method outperformed other competing methods regarding variable selection. The analysis of Alzheimer’s Disease Neuroimaging Initiative (ADNI) data revealed several genes directly or indirectly related to AD, whereas a classical GWAS did not identify any significant SNPs.

Mild cognitive impairment (MCI) is a clinical syndrome characterized by the onset and evolution of cognitive impairments. As 10%–15% of MCI patients develop Alzheimer's disease (AD) annually, MCI is commonly regarded as a transitional stage to AD. Identifying genetic characteristics among MCI patients who experience an accelerated progression to AD is important in enabling early diagnosis and facilitating drug discovery for AD. Genome-wide association studies (GWAS) are a standard tool for identifying single nucleotide polymorphisms (SNPs) associated with specific clinical conditions or outcomes. Researchers can delineate the genetic factors related to the rapid progression from MCI to AD as a phenotype in a GWAS by using the time of conversion from MCI to AD.

As approximately 500,000 to one million candidate SNPs exist, a GWAS deals with high-dimensional data, where the number of variables (SNPs in a GWAS) ^{–8} [

To resolve the multiple testing issue, many statistical methods have been developed in terms of penalization [

We propose a Bayesian bi-level variable selection (BBVS) method to detect SNPs associated with a time-to-event outcome by considering all the SNPs simultaneously and incorporating the group information of the SNP data, based on an accelerated failure time (AFT) model. Our method has two hierarchical levels of variable selection: the first level is group-wise and the second level is element-wise variable selection. In the first level, we identify important groups of variables by employing group inclusion indicators in the AFT model and update the censored event time from its predictive posterior distribution by data augmentation [

The rest of this paper is organized as follows. In the “Methods” section, we discuss our BBVS in the AFT model. In the Results and Discussion section, we present simulation study results to validate and compare the performance of BBVS with other group/bi-level selection methods. We discuss the real data analysis results for Alzheimer’s Disease Neuroimaging Initiative (ADNI) data.

An AFT model is a parametric model to analyze a time-to-event outcome. While Cox regression postulates that covariates are multiplicatively related to the hazard, an AFT model assumes a direct relationship between time to event and covariates, which enables straightforward interpretation of regression coefficients. For the _{i}_{i}_{i1},_{i2}, ⋯, _{ip}_{0},_{1}, ⋯, _{p}_{0}, and _{i}_{i}_{i}

We propose a BBVS method on the AFT model. This method has two hierarchical levels of variable selection, the group-wise and the element-wise variable selection. It is motivated by natural grouping structures of SNPs, which can be captured by genes or LD blocks. By making use of the group structure in the model frame, we can efficiently select a small number of SNPs associated with a time-to-event outcome.

With predefined

where _{1}, ⋯, _{G}_{1}, ⋯, _{G}_{g}_{g}_{g}_{i}^{2}); hence, that the failure time _{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}

Our bi-level variable selection method addresses two issues in the model (1): the selection of the relevant groups of SNPs and the imputation of the censored time to event _{i}_{i}

where _{g}_{i}_{i}

We employ a shrinkage prior on the regression parameters _{g}_{0},_{0},

In the first step, we consider the following conjugate priors.

In the prior, _{g}_{g}_{g}_{g}_{g}^{2}) is given by

By integrating out _{0}, ^{2}, we can obtain the posterior distribution of

For a given _{1}, ⋯, _{g-1},_{g+1}, ⋯, _{G}_{g}

and _{t}^{2}) denotes the probability density function of t-distribution with the degrees of freedom ^{2}. Then, update p_{g} from its posterior distribution Beta (_{g}_{g}

which is proportional to the truncated

By using the full conditional distribution of _{i}_{i}_{i}_{i,j}_{γ}_{(i,j)} is the matrix _{γ}_{γ}^{(i)} be the vector _{i}

where _{wi}_{wi}_{0}-1 are respectively the location, scale, and degrees of freedom parameters. The location and scale parameters are given by

The censored _{i}

After running Gibbs sampling with ^{*}, we consider that any group with ^{*} to control the average Bayesian FDR by using the method proposed by Morris et al. [

In this step, we include all the variables of the _{1}, ⋯, _{Q}

the true signal _{i}

where _{1}, _{2}, ⋯, _{p}

In our model framework, we have prespecified grouping information. In order to get more flexibility depending on the grouping structure, we allow the hyperparameters _{j}_{j}_{g}

where

Denote _{1}, ⋯, _{Q}^{*} is a block diagonal matrix with element matrices

where _{g}_{0}, _{0}, _{0}, _{1}, ⋯, _{g}

1. Update _{0} according to its full conditional distribution

2. Update _{g}

The design matrix

3. Let _{0}+_{0} and ^{2} from

4. Independently sample _{gj}

5. Update _{g}

6. Update _{gj}_{gj}_{gj}_{g}

7. Update _{g}

As the DL prior does not give exactly zero coefficient value, an additional step is needed to select relevant variables. We followed a simple approach to choose important variables using k-means clustering [_{j}_{j}_{i}_{i}

To reveal SNPs associated with the time of conversion to AD from MCI, we analyzed ADNI data obtained from the ADNI database (

We performed quality control (QC) steps on the raw genotype data to ensure that only high-quality data were included in the final analysis. QC procedures include (1) call rate check per subject and per SNP marker, (2) gender check, (3) sibling pair identification, (4) the Hardy-Weinberg equilibrium test, (5) marker removal by the minor allele frequency, and (6) population stratification. The second line preprocessing steps include removal of SNPs with (1) more than 5% missing values, (2) minor allele frequency smaller than 5%, and (3) Hardy-Weinberg equilibrium p-value <10^{-6}. The remaining missing genotype variables were imputed as the modal value. After the QC procedures, 347 subjects and 494,564 SNPs remained in the current study. The above procedures were carried out in PLINK version 1.9. We also calculated the LD blocks to form the SNP-sets and remove SNP-sets with a single SNP. Eventually, 421,823 SNPs were left in our analysis grouped into 16,084 SNP-sets.

We study the subjects diagnosed with MCI at the baseline visit. If an MCI patient does not progress to AD within 48 months from the baseline, we define the time of conversion of the patient as "censored." For non-censored cases, the conversion time is determined by the difference between the baseline and the time of visit when the patient was diagnosed with AD.

We generated simulation data to examine the performance of the BBVS in the AFT model. To convey the correlation structure of SNP data in practice, our SNP data is simulated from the Hapmap projects 2009 phase III data [

We considered two cases: non-censored data and censored data. In the non-censored case, the time to event outcome was generated from the model (1), where
^{*}. The value of ^{*} was set to achieve a desired censoring rate. We replicated the simulation 50 times under the same setting. We assumed the inclusion indicator _{g}

We applied BBVS on the ADNI-1 data to reveal SNPs associated with the time of conversion to AD from MCI. Other than the whole SNPs data, we also included demographic and clinical characteristics measured at the baseline, such as gender, age, handedness, marital status, education length, retirement, and Alzheimer’s Disease Assessment Scale–Cognitive Subscale (ADAS-Cog) score. The first five principal components would adjust for population stratification in the model [

We determined the threshold

_{1,1}, _{1,2}, and _{1,3} of the first selected SNP-set for 5,000 iterations of the MCMC algorithm. They show fast convergence of the algorithm, indicating its good mixing properties.

We summarized the variable selection results of BBVS to present which genes are involved in

For comparison purposes, we conducted two different types of GWASs: (1) a simple GWAS, multiple testing on each SNP and (2) kernel-machine SNP-set GWAS with the linear kernel [^{-5} suggestive significance level, where none of them had been reported in previous GWASs. ^{-8} significance level. The selected SNP-sets vary with the type of kernel. The linear kernel, the IBS kernel, and the quadratic kernel selected 5, 8, and 6 SNP-sets, respectively. Two genes, calmodulin-binding transcription activator 1 (

As the AFT model for non-censored data is the log-normal regression model, we can compare the performance of variable selection with other variable selection methods implemented based on the typical regression models. For competing methods, we considered the group Lasso (grLasso) [

We consider the following performance measurements: true positive rate (

where the

The BBVS also shows satisfactory performance in terms of selecting important variables in the censored case. The average values of

The BBVS was developed to enable bi-level variable selection as incorporating grouping information within covariates in the high-dimensional setting. In the context of GWAS, our method addressed the challenging issues by making use of natural grouping information of SNPs in the group-level variable selection step. In addition, DL priors were adapted to reflect the grouping information in the element-wise variable selection.

The simulation studies showed that our proposed method outperformed other bi-level and group-level variable selection methods in the GWAS setting for a non-censored case. We applied BBVS on the ADNI-1 data to identify relevant SNP-sets associated with the time to develop AD within MCI patients. We identified 106 informative SNPs located within 10 genes, where four genes were directly and indirectly related to AD, while the simple form of GWAS only detected 3 SNPs that had not been reported in the literature. We need to analyze other AD data sets to see if the implicated genes are reproducible when we used different subjects in the future study. We also need to conduct a simulation study to compare the variable selection performance of BBVS with other survival models that enable variable selection for the high-dimensional data.

Conceptualization: EL, JGI, HZ. Data curation: EL. Formal analysis: EL. Funding acquisition: EL, JGI, HZ. Methodology: EL, JGI, HZ. Writing – original draft: EL. Writing – review & editing: EL, JGI, HZ.

No potential conflict of interest relevant to this article was reported.

This material was based on work partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. NRF-2022M3J6A1084843, No. NRF-2021R1C1C1013936). This work was also partially supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant (No. 2020-0-01441, No.RS-2022-00155857, Artificial Intelligence Convergence Research Center (Chungnam National University)). Part of this study has been published as a PhD thesis by the first author under the supervision of the co-authors (Lee E. Advanced Bayesian models for high-dimensional biomedical data. Ph.D. Dissertation. Chapel Hill: The University of North Carolina, 2016).

Supplementary data can be found with this article online at

p-values of the classic GWAS for the SNPs selected by BBVS.

Trace plot of the regression coefficients θ_1,1 of the first selected SNP-set for 5,000 iterations of the MCMC algorithm. SNP, single nucleotide polymorphism; MCM, Markov chain Monte Carlo.

Trace plot of the regression coefficients θ_1,2 of the first selected SNP-set for 5,000 iterations of the MCMC algorithm. SNP, single nucleotide polymorphism; MCM, Markov chain Monte Carlo.

Trace plot of the regression coefficients θ_1,3 of the first selected SNP-set for 5,000 iterations of the MCMC algorithm. SNP, single nucleotide polymorphism; MCM, Markov chain Monte Carlo.

Posterior inclusion probabilities of 16,106 SNP-sets. Our proposed method identified 19 important SNP-sets after Bayesian FDR correction. The solid line shows the FDR criteria, 0.941 in this data. SNP, single nucleotide polymorphism; FDR, false discovery rate.

Estimated coefficient values for 795 SNPs included in the 19 SNP-sets. We colored 106 SNPs selected in the element level in red. SNP, single nucleotide polymorphism.

A Manhattan plot with –log10(p-value) for the classical genome-wide association study. The solid and dotted lines show the 5 × 10^{-8} significance level and the 1 × 10^{-5} significance level, respectively.

LD blocks and genes detected by BBVS

Chr | Begin (bp) | End (bp) | No. of SNPs | No. of selected | Genes |
---|---|---|---|---|---|

2 | 50596 | 50665 | 70 | 13 | |

2 | 53530 | 53576 | 47 | 6 | |

4 | 104778 | 104785 | 8 | 2 | |

4 | 117728 | 117780 | 53 | 4 | |

5 | 135218 | 135255 | 38 | 3 | - |

6 | 154825 | 154859 | 35 | 5 | - |

6 | 166701 | 166774 | 74 | 7 | - |

7 | 181879 | 181955 | 77 | 7 | |

7 | 197664 | 197675 | 12 | 1 | - |

8 | 216741 | 216762 | 22 | 2 | |

8 | 224172 | 224250 | 79 | 11 | |

8 | 228413 | 228427 | 15 | 2 | |

10 | 261007 | 261038 | 32 | 4 | |

12 | 294754 | 294763 | 10 | 0 | |

14 | 332351 | 332416 | 66 | 12 | |

14 | 334919 | 334956 | 38 | 7 | - |

19 | 394149 | 394164 | 16 | 1 | |

20 | 399491 | 399513 | 23 | 5 | - |

22 | 22468984 | 22671741 | 80 | 14 |

LD, linkage disequilibrium; BBVS, Bayesian bi-level variable selection; SNP, single nucleotide polymorphism.

LD blocks and the corresponding SNPs detected by the kernel-machine method

Chr | SNP | Gene | p-value |
||
---|---|---|---|---|---|

Linear | IBS | Quadratic | |||

1 | rs12128469, rs12402763, rs7543711, rs12563394, rs2301461, rs2301462 | 5.00e-09 | 1.00e-04 | 6.00e-04 | |

2 | rs6545731, rs10169309 | 5.00e-09 | 2.00e-04 | 5.00e-09 | |

2 | rs2576778, rs880427 | 4.00e-04 | 5.00e-09 | 2.10e-03 | |

3 | rs9288812, rs10511245, rs2053627 | 5.00e-09 | 3.00e-04 | 6.00e-04 | |

3 | rs6796883, rs293779 | 5.00e-04 | 5.00e-09 | 5.00e-03 | |

3 | rs307560, rs307558 | 2.00e-04 | 5.00e-09 | 1.10e-03 | |

3 | rs6768031, rs1033222, rs1991443, rs1991442, rs1427840, rs11922896, rs6439279, rs748155, rs17275526, rs755568, rs1863916 | 2.00e-04 | 5.00e-09 | 1.10e-03 | |

5 | rs6888634, rs2577531 | 5.00e-09 | 5.00e-09 | 2.00e-04 | |

9 | rs6475646, rs10733377 | 2.00e-04 | 4.00e-04 | 5.00e-09 | |

13 | rs11164144, rs944899 | 5.00e-09 | 1.00e-04 | 1.00e-04 | |

13 | rs9508716, rs1275190, rs1275191, rs9508717, rs1314940 | 5.00e-04 | 1.00e-03 | 5.00e-09 | |

13 | rs3764109, rs9530253 | 2.00e-04 | 5.00e-09 | 4.00e-04 | |

14 | rs11850328, rs174994, rs8014403 | 8.00e-04 | 4.00e-04 | 5.00e-09 | |

16 | rs12149339, rs12933074, rs12934725, rs11861289 | 2.00e-04 | 5.00e-09 | 5.00e-09 | |

17 | rs1990185, rs17772608, rs11077582, rs12951391, rs978425, rs16977009, rs12941303, rs2158917, rs12941651, rs7213040, rs16977023, rs12709255, rs17767678, rs7214582 | AC003051.1 | 8.00e-04 | 5.00e-09 | 1.94e-02 |

21 | rs2831525, rs6516819 | LOC101927973 | 3.00e-04 | 7.00e-04 | 5.00e-09 |

LD, linkage disequilibrium; SNP, single nucleotide polymorphism; IBS, identical by state.

Group-wise variable selection performance of BBVS and other competing methods

TPR | TNR | PPV | NPV | ||
---|---|---|---|---|---|

Group-level | BBVS | 1.000 (0.000) | 1.000 (0.000) | 0.996 (0.003) | 1.000 (0.000) |

gBridge | 0.986 (0.005) | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000) | |

gel | 0.980 (0.007) | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000) | |

grMCP | 0.998 (0.002) | 1.000 (0.000) | 0.972 (0.009) | 1.000 (0.000) | |

grSCAD | 1.000 (0.000) | 1.000 (0.000) | 0.472 (0.007) | 1.000 (0.000) | |

grLASSO | 1.000 (0.000) | 0.990 (0.000) | 0.348 (0.007) | 1.000 (0.000) | |

cMCP | 1.000 (0.000) | 0.963 (0.002) | 0.144 (0.011) | 1.000 (0.000) | |

Element-level | BBVS | 0.686 (0.012) | 0.999 (0.000) | 0.616 (0.009) | 1.000 (0.000) |

gBridge | 0.643 (0.011) | 0.999 (0.000) | 0.503 (0.008) | 1.000 (0.000) | |

gel | 0.651 (0.012) | 0.999 (0.000) | 0.441 (0.009) | 1.000 (0.000) | |

grMCP | 0.306 (0.009) | 0.998 (0.000) | 0.165 (0.009) | 0.999 (0.000) |

BBVS, Bayesian bi-level variable selection; TPR, true positive rate; TNR, true negative rate; PPV, positive predictive value; NPV, negative predictive value.