Bayesian bi-level variable selection for genome-wide survival study
Article information
Abstract
Mild cognitive impairment (MCI) is a clinical syndrome characterized by the onset and evolution of cognitive impairments, often considered a transitional stage to Alzheimer’s disease (AD). The genetic traits of MCI patients who experience a rapid progression to AD can enhance early diagnosis capabilities and facilitate drug discovery for AD. While a genome-wide association study (GWAS) is a standard tool for identifying single nucleotide polymorphisms (SNPs) related to a disease, it fails to detect SNPs with small effect sizes due to stringent control for multiple testing. Additionally, the method does not consider the group structures of SNPs, such as genes or linkage disequilibrium blocks, which can provide valuable insights into the genetic architecture. To address the limitations, we propose a Bayesian bi-level variable selection method that detects SNPs associated with time of conversion from MCI to AD. Our approach integrates group inclusion indicators into an accelerated failure time model to identify important SNP groups. Additionally, we employ data augmentation techniques to impute censored time values using a predictive posterior. We adapt Dirichlet-Laplace shrinkage priors to incorporate the group structure for SNP-level variable selection. In the simulation study, our method outperformed other competing methods regarding variable selection. The analysis of Alzheimer’s Disease Neuroimaging Initiative (ADNI) data revealed several genes directly or indirectly related to AD, whereas a classical GWAS did not identify any significant SNPs.
Introduction
Mild cognitive impairment (MCI) is a clinical syndrome characterized by the onset and evolution of cognitive impairments. As 10%–15% of MCI patients develop Alzheimer's disease (AD) annually, MCI is commonly regarded as a transitional stage to AD. Identifying genetic characteristics among MCI patients who experience an accelerated progression to AD is important in enabling early diagnosis and facilitating drug discovery for AD. Genome-wide association studies (GWAS) are a standard tool for identifying single nucleotide polymorphisms (SNPs) associated with specific clinical conditions or outcomes. Researchers can delineate the genetic factors related to the rapid progression from MCI to AD as a phenotype in a GWAS by using the time of conversion from MCI to AD.
As approximately 500,000 to one million candidate SNPs exist, a GWAS deals with high-dimensional data, where the number of variables (SNPs in a GWAS) p is much greater than sample size n. A classical GWAS conducts several association tests, so called multiple testing, that examine an individual effect of each SNP on a clinical outcome. The classical GWAS has two major limitations. First, GWAS has a multiple testing issue, which requires adequate control of false positives. Typically, the significance level of each test is adjusted by a Bonferroni correction. The significance level widely accepted to determine “genome-wide significant” association is 5 × 10–8 [1,2], which is a strict threshold and makes genome-wide significance difficult to be achieved. Second, the GWAS does not account for the intricate group structure among SNPs such as genes or linkage disequilibrium (LD) blocks. LD reflects how much an allele from a particular genetic variant is associated or inherited with an allele from another nearby genetic variant within the same population [3]. Incorporating the group information within the GWAS would increase statistical power by aggregating small effects of SNPs within a group.
To resolve the multiple testing issue, many statistical methods have been developed in terms of penalization [4-7], Bayesian variable selection [8,9], and sure independence screening strategy [10-12]. Researchers proposed variable selection methods by incorporating the group information to select genetic variants in both gene and SNP levels simultaneously [13-16]. While these methods are suitable for a continuous or binary outcome, only a limited number of studies for a time-to-event outcome are available. Bi et al. [17] developed a saddlepoint approximation implementation to correct p-values based on the Cox regression model. A Bayesian survival model with variable selection was proposed with application to GWAS [18]. Lin et al. [19] proposed kernel-machine SNP-set analysis to assess the group effect of each SNP-set on the survival time.
We propose a Bayesian bi-level variable selection (BBVS) method to detect SNPs associated with a time-to-event outcome by considering all the SNPs simultaneously and incorporating the group information of the SNP data, based on an accelerated failure time (AFT) model. Our method has two hierarchical levels of variable selection: the first level is group-wise and the second level is element-wise variable selection. In the first level, we identify important groups of variables by employing group inclusion indicators in the AFT model and update the censored event time from its predictive posterior distribution by data augmentation [18,20,21]. As this step generates posterior samples of censored time to event, their posterior mean will be used as an imputed value for the censored event time in the second level. In the second level, we only include variables in the selected groups in the first level as covariates in the regression model. To conduct element-wise variable selection, we adapt Dirichlet-Laplace shrinkage priors [22] to incorporate the group structure.
The rest of this paper is organized as follows. In the “Methods” section, we discuss our BBVS in the AFT model. In the Results and Discussion section, we present simulation study results to validate and compare the performance of BBVS with other group/bi-level selection methods. We discuss the real data analysis results for Alzheimer’s Disease Neuroimaging Initiative (ADNI) data.
Methods
Accelerated failure time model
An AFT model is a parametric model to analyze a time-to-event outcome. While Cox regression postulates that covariates are multiplicatively related to the hazard, an AFT model assumes a direct relationship between time to event and covariates, which enables straightforward interpretation of regression coefficients. For the i-th subject,Yi is survival time and Xi = (1,xi1,xi2, ⋯, xip)' is a covariate vector. The first element of 1 allows estimation of the y-intercept. Subsequently, AFT model is given by
BBVS in accelerated failure time model
We propose a BBVS method on the AFT model. This method has two hierarchical levels of variable selection, the group-wise and the element-wise variable selection. It is motivated by natural grouping structures of SNPs, which can be captured by genes or LD blocks. By making use of the group structure in the model frame, we can efficiently select a small number of SNPs associated with a time-to-event outcome.
With predefined G blocks we can write our model as follows.
where
Our bi-level variable selection method addresses two issues in the model (1): the selection of the relevant groups of SNPs and the imputation of the censored time to event yi. In the first step, we identify important groups of variables by updating only the group inclusion vector γ and the censored time yi from their posterior distributions. In the second step, the model (1) can be reduced by
where
We employ a shrinkage prior on the regression parameters θg to enable the element-wise variable selection within
The first step: group-wise variable selection
In the first step, we consider the following conjugate priors.
In the prior,
By integrating out β0, β, σ2, we can obtain the posterior distribution of γ :
For a given γ(g) = (γ1, ⋯, γg-1,γg+1, ⋯, γG), the posterior distribution of γg is the Bernoulli distribution with success probability
and
which is proportional to the truncated n-dimensional multivariate t-distribution with truncation given by (2) as follows.
By using the full conditional distribution of wi for a censored case νi = 0, the censored time wi can be imputed by its posterior mean. Denote
where μwi, swi, and n+ν0-1 are respectively the location, scale, and degrees of freedom parameters. The location and scale parameters are given by
The censored wi will be updated from (4) at each iteration and it will be imputed as their posterior mean in the element-wise selection step.
After running Gibbs sampling with M iterations, posterior inclusion probability can be calculated from the posterior sample of γ as their posterior mean,
The second step: element-wise variable selection
In this step, we include all the variables of the Q selected groups in the first step and assume shrinkage priors on the regression parameters θ1, ⋯, θQ to achieve further sparsity in the element-wise level in the reduced model (3). As a shrinkage prior, the DL prior is assumed and extended to incorporate grouping information. The DL prior has been proposed in [22] as a novel form of shrinkage prior. Under the normal means setting
the true signal θi has a DL prior, which has a hierarchical structure such that
where ϕ = (ϕ1, ϕ2, ⋯, ϕp). To efficiently control the global shrinkage, they introduced global (τ) and local (ϕ) scales, where the local scales have a joint structure such that they lie in the (p-1) dimensional simplex. Under the moderate-sized coefficients with sparse signal setting, their simulation study has shown that the DL prior outperforms least absolute shrinkage and selection operator (Lasso), Bayesian Lasso, empirical Bayes median, and point mass prior, while its performance is similar to that of Horseshoe prior.
In our model framework, we have prespecified grouping information. In order to get more flexibility depending on the grouping structure, we allow the hyperparameters ψj, ϕj, and τ in (5) to be group index(g)-dependent. In the selected group g, there are qg variables and the total number of selected variables in the model (3) is
where
Denote
where B(ϕg) denotes a multivariate Beta function. We propose a Gibbs sampler for posterior computation, which enables parameter estimation and variable selection simultaneously. The Gibbs sampler is computationally efficient and mixes rapidly. We first specified the hyperparameters h0, σ0, ν0, a1, ⋯, ag at appropriate values. Starting from the initiation step, the Gibbs sampler for the model (3) and (7) proceeds as follows:
1. Update β0 according to its full conditional distribution
2. Update θg from its full conditional distribution
The design matrix
3. Let N = n+q+p0+ν0 and
4. Independently sample ψgj from its full conditional distribution
5. Update τg from its full conditional distribution, the generalized inverse Gaussian distribution (giG), such as
6. Update ϕgj, where ϕgj = Tgj/Tg such that
7. Update ag from MN(
As the DL prior does not give exactly zero coefficient value, an additional step is needed to select relevant variables. We followed a simple approach to choose important variables using k-means clustering [22]. Two clusters of |θj |'s can exist, where (a) one cluster has nearly zero coefficient values while (b) another cluster has relatively bigger absolute coefficients away from zero. The clusters (a) and (b) can be considered as noise and signal, respectively. We cluster |θj |'s at each Markov chain Monte Carlo iteration using k-means with k = 2 clusters. At each i-th iteration, the number of important variables hi is set to be the smaller cluster size out of the two clusters. Subsequently, the number of important variables is finally estimated by taking the mode from the whole Markov chain Monte Carlo (MCMC) iterations, i.e.,H = mode{hi}. The H largest elements of the absolute values of posterior medians |θ| are identified as the important variables.
ADNI-1 data
To reveal SNPs associated with the time of conversion to AD from MCI, we analyzed ADNI data obtained from the ADNI database (https://adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. The initial 5-year ADNI study resulted in the ADNI-1 data.
We performed quality control (QC) steps on the raw genotype data to ensure that only high-quality data were included in the final analysis. QC procedures include (1) call rate check per subject and per SNP marker, (2) gender check, (3) sibling pair identification, (4) the Hardy-Weinberg equilibrium test, (5) marker removal by the minor allele frequency, and (6) population stratification. The second line preprocessing steps include removal of SNPs with (1) more than 5% missing values, (2) minor allele frequency smaller than 5%, and (3) Hardy-Weinberg equilibrium p-value <10-6. The remaining missing genotype variables were imputed as the modal value. After the QC procedures, 347 subjects and 494,564 SNPs remained in the current study. The above procedures were carried out in PLINK version 1.9. We also calculated the LD blocks to form the SNP-sets and remove SNP-sets with a single SNP. Eventually, 421,823 SNPs were left in our analysis grouped into 16,084 SNP-sets.
We study the subjects diagnosed with MCI at the baseline visit. If an MCI patient does not progress to AD within 48 months from the baseline, we define the time of conversion of the patient as "censored." For non-censored cases, the conversion time is determined by the difference between the baseline and the time of visit when the patient was diagnosed with AD.
Simulation data
We generated simulation data to examine the performance of the BBVS in the AFT model. To convey the correlation structure of SNP data in practice, our SNP data is simulated from the Hapmap projects 2009 phase III data [35]. For each subject, we randomly combined two haplotypes from the Centre d'étude du polymorphisme humain population to form its genotypes and used PLINK [36] to form SNP-sets by determining LD blocks. Among the blocks that were >30, we randomly selected 2,000 SNP-sets in each block, which results in about 86,000 SNPs. After removing those SNP data with duplicated columns, we have about 45,000 SNPs in total.
We considered two cases: non-censored data and censored data. In the non-censored case, the time to event outcome was generated from the model (1), where
Results and Discussion
ADNI-1 data analysis
We applied BBVS on the ADNI-1 data to reveal SNPs associated with the time of conversion to AD from MCI. Other than the whole SNPs data, we also included demographic and clinical characteristics measured at the baseline, such as gender, age, handedness, marital status, education length, retirement, and Alzheimer’s Disease Assessment Scale–Cognitive Subscale (ADAS-Cog) score. The first five principal components would adjust for population stratification in the model [37]. The variable selection was only performed on the SNP data.
We determined the threshold α to control the average Bayesian FDR [34] and consider any group whose posterior inclusion probability is greater than that of α. In the ADNI data, the threshold is calculated by 0.941 (cFig. 1). In total, 19 SNP-sets were detected as important groups and 106 SNPs were identified by the elementwise-level selection. Fig. 2. shows the estimated coefficient values for 795 SNPs included in the 19 SNP-sets. We colored 106 SNPs selected in the element level in red.
Supplementary Figs. 1–3 show trace plots of the regression coefficients θ1,1, θ1,2, and θ1,3 of the first selected SNP-set for 5,000 iterations of the MCMC algorithm. They show fast convergence of the algorithm, indicating its good mixing properties.
We summarized the variable selection results of BBVS to present which genes are involved in Table 1. Among them, four genes have been reported in other studies to be related to AD directly or indirectly. Dipeptidyl-peptidase 10 (DPP10) is known to modulate the electrophysiological properties, cell-surface expression, and subcellular localization of voltage-gated potassium channels [38]. Chen et al. [39] demonstrated that aggregation of DPP10 was related to neurodegenerative disorders including AD, diffuse Lewy body disease, and fronto-temporal dementia. In addition, DPP10 had robust reactivity within neurofibrillary tangles and plaque-associated dystrophic neurites in AD brains, which suggested that it is involved in the pathology of AD [40]. All the findings indicate that DPP10 is associated with a risk to develop AD in a direct or indirect manner. THSD7B has been reported to be associated with age-related cognitive decline based on repeated measures of 17 cognitive tests [41]. In addition, several linkage mappings have identified VPS26A to be associated with AD [42]. Sidekick cell adhesion molecule 1 was reported as a susceptibility gene for hypertension in Japanese individuals [43], where hypertension moderately increased risk of AD [44].
For comparison purposes, we conducted two different types of GWASs: (1) a simple GWAS, multiple testing on each SNP and (2) kernel-machine SNP-set GWAS with the linear kernel [19]. Fig. 3. shows a Manhattan plot with –log 10(p-value) for the simple GWAS. The solid and dotted lines represent the genome-wide significance level and the suggestive significance level, respectively. Our study identified 9 SNPs at the 1 × 10-5 suggestive significance level, where none of them had been reported in previous GWASs. Supplementary Table 1 shows the p-values of 106 SNP selected by the element-wise variable selection of the proposed method. None of them were significant at the suggestive significance level. For the kernel-machine method, we considered three types of kernel functions such as the linear kernel, the identical by state (IBS) kernel, and the quadratic kernel. Table 2 shows the SNP-sets selected by the kernel-machine method at the 5 × 10-8 significance level. The selected SNP-sets vary with the type of kernel. The linear kernel, the IBS kernel, and the quadratic kernel selected 5, 8, and 6 SNP-sets, respectively. Two genes, calmodulin-binding transcription activator 1 (CAMTA1) and RBFOX1, were related to Alzheimer’s disease. The linear kernel selected an SNP set located within CAMTA1. Huentelman et al. [45] identified SNPs within the CAMTA1 gene that were significantly related to memory performance and memory-related regions on the human brain, which could be considered potential biomarkers of AD. RBFOX1 was identified under the quadratic kernel function. Hooli et al. [46] reported that the gene co-segregates with disease status within early-onset familial AD and early or mixed-onset AD families. There were no overlapped SNPs among the three methods.
Simulation study
As the AFT model for non-censored data is the log-normal regression model, we can compare the performance of variable selection with other variable selection methods implemented based on the typical regression models. For competing methods, we considered the group Lasso (grLasso) [47], the group MCP(grMCP) [48], the group bridge (gBridge) [49], the group exponential lasso (gel) [50], the composite MCP (cMCP) [51] penaltues. The cMCP, gel, and gBridge penalties carry out bi-level selection, meaning that they carry out variable selection at the group level and at the level of individual covariates. The grLasso, grMCP, and grSCAD penalties carry out variable selection only at the group level, meaning that within a group, coefficients will either all be zero or all non-zero. We used Bayesian Information Criteria to select the tuning parameter value for each method.
We consider the following performance measurements: true positive rate (TPR or sensitivity), true negative rate (TNR or specificity), positive predictive value (PPV), and negative predictive value (NPV). They are defined as follows.
where the TP and TN are the number of correctly identified significant variables and the number of correctly rejected non-significant variables, respectively. The FP and FN denote the number of identified non-significant variables and the number of rejected significant variables, respectively. Under the true model, TP = 10, TN = 1990, and FP = FN = 0, which implies that all the four rates are equal to one.
Table 3 shows the group-level and element-level variable selection results for the non-censored case. The average values of the performance measurements are presented with Monte Carlo standard errors in the parenthesis. Our method achieves the highest values of all the criteria,TPR, TNR, NPV, and PPV compared with other group penalty methods by removing the irrelevant groups consistently and selecting important groups very well. As the group penalties with only group-level selection especially grSCAD, grLasso tend to select groups more generously, they select important groups perfectly while the numbers of true positive cases are much bigger than other methods. The bi-level selection penalties, gBridge, and gel show comparative performance to our proposed method. In terms of the element-wise variable selection, our method yields the highest values of all the criteria,TPR, TNR, NPV, and PPV compared with other group penalty methods enabling bi-level selection. As the important signals are sparse, all the bi-level methods perform very well in terms of removing irrelevant signals.
The BBVS also shows satisfactory performance in terms of selecting important variables in the censored case. The average values of TPR, TNR, PPV, and NPV for the group-level selection from the 50 repetition of the simulation are 0.970, 1.000, 1.000, and 1. The corresponding Monte Carlo standard errors are 0.009, 0.000, 0.000, and 0.000. The average values of TPR, TNR, PPV, and NPV for the element-level selection are 0.634, 0.999, 0.589, and 1.000. The corresponding Monte Carlo standard errors are 0.012, 0.000, 0.011, and 0.000. Compared with non-censored cases, the performance of BBVS is satisfactory in the censored case as well.
Conclusion
The BBVS was developed to enable bi-level variable selection as incorporating grouping information within covariates in the high-dimensional setting. In the context of GWAS, our method addressed the challenging issues by making use of natural grouping information of SNPs in the group-level variable selection step. In addition, DL priors were adapted to reflect the grouping information in the element-wise variable selection.
The simulation studies showed that our proposed method outperformed other bi-level and group-level variable selection methods in the GWAS setting for a non-censored case. We applied BBVS on the ADNI-1 data to identify relevant SNP-sets associated with the time to develop AD within MCI patients. We identified 106 informative SNPs located within 10 genes, where four genes were directly and indirectly related to AD, while the simple form of GWAS only detected 3 SNPs that had not been reported in the literature. We need to analyze other AD data sets to see if the implicated genes are reproducible when we used different subjects in the future study. We also need to conduct a simulation study to compare the variable selection performance of BBVS with other survival models that enable variable selection for the high-dimensional data.
Notes
Authors’ Contribution
Conceptualization: EL, JGI, HZ. Data curation: EL. Formal analysis: EL. Funding acquisition: EL, JGI, HZ. Methodology: EL, JGI, HZ. Writing – original draft: EL. Writing – review & editing: EL, JGI, HZ.
Conflicts of Interest
No potential conflict of interest relevant to this article was reported.
Acknowledgements
This material was based on work partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. NRF-2022M3J6A1084843, No. NRF-2021R1C1C1013936). This work was also partially supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant (No. 2020-0-01441, No.RS-2022-00155857, Artificial Intelligence Convergence Research Center (Chungnam National University)). Part of this study has been published as a PhD thesis by the first author under the supervision of the co-authors (Lee E. Advanced Bayesian models for high-dimensional biomedical data. Ph.D. Dissertation. Chapel Hill: The University of North Carolina, 2016).
Supplementary Materials
Supplementary data can be found with this article online at http://www.genominfo.org.