Bayesian mixed models for longitudinal genetic data: theory, concepts, and simulation studies
Article information
Abstract
Despite the success of recent genome-wide association studies investigating longitudinal traits, a large fraction of overall heritability remains unexplained. This suggests that some of the missing heritability may be accounted for by gene-gene and gene-time/environment interactions. In this paper, we develop a Bayesian variable selection method for longitudinal genetic data based on mixed models. The method jointly models the main effects and interactions of all candidate genetic variants and non-genetic factors and has higher statistical power than previous approaches. To account for the within-subject dependence structure, we propose a grid-based approach that models only one fixed-dimensional covariance matrix, which is thus applicable to data where subjects have different numbers of time points. We provide the theoretical basis of our Bayesian method and then illustrate its performance using data from the 1000 Genome Project with various simulation settings. Several simulation studies show that our multivariate method increases the statistical power compared to the corresponding univariate method and can detect gene-time/environment interactions well. We further evaluate our method with different numbers of individuals, variants, and causal variants, as well as different trait-heritability, and conclude that our method performs reasonably well with various simulation settings.
Introduction
In recent years, genome-wide association studies (GWAS) for longitudinal traits (e.g., body weight or cholesterol levels) have been carried out in cohorts, where multiple measurements have been collected from each individual [1-7]. Although GWAS have successfully discovered a large number of novel genetic variants associated with these traits, the identified variants typically account for only a small proportion of overall heritability [8-10]. A presumed explanation for the “missing heritability” is that existing methods have low power to identify gene-gene and gene-time/environment interactions [11]. Since traditional methodologies are limited to the identification of variants with marginal effects using a single measurement per individual, a large amount of useful information in longitudinal data is lost and variants that interact with other variants or have time-varying effects may not be detected [12]. It is more appropriate to analyze multiple variants simultaneously, using all available measurements, for longitudinal genetic studies.
There are methodological challenges associated with the genetic analysis of longitudinal traits for multiple variants. Most complex traits are typically controlled by multiple variants that interact with each other or environmental factors. It may be exceedingly difficult to model all candidate variants with epistasis effect and gene-time/environment interactions for longitudinal traits because genetic data are generally high-dimensional relative to the number of samples. Bayesian multiple quantitative trait loci (QTL) mapping methods [13-16] have been proposed for modeling epistatic effects. Multiple QTL can be simultaneously detected by treating the number of QTL as a random variable using the reversible jump Markov-chain Monte Carlo (MCMC) method [13,14]. Alternatively, multiple QTL can be viewed as a variable selection problem [15,16]. Bayesian model selection approaches are used for identifying QTL with main and epistatic effects [17], as well as QTL that interact with other covariates [18] based on the composite model space framework. These approaches use a fixed-dimensional parameter space by setting an upper bound on the number of detectable QTL and introduce latent binary variables for deciding which variables will be included in the model. This technique reasonably reduces the model space using efficient MCMC algorithms. For multiple QTL mapping with multivariate traits, Banerjee et al. [19] extended the Bayesian variable selection method of Yi [16] via a model that allows different genetic models for different traits. This method provides a multiple QTL mapping strategy for correlated traits, but it does not account for the dependence structure among repeated measurements from each subject.
Several statistical methods have been proposed for dealing with within-subject variation. For data collected at the same time points across all individuals, the measured values at each time point can be treated as one variable. The data can then be treated as multivariate outcomes and jointly analyzed [20-23]. For data collected at different time points across some or all individuals, the measured values cannot be effectively grouped; thus, standard multivariate analysis is no longer applicable. Alternatively, mixed models are used for longitudinal data to map QTL [24]. Mixed models are flexible in modeling such unbalanced data because they allow non-constant correlations among observations. Chung and Zou [25] developed a Bayesian multiple association-mapping algorithm based on a mixed model with a built-in variable selection feature. It models multiple genes simultaneously and allows gene-gene and gene-time/environment interactions for repeatedly measured phenotypes. However, in that model, we made the strong assumption that the covariance matrix is known up to a constant. We plan to relax that assumption here.
In this paper, we develop a Bayesian variable selection method for longitudinal data where phenotypes are not measured at a fixed set of time points for all samples. It jointly models the main and pairwise interactions of all candidate genetic variants. We propose a novel grid-based approach to parsimoniously model each subject's covariance matrix as a function of a covariance matrix defined on a set of pre-selected time points where each observed time point is mapped to its two adjacent grid time points via linear interpolation. This approach thus deals only with a covariance matrix of a fixed dimension. The covariance matrix is then modeled nonparametrically using the modified Cholesky decomposition of Chen and Dunson [26], which facilitates the use of normal conjugate priors. The deviance information criterion (DIC) and the Bayesian predictive information criterion (BPIC) are proposed for the selection of an optimal number of grid points. The paper is organized as follows. In the Methods section, we introduce a novel grid-based Bayesian method for longitudinal genetic data and provide its theoretical basis. In the Results section, we show numerous simulation results using whole-genome sequencing data from the 1000 Genome Project to evaluate the performance of the proposed methods and assess the effects of sample size, number of variants, causal variants, and heritability. We conclude the paper with some discussions on the proposed methods and future research.
Methods
Genotype data
For our simulation studies, we utilized the whole-genome sequencing data from the 1000 Genome Project, which created a catalogue of common human variations using samples from people who provided open consent who declared themselves healthy. It ran between 2008 and 2015, generating a large public catalogue of human variations and genotype data. We randomly selected 400 out of 504 individuals of East Asian (EAS) ancestry from the 1000 Genome Project data (phase 3 version 5) and then removed single-nucleotide polymorphisms (SNPs) with a minor allele frequency <5% and p(Hardy-Weinberg equilibrium) <10-6, which resulted in 6, 247, 288 SNPs.
Bayesian mixed models
For a given trait, suppose we have n individuals where individual i has phenotypes measured at ni time points (i=1, ..., n) and p SNPs. Let
Given γ, λ, and xi, we consider the following mixed model:
where yi=(yi1, ..., yini)T is an ni×1 phenotype vector of individual i; μi=μ1ni is an ni×1 overall mean vector; Γ is a diagonal matrix with upper diagonal elements 1q (i.e., the model always contains all non-genetic covariates) and lower diagonal elements
Re-parameterized model
For Bayesian estimation of the mixed model (1), we factor D, the covariance matrix of the random effects, by employing the modified Cholesky decomposition of Chen and Dunson [26]. Let L denote a k×k lower triangular Cholesky decomposition matrix that has nonnegative diagonal elements, such that D = LLT. Let L=ΔΨ, where Δ=diag(δ1, ..., δk) and Ψ is a k×k matrix with the (l, m)th element denoted by ψlm. To make Δ and Ψ identifiable, we make the following assumptions: δl≥ 0, ψll=1 and ψlm=0, for l=1, ..., k; m=l+1, ..., k. These conditions make Δ a nonnegative k× k diagonal matrix and Ψ a lower triangular matrix with 1's in the diagonal elements. This leads to the decomposition D =ΔΨΨTΔ, and thus we reparametrize model (1) as
where bi=(bi1, ...bik)T such that bij∼N(0, 1) and
Model identifiability
Model identifiability is a property that a model must satisfy for accurate inference to be possible. A model is identifiable if it is theoretically possible to estimate the true values of the underlying parameters of the model, while a model is non-identifiable or unidentifiable if two or more parametrizations are observationally equivalent [27]. The proposed Bayesian model has an identifiability issue associated with the covariance matrix of
Lemma 1 and Theorem 1 enable us to check whether a given model is identifiable. A toy example is provided below. Suppose there are 3 grid points that produce 2 time intervals. According to the theorem, the rank of A must be
Prior specifications
For the random effects of the proposed Bayesian model, we employ the priors presented by Chen and Dunson [26]. Specifically, independent half normal priors are imposed on the diagonal elements of Δ and normal priors on the lower triangular elements of Ψ. For the fixed effects, we straightforwardly extend the priors presented in Yi et al. [17, 18].
Priors on γ and λ
Let wa = P(γa = 1) be the inclusion probability of the ath genetic effect. We assume that all inclusion probabilities are independent of each other and thus the prior of γ is
Priors on b, Δ, and Ψ
In model (2), we let the distribution of each bij independently follow a standard normal distribution. Thus, the joint prior distribution of
Priors on β, μ, and σ2
The prior for the ath genetic effect is a normal distribution,
Posterior calculation and MCMC algorithm
The joint posterior distribution is proportional to the product of the likelihood and the prior distributions of all unknown parameters, which can be expressed as
where θ=(λ, β, b, δ, ψ, μ, σ2)T. To obtain MCMC samples of all parameters, we utilize the Metropolis-Hastings and Gibbs sampling algorithms, and alternately update each unknown parameter or set of unknown parameters conditional on all the other parameters and the observed data.
For γ and λ, we use the Metropolis-Hastings algorithm within Gibbs sampler since their conditional distributions have no known distributional forms. To update those parameters, we straightforwardly extend the Metropolis-Hastings algorithm proposed by Yi et al. [18] for our Bayesian model. These algorithms are described in the Supplementary Data 1. For the other parameters, we applied the Gibbs sampling algorithm. Specifically, since b, δ, and ψ have multivariate normal or half normal priors, the full conditional distributions are easy to derive by their conjugacy properties. The full conditional posterior distributions of b, δ and ψ are
Posterior analysis
The posterior samples can be used to approximate the posterior distribution of the parameters. MCMC samples from the initial iterations are discarded as “burn-in” and the subsequent samples are thinned by keeping every cth MCMC sample, where c is an integer, and discarding the rest. The posterior inclusion probability of each SNP can be calculated using its inclusion proportion in the MCMC samples as
The Bayes factor BF(κl) reflects how our belief in the importance of the lth SNP changes as we move from prior knowledge to posterior information. Jeffreys [28] and Yandell et al. [29] suggest the following criteria for judging the significance of each SNP: weak support if BF(κl) falls between 3 and 10; moderate support if BF(κl)falls between 10 and 30; and strong support if BF(κl) is larger than 30.
Choice of the number of grid points
A critical issue with the proposed Bayesian model is how to choose an optimal number of grid points, k. We achieve this goal by evaluating the goodness of the predictive distributions of our Bayesian models. Spiegelhalter et al. [30] proposed the DIC as
Implementation in gridbayes
The proposed grid-based Bayesian mixed models have been implemented in an R package named gridbayes [33], which is built on top of the R packages, qtl [34] and qtlbim [29]. The MCMC algorithm in C and the data manipulation procedure in R were modified for longitudinal analysis. The gridbayes package employs both DIC and simplified BPIC scores to select the optimal number of grid points. The software package and the source code are available for download at https://github.com/wonilchung/GridBayes.
Results
Simulation I
To evaluate the performance of the proposed method, we conducted the following simulations. We first used 400 individuals and 1, 000 SNPs from the 1000 Genome Project data. The number of measurements for each individual ranged from 3 to 7 and the total number of observations was set to 2, 000. Six different setups (Setups 1‒6) were considered. We simulated the datasets containing 10 causal SNPs, which are randomly selected (i.e., the proportion of causal SNPs = 1%) with only main effects (Setup 1). For individual i, the phenotype values were generated from the model:
To further investigate the Bayesian mixed model, we analyzed additional datasets containing two SNP-SNP interactions (Setup 2), five SNP-SNP interactions (Setup 3), two SNP-time interactions (Setup 4), five SNP-time interactions (Setup 5), or ten SNP-time interactions (Setup 6). Specifically, we simulated data according to the following models:
The one-dimensional genome-wide profiles of 2log(BF) for the combined main, epistatic effects, and SNP-time interactions of each SNP under the six setups were presented in Figs. 1 and 2. The dashed vertical lines indicate the locations of the 10 causal SNPs. The gridbayes analysis of all the data and qtlbim-sub detected the causal SNPs reasonably well, but gridbayes clearly outperformed qtlbim-sub in general. The qtlbim-all method occasionally identified the true causal SNPs, but it produced far more false-positive findings than gridbayes and qtlbim-sub.

Genome-wide profiles of 2log(BF) for all combined effects using gridbayes with all time points, qtlbim with one randomly-selected time point (qtlbim-sub) and qtlbim with all time points (qtlbim-all) for Setups 1 (A), 2 (B), and 3 (C).

Genome-wide profiles of 2log(BF) for all combined effects using gridbayes with all time points, qtlbim with one randomly-selected time point (qtlbim-sub) and qtlbim with all time points (qtlbim-all) for Setups 4 (A), 5 (B), and 6 (C).
To evaluate the performance of our Bayesian model, we further calculated the receiver operating characteristic (ROC) curves. For each setup, we conducted 100 simulations. The ROC curves with a false-positive rate less than 0.2 are presented in Fig. 3. The solid lines represent the results of gridbayes, the dot-dashed lines correspond to qtlbim-sub and the results from qtlbim-all are summarized by the long-dashed lines. The ROC curves demonstrated that gridbayes with all measurements appeared to outperform the qtlbim analyses in terms of improved true positive rates.

Receiving operating characteristic curve analyses in the simulation study for Setup 1 (A), 2 (B), 3 (C), 4 (D), 5 (E) and 6 (F). The red solid lines represent the results of gridbayes, the blue dot-dashed lines correspond to qtlbim-sub and the green long-dashed lines display the results from qtlbim-all. gridbayes: grid-based Bayesian mixed models with all the data; qtlbim-sub qtlbim with a subset of each simulated data where only one measurement from each subject was randomly selected; qtlbim-all: qtlbim with all the data by (incorrectly) assuming that all the measurements were independent. G, gene; T, time.
To diagnose the convergence of the MCMC samples, we conducted 10 parallel chains with different, over-dispersed initial values with respect to the true posterior distribution. Using 104 iterations, Geweke's Z-scores [35] for each chain based on the first 10% and last 50% of the samples indicated good convergence of all parameters. Based on 10 chains, Gelman and Rubin's potential scale reduction factors [36] were calculated, and the upper limits were less than 1.01 for all parameters. Supplementary Fig. 2 presents the trace plots of σ2, δ1, δ2, δ3, ψ21, ψ31 and ψ32 for each setup, showing that all chains moved around the true values for all parameters, indicating good convergence. We plotted the marginal posterior and prior densities of all parameters based on 10, 000 random draws (Supplementary Fig. 3). It appeared that the random draws were approximately normal, with means close to the simulated values. Supplementary Fig. 4 displays the 95% highest posterior density (HPD) intervals for σ2, δ1, δ2, δ3, ψ21, ψ31 and ψ32 for each setup. Most of the 95% HPD intervals contained the corresponding true values. Table 1 summarizes the posterior estimates of all parameters. The posterior means and medians were close to the true values and all the 95% HPD intervals contained the true values, demonstrating the good performance of our algorithm.
Simulation II
We conducted another simulation to estimate the number of true grid points using the DIC [30] and simplified BPIC [32,37]. The settings were almost the same as those in the previous simulations, except that the true number of grid points now varied from 2 to 4 (i.e., true k=2, 3, 4). We simulated 100 datasets with 400 individuals and 1, 000 SNPs containing 10 causal SNPs (i.e., the proportion of causal SNPs = 1%) with only main effects. The causal SNPs were randomly assigned. The trait-heritability was set to 40%. The phenotype values were generated from the model:
Simulation III
For a more detailed evaluation of our Bayesian method, we conducted the following simulations with 100 replications for each scenario. We first considered 400 individuals with three to seven time points, resulting in 2,000 observations, and decreased the sample size from 400 to 100 to assess the effect of sample size in ROC curves (Fig. 4A). The simulation data contained 1,000 SNPs with 1% causal SNPs (i.e., 10 causal SNPs) with only main effects. The trait values were generated from the model:

Receiving operating characteristic curve analyses in the simulation study for sample size, number of single-nucleotide polymorphisms (SNPs), proportion of causal SNPs, and heritability. (A) We decreased the sample size from n = 400 (total number of observations, N = 2, 000) to n = 100 (N = 1, 000) for accessing the effect of sample size in receiver operating characteristic curves. The simulation data contained p = 1, 000 SNPs, c = 1% causal SNPs and h2 = 40% trait-heritability. (B) We increased the number of SNPs from p = 1, 000 to p = 5, 000 to evaluate the effect of number of SNPs. The simulation data contained N = 2, 000 observations, c = 1% causal SNPs, and h2 = 40% trait-heritability. (C) We increased the proportion of causal SNPs from c = 1% to 5%. The simulation data contained N = 2, 000 observations, p = 1, 000 SNPs, and h2 = 40% trait-heritability. (D) We decreased the trait-heritability from h2 = 40% to h2 = 10%. The simulation data contained N = 2, 000 observations, p = 1, 000 SNPs, and c = 1% causal SNPs.
Discussion
We developed a grid-based Bayesian mixed model for longitudinal genetic data with a built-in variable selection feature. The proposed Bayesian method modeled multiple candidate SNPs simultaneously and allowed SNP-SNP and SNP-time interactions, which enabled us to identify SNPs with time-varying effects. Such SNPs are of great scientific and medical interest. In addition, we proposed a new grid-based method to model the covariance structure nonparametrically. Not only is the proposed method parsimonious in estimating the covariance matrix, but also by employing a reasonable number of grid-points, it can flexibly approximate any type of covariance structure. The number of grid points was pre-set, but DIC and simplified BPIC can be used to select the optimal number.
The simulation studies showed that the proposed Bayesian method using all time points outperformed the ordinary Bayesian method with one or all time points included. As expected, the proposed method that utilized the full data was more powerful than the corresponding univariate analysis method that only used a subset of the data. Furthermore, the proposed Bayesian method performed better than the ordinary Bayesian method because our method modeled the within-subject correlation. Further simulation studies showed that statistical power increased as the data had more samples, a smaller number of SNPs, a lower proportion of causal SNPs, and larger trait-heritability. For our simulation studies, we utilized data from the 1000 Genome Project. With only 400 independent samples of EAS ancestry, we restricted out analysis with up to 5, 000 SNPs. With a sufficient sample size, our method can be applied to all available SNPs. We are currently developing a parallel computing algorithm based on the message passing interface to execute multiple groups of SNPs simultaneously. This will make it feasible to apply our method to large-sample GWAS data.
Another important issue to mention is Bayesian model identifiability. In the Bayesian community, there is a wide diversity of views on the identifiability issue. Lindley [38] remarked that non-identifiability causes no real difficulty in Bayesian approaches. Poirier [39] and Eberly and Carlin [40] argued that a Bayesian analysis of a non-identifiable model is always possible if priors on all of the parameters are proper, since proper priors yield proper posterior distributions, and hence every parameter can be well-estimated. However, if the priors imposed on any non-identifiable model are not proper, or too close to being improper, ill-behaved posterior distributions may be generated such that the trajectory of the parameters can drift to extreme values, as demonstrated by Gelfand and Sahu [41]. In this paper, we investigated the identifiability of our Bayesian model, which motivated us to utilize only proper priors (see the Methods section). Non-identifiability occurred when the number of the grid points equaled the number of observed time points (see Supplementary Data 1), but we found that the posterior distribution behaved well due to the proper priors employed.
Notes
Authors’ Contribution
Conceptualization: WC. Data curation: WC, YC. Formal analysis: WC, YC. Funding acquisition: WC. Methodology: WC. Writing - original draft: WC, YC. Writing - review & editing: WC.
Conflicts of Interest
No potential conflict of interest relevant to this article was reported.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (2020R1C1C1A01012657) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1A6A1A10044154). This work was supported by the Soongsil University Research Fund.
Supplementary Materials
Supplementary data can be found with this article online at http://www.genominfo.org.
Time-dependent curves of averaged phenotype values for three different genotypes (0, 1, 2) at 1st causal single nucleotide polymorphism (SNP) with no SNP-time interaction and 10th causal SNP with SNP-time interaction for Setups 4 and 5.
Trace plots of σ2, δ1, δ2, δ3, ψ21, ψ31 and ψ22 for Setups 1‒6 in the simulation study.The black lines represent the values of the draws for all parameters at each iteration and gray lines represent the true values of the parameters.
Posterior (solid line) and prior (dashed line) densities of the parameters for random errors and random effects for Setups 1‒6. Estimated densities are based on 10, 000 random draws.
95% highest posterior density (HPD) intervals for σ2, δ1, δ2, δ3, ψ21, ψ31 and ψ22 for Setups 1‒6. The blue lines represent the 95% HPD intervals (100 replicates).
Posterior means, medians, standard deviations and 95% HPD intervals of the parameters for random errors and random effects in the simulation study for sample size
Posterior means, medians, standard deviations, and 95% HPD intervals of the parameters for random errors and random effects in the simulation study for number of SNPs
Posterior means, medians, standard deviations, and 95% HPD intervals of the parameters for random errors and random effects in the simulation study for proportion of causal SNPs
Posterior means, medians, standard deviations, and 95% HPD intervals of the parameters for random errors and random effects in the simulation study for heritability
Average DIC scores and simplified BPIC scores over 100 replications in the simulation study for sample size, number of SNPs, proportion of causal SNPs and heritability