^{1}

^{2}

^{*}

^{1}

Despite the success of recent genome-wide association studies investigating longitudinal traits, a large fraction of overall heritability remains unexplained. This suggests that some of the missing heritability may be accounted for by gene-gene and gene-time/environment interactions. In this paper, we develop a Bayesian variable selection method for longitudinal genetic data based on mixed models. The method jointly models the main effects and interactions of all candidate genetic variants and non-genetic factors and has higher statistical power than previous approaches. To account for the within-subject dependence structure, we propose a grid-based approach that models only one fixed-dimensional covariance matrix, which is thus applicable to data where subjects have different numbers of time points. We provide the theoretical basis of our Bayesian method and then illustrate its performance using data from the 1000 Genome Project with various simulation settings. Several simulation studies show that our multivariate method increases the statistical power compared to the corresponding univariate method and can detect gene-time/environment interactions well. We further evaluate our method with different numbers of individuals, variants, and causal variants, as well as different trait-heritability, and conclude that our method performs reasonably well with various simulation settings.

In recent years, genome-wide association studies (GWAS) for longitudinal traits (e.g., body weight or cholesterol levels) have been carried out in cohorts, where multiple measurements have been collected from each individual [

There are methodological challenges associated with the genetic analysis of longitudinal traits for multiple variants. Most complex traits are typically controlled by multiple variants that interact with each other or environmental factors. It may be exceedingly difficult to model all candidate variants with epistasis effect and gene-time/environment interactions for longitudinal traits because genetic data are generally high-dimensional relative to the number of samples. Bayesian multiple quantitative trait loci (QTL) mapping methods [

Several statistical methods have been proposed for dealing with within-subject variation. For data collected at the same time points across all individuals, the measured values at each time point can be treated as one variable. The data can then be treated as multivariate outcomes and jointly analyzed [

In this paper, we develop a Bayesian variable selection method for longitudinal data where phenotypes are not measured at a fixed set of time points for all samples. It jointly models the main and pairwise interactions of all candidate genetic variants. We propose a novel grid-based approach to parsimoniously model each subject's covariance matrix as a function of a covariance matrix defined on a set of pre-selected time points where each observed time point is mapped to its two adjacent grid time points via linear interpolation. This approach thus deals only with a covariance matrix of a fixed dimension. The covariance matrix is then modeled nonparametrically using the modified Cholesky decomposition of Chen and Dunson [

For our simulation studies, we utilized the whole-genome sequencing data from the 1000 Genome Project, which created a catalogue of common human variations using samples from people who provided open consent who declared themselves healthy. It ran between 2008 and 2015, generating a large public catalogue of human variations and genotype data. We randomly selected 400 out of 504 individuals of East Asian (EAS) ancestry from the 1000 Genome Project data (phase 3 version 5) and then removed single-nucleotide polymorphisms (SNPs) with a minor allele frequency <5% and p(Hardy-Weinberg equilibrium) <10^{-6}, which resulted in 6, 247, 288 SNPs.

For a given trait, suppose we have _{i}_{1}, ..., _{d}^{T} as the SNP positions associated with the above genetic effects, where _{1}, ..., _{d}^{T} for the selection of genetic effects to be included in (_{i}_{i}_{ti}_{i}_{gi}_{i}_{ggi}_{gti}_{i}_{i}_{ti}_{gi}_{ggi}_{gti}

Given _{i}

where _{i}_{i1}, ..., _{ini}^{T} is an _{i}_{i}_{ni} is an _{i}_{q} (i.e., the model always contains all non-genetic covariates) and lower diagonal elements _{i}_{i}_{i}^{2}I_{ni}_{1}, ..., _{k}^{T}, and define _{i}_{i}_{1}, ..., _{n}_{i}_{i}_{r}_{r+1} (_{r}_{r+1}), we set _{r}_{ij}_{ij}_{ij1}_{1}+...+_{ijk}e_{k}_{ijr}_{ij}_{r}_{ijr}

For Bayesian estimation of the mixed model (1), we factor ^{T}_{1}, ..., _{k}_{lm}_{l}_{ll}_{lm}^{T}Δ

where _{i}_{i1}, ..._{ik}^{T} such that _{ij}_{i}_{i}ΔΨ_{i1}, ..., _{ini}^{T} and _{1}, ..., _{n}

Model identifiability is a property that a model must satisfy for accurate inference to be possible. A model is identifiable if it is theoretically possible to estimate the true values of the underlying parameters of the model, while a model is non-identifiable or unidentifiable if two or more parametrizations are observationally equivalent [^{T}+^{2}I_{N}_{n}_{ijr}

Lemma 1 and Theorem 1 enable us to check whether a given model is identifiable. A toy example is provided below. Suppose there are 3 grid points that produce 2 time intervals. According to the theorem, the rank of ^{T}+^{2}I_{N}^{2}=0 and modeling

For the random effects of the proposed Bayesian model, we employ the priors presented by Chen and Dunson [

Let _{a}_{a}_{a}

In model (2), we let the distribution of each _{ij}_{l}^{T} and _{ml}^{T}. The prior distribution for _{0} and _{0} are pre-specified hyperparameters.

The prior for the ^{2} distribution, _{β}_{β}_{a}_{i}_{a}_{a}_{a}^{2} is chosen as an scaled inverse ^{2} distribution,

The joint posterior distribution is proportional to the product of the likelihood and the prior distributions of all unknown parameters, which can be expressed as

where ^{2})^{T}. To obtain MCMC samples of all parameters, we utilize the Metropolis-Hastings and Gibbs sampling algorithms, and alternately update each unknown parameter or set of unknown parameters conditional on all the other parameters and the observed data.

For _{-f} represents all the elements of ^{*}, ^{*}, and ^{2} are ^{*},

The posterior samples can be used to approximate the posterior distribution of the parameters. MCMC samples from the initial iterations are discarded as “burn-in” and the subsequent samples are thinned by keeping every _{l}_{l}

The Bayes factor _{l}_{l}_{l}_{l}

A critical issue with the proposed Bayesian model is how to choose an optimal number of grid points, _{D}_{D}

The proposed grid-based Bayesian mixed models have been implemented in an R package named gridbayes [

To evaluate the performance of the proposed method, we conducted the following simulations. We first used 400 individuals and 1, 000 SNPs from the 1000 Genome Project data. The number of measurements for each individual ranged from 3 to 7 and the total number of observations was set to 2, 000. Six different setups (Setups 1‒6) were considered. We simulated the datasets containing 10 causal SNPs, which are randomly selected (i.e., the proportion of causal SNPs = 1%) with only main effects (Setup 1). For individual _{ia}_{g1} is used to set trait-heritability to 40%, _{i}_{i1}, ..., _{ini}^{T} were the time covariates generated from the uniform distribution _{i}^{2}_{ni}^{2} =1. The true number of grid points was set to 3 (i.e., true _{i}_{i}_{1}, _{2}, _{3}) = (1, 1.2, 0.8) and _{21}, _{31}, _{32})=(0.6, 0.4, 0.6). That is, _{i}_{21}, _{31}, _{32})=(0.72, 0.32, 0.81). The prior distributions for the elements in ^{+} (0, 30) and the prior distributions for the elements in ^{5} iterations after discarding the first 1, 000 burn-in iterations. The remaining samples were further thinned for every 40 iterations, yielding 10^{4} MCMC samples for the posterior analysis.

To further investigate the Bayesian mixed model, we analyzed additional datasets containing two SNP-SNP interactions (Setup 2), five SNP-SNP interactions (Setup 3), two SNP-time interactions (Setup 4), five SNP-time interactions (Setup 5), or ten SNP-time interactions (Setup 6). Specifically, we simulated data according to the following models: _{gj}

The one-dimensional genome-wide profiles of 2

To evaluate the performance of our Bayesian model, we further calculated the receiver operating characteristic (ROC) curves. For each setup, we conducted 100 simulations. The ROC curves with a false-positive rate less than 0.2 are presented in

To diagnose the convergence of the MCMC samples, we conducted 10 parallel chains with different, over-dispersed initial values with respect to the true posterior distribution. Using 10^{4} iterations, Geweke's Z-scores [^{2}, _{1}, _{2}, _{3}, _{21}, _{31} and _{32} for each setup, showing that all chains moved around the true values for all parameters, indicating good convergence. We plotted the marginal posterior and prior densities of all parameters based on 10, 000 random draws (^{2}, _{1}, _{2}, _{3}, _{21}, _{31} and _{32} for each setup. Most of the 95% HPD intervals contained the corresponding true values.

We conducted another simulation to estimate the number of true grid points using the DIC [_{ia}_{i}_{i1}, ..., _{ini}^{T} are the time points of the _{1}, _{2}, _{3}, _{4})=(1, 1.2, 0.8, 0.7) and (_{21}, _{31}, _{32}, _{41}, _{42}, _{43})=(0.6, 0.4, 0.6, 0.2, 0.4, 0.6).

For a more detailed evaluation of our Bayesian method, we conducted the following simulations with 100 replications for each scenario. We first considered 400 individuals with three to seven time points, resulting in 2,000 observations, and decreased the sample size from 400 to 100 to assess the effect of sample size in ROC curves (_{ia}_{i}_{i1}, ..., t_(i_{i}^{T} are the time points of the ^{2} = 1, _{1}, _{2}, _{3}) = (1, 1.2, 0.8), _{21}, _{31}, _{32})=(0.6, 0.4, 0.6). The trait-heritability was set to 40%. As the sample size decreased from 400 to 100, the true positive rates decreased in ROC curves, indicating that including more samples increased the true positive rates with fixed false positive rates. Next, we evaluated the effect of the number of SNPs (

We developed a grid-based Bayesian mixed model for longitudinal genetic data with a built-in variable selection feature. The proposed Bayesian method modeled multiple candidate SNPs simultaneously and allowed SNP-SNP and SNP-time interactions, which enabled us to identify SNPs with time-varying effects. Such SNPs are of great scientific and medical interest. In addition, we proposed a new grid-based method to model the covariance structure nonparametrically. Not only is the proposed method parsimonious in estimating the covariance matrix, but also by employing a reasonable number of grid-points, it can flexibly approximate any type of covariance structure. The number of grid points was pre-set, but DIC and simplified BPIC can be used to select the optimal number.

The simulation studies showed that the proposed Bayesian method using all time points outperformed the ordinary Bayesian method with one or all time points included. As expected, the proposed method that utilized the full data was more powerful than the corresponding univariate analysis method that only used a subset of the data. Furthermore, the proposed Bayesian method performed better than the ordinary Bayesian method because our method modeled the within-subject correlation. Further simulation studies showed that statistical power increased as the data had more samples, a smaller number of SNPs, a lower proportion of causal SNPs, and larger trait-heritability. For our simulation studies, we utilized data from the 1000 Genome Project. With only 400 independent samples of EAS ancestry, we restricted out analysis with up to 5, 000 SNPs. With a sufficient sample size, our method can be applied to all available SNPs. We are currently developing a parallel computing algorithm based on the message passing interface to execute multiple groups of SNPs simultaneously. This will make it feasible to apply our method to large-sample GWAS data.

Another important issue to mention is Bayesian model identifiability. In the Bayesian community, there is a wide diversity of views on the identifiability issue. Lindley [

Conceptualization: WC. Data curation: WC, YC. Formal analysis: WC, YC. Funding acquisition: WC. Methodology: WC. Writing - original draft: WC, YC. Writing - review & editing: WC.

No potential conflict of interest relevant to this article was reported.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (2020R1C1C1A01012657) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1A6A1A10044154). This work was supported by the Soongsil University Research Fund.

Supplementary data can be found with this article online at

Time-dependent curves of averaged phenotype values for three different genotypes (0, 1, 2) at 1st causal single nucleotide polymorphism (SNP) with no SNP-time interaction and 10th causal SNP with SNP-time interaction for Setups 4 and 5.

Trace plots of ^{2}, _{1}, _{2}, _{3}, _{21}, _{31} and _{22} for Setups 1‒6 in the simulation study.The black lines represent the values of the draws for all parameters at each iteration and gray lines represent the true values of the parameters.

Posterior (solid line) and prior (dashed line) densities of the parameters for random errors and random effects for Setups 1â€’6. Estimated densities are based on 10, 000 random draws.

95% highest posterior density (HPD) intervals for ^{2}, _{1}, _{2}, _{3}, _{21}, _{31} and _{22} for Setups 1‒6. The blue lines represent the 95% HPD intervals (100 replicates).

Posterior means, medians, standard deviations and 95% HPD intervals of the parameters for random errors and random effects in the simulation study for sample size

Posterior means, medians, standard deviations, and 95% HPD intervals of the parameters for random errors and random effects in the simulation study for number of SNPs

Posterior means, medians, standard deviations, and 95% HPD intervals of the parameters for random errors and random effects in the simulation study for proportion of causal SNPs

Posterior means, medians, standard deviations, and 95% HPD intervals of the parameters for random errors and random effects in the simulation study for heritability

Average DIC scores and simplified BPIC scores over 100 replications in the simulation study for sample size, number of SNPs, proportion of causal SNPs and heritability

Genome-wide profiles of 2

Genome-wide profiles of 2

Receiving operating characteristic curve analyses in the simulation study for Setup 1 (A), 2 (B), 3 (C), 4 (D), 5 (E) and 6 (F). The red solid lines represent the results of gridbayes, the blue dot-dashed lines correspond to qtlbim-sub and the green long-dashed lines display the results from qtlbim-all. gridbayes: grid-based Bayesian mixed models with all the data; qtlbim-sub qtlbim with a subset of each simulated data where only one measurement from each subject was randomly selected; qtlbim-all: qtlbim with all the data by (incorrectly) assuming that all the measurements were independent. G, gene; T, time.

Receiving operating characteristic curve analyses in the simulation study for sample size, number of single-nucleotide polymorphisms (SNPs), proportion of causal SNPs, and heritability. (A) We decreased the sample size from ^{2} = 40% trait-heritability. (B) We increased the number of SNPs from ^{2} = 40% trait-heritability. (C) We increased the proportion of causal SNPs from c = 1% to 5%. The simulation data contained N = 2, 000 observations, ^{2} = 40% trait-heritability. (D) We decreased the trait-heritability from h^{2} = 40% to h^{2} = 10%. The simulation data contained N = 2, 000 observations,

Posterior means, medians, standard deviations, and 95% HPD intervals of the parameters for random errors and random effects in the simulation study

Setup | Par | True | Mean | Med | SD | 95% HPD | Setup | Par | True | Mean | Med | SD | 95% HPD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | ^{2} |
1 | 1.01 | 1.01 | 0.04 | 0.92 to 1.09 | 2 | ^{2} |
1 | 1.01 | 1.01 | 0.04 | 0.93 to 1.10 |

_{1} |
1 | 0.98 | 0.98 | 0.11 | 0.77 to 1.19 | _{1} |
1 | 0.98 | 0.98 | 0.10 | 0.78 to 1.19 | ||

_{2} |
1.2 | 1.19 | 1.19 | 0.13 | 0.92 to 1.43 | _{2} |
1.2 | 1.19 | 1.19 | 0.13 | 0.92 to 1.43 | ||

_{3} |
0.8 | 0.76 | 0.76 | 0.13 | 0.5 to 1.00 | _{3} |
0.8 | 0.72 | 0.73 | 0.13 | 0.48 to 0.97 | ||

_{21} |
0.6 | 0.65 | 0.63 | 0.21 | 0.31 to 1.13 | _{21} |
0.6 | 0.67 | 0.64 | 0.21 | 0.33 to 1.14 | ||

_{31} |
0.4 | 0.52 | 0.51 | 0.24 | 0.09 to 1.04 | _{31} |
0.4 | 0.51 | 0.49 | 0.24 | 0.07 to 1.01 | ||

_{32} |
0.6 | 0.56 | 0.53 | 0.27 | 0.11 to 1.18 | _{32} |
0.6 | 0.67 | 0.64 | 0.29 | 0.20 to 1.34 | ||

3 | ^{2} |
1 | 1.00 | 1.00 | 0.04 | 0.92 to 1.08 | 4 | ^{2} |
1 | 1.00 | 1.00 | 0.04 | 0.92 to 1.09 |

_{1} |
1 | 1.06 | 1.06 | 0.10 | 0.85 to 1.26 | _{1} |
1 | 0.99 | 0.99 | 0.10 | 0.78 to 1.19 | ||

_{2} |
1.2 | 1.20 | 1.20 | 0.13 | 0.94 to 1.44 | _{2} |
1.2 | 1.18 | 1.19 | 0.13 | 0.92 to 1.42 | ||

_{3} |
0.8 | 0.74 | 0.74 | 0.12 | 0.49 to 0.98 | _{3} |
0.8 | 0.74 | 0.74 | 0.13 | 0.48 to 0.98 | ||

_{21} |
0.6 | 0.69 | 0.67 | 0.20 | 0.36 to 1.14 | _{21} |
0.6 | 0.62 | 0.60 | 0.20 | 0.29 to 1.07 | ||

_{31} |
0.4 | 0.65 | 0.64 | 0.24 | 0.22 to 1.17 | _{31} |
0.4 | 0.47 | 0.45 | 0.24 | 0.03 to 0.96 | ||

_{32} |
0.6 | 0.65 | 0.62 | 0.27 | 0.19 to 1.27 | _{32} |
0.6 | 0.65 | 0.61 | 0.29 | 0.18 to 1.31 | ||

5 | ^{2} |
1 | 1.00 | 1.00 | 0.04 | 0.92 to 1.09 | 6 | ^{2} |
1 | 1.00 | 1.00 | 0.04 | 0.92 to 1.09 |

_{1} |
1 | 0.98 | 0.98 | 0.10 | 0.77 to 1.18 | _{1} |
1 | 0.96 | 0.96 | 0.11 | 0.75 to 1.17 | ||

_{2} |
1.2 | 1.20 | 1.20 | 0.13 | 0.93 to 1.43 | _{2} |
1.2 | 1.18 | 1.19 | 0.13 | 0.92 to 1.41 | ||

_{3} |
0.8 | 0.72 | 0.72 | 0.13 | 0.47 to 0.97 | _{3} |
0.8 | 0.75 | 0.75 | 0.13 | 0.48 to 1.01 | ||

_{21} |
0.6 | 0.63 | 0.60 | 0.20 | 0.29 to 1.09 | _{21} |
0.6 | 0.61 | 0.58 | 0.20 | 0.27 to 1.07 | ||

_{31} |
0.4 | 0.46 | 0.45 | 0.25 | 0.01 to 0.99 | _{31} |
0.4 | 0.32 | 0.31 | 0.24 | –0.14 to 0.81 | ||

_{32} |
0.6 | 0.65 | 0.61 | 0.29 | 0.18 to 1.31 | _{32} |
0.6 | 0.67 | 0.63 | 0.30 | 0.19 to 1.35 |

HPD, highest posterior density; Par, parameters; True, true values of parameters; Med, median; SD, standard deviation.

Average DIC scores and simplified BPIC scores over 100 replications and the proportion selecting the model with the correct number of grid points using the proposed Bayesian model

True k | k | Avg DIC | #Sel (%) | Avg Sim BPIC | #Sel (%) | Avg _{D} |
---|---|---|---|---|---|---|

2 | 2 | 6,614.47 | 79 | 6,681.78 | 94 | 67.32 |

3 | 6,617.97 | 15 | 6,690.51 | 6 | 72.54 | |

4 | 6,623.02 | 6 | 6,700.99 | 0 | 77.98 | |

3 | 2 | 6,745.26 | 0 | 6,812.49 | 0 | 67.23 |

3 | 6,696.53 | 91 | 6,769.83 | 98 | 73.30 | |

4 | 6,707.12 | 9 | 6,785.14 | 2 | 78.02 | |

4 | 2 | 6,745.07 | 0 | 6,814.76 | 0 | 69.70 |

3 | 6,718.66 | 0 | 6,792.75 | 7 | 74.09 | |

4 | 6,695.41 | 100 | 6,775.37 | 93 | 79.96 |

DIC, deviance information criterion; BPIC, Bayesian predictive information criterion; Avg DIC, average deviance information criterion scores over 100 replications; #Sel (%), proportion selecting the model with the correct number of grid points; Avg Sim BPIC, average simplified Bayesian predictive information criterion scores over 100 replications; Avg PD, average _{D}

Simulation settings for all simulation setups in the Results section based on genetic effect terms, the number of grid points (k), sample size (n), number of observations (N), number of SNPs (p), number of causal SNPs (c), and trait-heritability (h^{2})

Simulation | Setup | Genetic effect terms | k | n | N | p | c (%) | h^{2 }(%) |
---|---|---|---|---|---|---|---|---|

I | 1 | Only main effects | 3 | 400 | 2,000 | ,000 | 1 | 40 |

2 | Two SNP-SNP interactions | 3 | 400 | 2,000 | 1,000 | 1 | 40 | |

3 | Five SNP-SNP interactions | 3 | 400 | 2,000 | 1,000 | 1 | 40 | |

4 | Two SNP-time interactions | 3 | 400 | 2,000 | 1,000 | 1 | 40 | |

5 | Five SNP-time interactions | 3 | 400 | 2,000 | 1,000 | 1 | 40 | |

6 | Ten SNP-time interactions | 3 | 400 | 2,000 | 1,000 | 1 | 40 | |

II | 1 | Only main effects | 2 | 400 | 2,000 | 1,000 | 1 | 40 |

2 | Only main effects | 3 | 400 | 2,000 | 1,000 | 1 | 40 | |

3 | Only main effects | 4 | 400 | 2,000 | 1,000 | 1 | 40 | |

III (a) | 1 | Only main effects | 3 | 300 | 1,500 | 1,000 | 1 | 40 |

2 | Only main effects | 3 | 200 | 1,000 | 1,000 | 1 | 40 | |

3 | Only main effects | 3 | 100 | 500 | 1,000 | 1 | 40 | |

III (b) | 1 | Only main effects | 3 | 400 | 2,000 | 2,000 | 1 | 40 |

2 | Only main effects | 3 | 400 | 2,000 | 3,000 | 1 | 40 | |

3 | Only main effects | 3 | 400 | 2,000 | 5,000 | 1 | 40 | |

III (c) | 1 | Only main effects | 3 | 400 | 2,000 | 1,000 | 2 | 40 |

2 | Only main effects | 3 | 400 | 2,000 | 1,000 | 3 | 40 | |

3 | Only main effects | 3 | 400 | 2,000 | 1,000 | 5 | 40 | |

III (d) | 1 | Only main effects | 3 | 400 | 2,000 | 1,000 | 1 | 40 |

2 | Only main effects | 3 | 400 | 2,000 | 1,000 | 1 | 30 | |

3 | Only main effects | 3 | 400 | 2,000 | 1,000 | 1 | 20 |

SNP, single nucleotide polymorphism.