# Bayesian mixed models for longitudinal genetic data: theory, concepts, and simulation studies

## Article information

## Abstract

Despite the success of recent genome-wide association studies investigating longitudinal traits, a large fraction of overall heritability remains unexplained. This suggests that some of the missing heritability may be accounted for by gene-gene and gene-time/environment interactions. In this paper, we develop a Bayesian variable selection method for longitudinal genetic data based on mixed models. The method jointly models the main effects and interactions of all candidate genetic variants and non-genetic factors and has higher statistical power than previous approaches. To account for the within-subject dependence structure, we propose a grid-based approach that models only one fixed-dimensional covariance matrix, which is thus applicable to data where subjects have different numbers of time points. We provide the theoretical basis of our Bayesian method and then illustrate its performance using data from the 1000 Genome Project with various simulation settings. Several simulation studies show that our multivariate method increases the statistical power compared to the corresponding univariate method and can detect gene-time/environment interactions well. We further evaluate our method with different numbers of individuals, variants, and causal variants, as well as different trait-heritability, and conclude that our method performs reasonably well with various simulation settings.

## Introduction

In recent years, genome-wide association studies (GWAS) for longitudinal traits (e.g., body weight or cholesterol levels) have been carried out in cohorts, where multiple measurements have been collected from each individual [1-7]. Although GWAS have successfully discovered a large number of novel genetic variants associated with these traits, the identified variants typically account for only a small proportion of overall heritability [8-10]. A presumed explanation for the “missing heritability” is that existing methods have low power to identify gene-gene and gene-time/environment interactions [11]. Since traditional methodologies are limited to the identification of variants with marginal effects using a single measurement per individual, a large amount of useful information in longitudinal data is lost and variants that interact with other variants or have time-varying effects may not be detected [12]. It is more appropriate to analyze multiple variants simultaneously, using all available measurements, for longitudinal genetic studies.

There are methodological challenges associated with the genetic analysis of longitudinal traits for multiple variants. Most complex traits are typically controlled by multiple variants that interact with each other or environmental factors. It may be exceedingly difficult to model all candidate variants with epistasis effect and gene-time/environment interactions for longitudinal traits because genetic data are generally high-dimensional relative to the number of samples. Bayesian multiple quantitative trait loci (QTL) mapping methods [13-16] have been proposed for modeling epistatic effects. Multiple QTL can be simultaneously detected by treating the number of QTL as a random variable using the reversible jump Markov-chain Monte Carlo (MCMC) method [13,14]. Alternatively, multiple QTL can be viewed as a variable selection problem [15,16]. Bayesian model selection approaches are used for identifying QTL with main and epistatic effects [17], as well as QTL that interact with other covariates [18] based on the composite model space framework. These approaches use a fixed-dimensional parameter space by setting an upper bound on the number of detectable QTL and introduce latent binary variables for deciding which variables will be included in the model. This technique reasonably reduces the model space using efficient MCMC algorithms. For multiple QTL mapping with multivariate traits, Banerjee et al. [19] extended the Bayesian variable selection method of Yi [16] via a model that allows different genetic models for different traits. This method provides a multiple QTL mapping strategy for correlated traits, but it does not account for the dependence structure among repeated measurements from each subject.

Several statistical methods have been proposed for dealing with within-subject variation. For data collected at the same time points across all individuals, the measured values at each time point can be treated as one variable. The data can then be treated as multivariate outcomes and jointly analyzed [20-23]. For data collected at different time points across some or all individuals, the measured values cannot be effectively grouped; thus, standard multivariate analysis is no longer applicable. Alternatively, mixed models are used for longitudinal data to map QTL [24]. Mixed models are flexible in modeling such unbalanced data because they allow non-constant correlations among observations. Chung and Zou [25] developed a Bayesian multiple association-mapping algorithm based on a mixed model with a built-in variable selection feature. It models multiple genes simultaneously and allows gene-gene and gene-time/environment interactions for repeatedly measured phenotypes. However, in that model, we made the strong assumption that the covariance matrix is known up to a constant. We plan to relax that assumption here.

In this paper, we develop a Bayesian variable selection method for longitudinal data where phenotypes are not measured at a fixed set of time points for all samples. It jointly models the main and pairwise interactions of all candidate genetic variants. We propose a novel grid-based approach to parsimoniously model each subject's covariance matrix as a function of a covariance matrix defined on a set of pre-selected time points where each observed time point is mapped to its two adjacent grid time points via linear interpolation. This approach thus deals only with a covariance matrix of a fixed dimension. The covariance matrix is then modeled nonparametrically using the modified Cholesky decomposition of Chen and Dunson [26], which facilitates the use of normal conjugate priors. The deviance information criterion (DIC) and the Bayesian predictive information criterion (BPIC) are proposed for the selection of an optimal number of grid points. The paper is organized as follows. In the Methods section, we introduce a novel grid-based Bayesian method for longitudinal genetic data and provide its theoretical basis. In the Results section, we show numerous simulation results using whole-genome sequencing data from the 1000 Genome Project to evaluate the performance of the proposed methods and assess the effects of sample size, number of variants, causal variants, and heritability. We conclude the paper with some discussions on the proposed methods and future research.

## Methods

### Genotype data

For our simulation studies, we utilized the whole-genome sequencing data from the 1000 Genome Project, which created a catalogue of common human variations using samples from people who provided open consent who declared themselves healthy. It ran between 2008 and 2015, generating a large public catalogue of human variations and genotype data. We randomly selected 400 out of 504 individuals of East Asian (EAS) ancestry from the 1000 Genome Project data (phase 3 version 5) and then removed single-nucleotide polymorphisms (SNPs) with a minor allele frequency <5% and p(Hardy-Weinberg equilibrium) <10^{-6}, which resulted in 6, 247, 288 SNPs.

### Bayesian mixed models

For a given trait, suppose we have *n* individuals where individual *i* has phenotypes measured at *n _{i}* time points (

*i*=1, ...,

*n*) and

*p*SNPs. Let

*p*, the number of SNP-SNP interaction terms to

*pq*, where

*q*is the number of covariates in the model, including time. We define

*λ*= (

*λ*

_{1}, ...,

*λ*)

_{d}^{T}as the SNP positions associated with the above genetic effects, where

*γ*=(

*γ*

_{1}, ...,

*γ*)

_{d}^{T}for the selection of genetic effects to be included in (

*γ*=1) or excluded from (

_{i}*γ*=0) the model. The vector (

_{i}*γ*,

*λ*) determines the number and positions of SNPs. For the

*i*th individual,

*x*denotes the

_{ti}*n*×

_{i}*q*design matrix of time/environmental covariates,

*x*denotes the

_{gi}*n*×

_{i}*p*design matrix of the

*p*SNPs,

*x*denotes the

_{ggi}*x*denotes the

_{gti}*n*×

_{i}*pq*design matrix of the SNP-time/SNP-environment interactions. We define the final design matrix as

*x*=(

_{i}*x*,

_{ti}*x*,

_{gi}*x*,

_{ggi}*x*).

_{gti}Given *γ*, *λ*, and *x _{i}*, we consider the following mixed model:

where *y _{i}*=(

*y*

_{i1}, ...,

*y*)

_{ini}^{T}is an

*n*×1 phenotype vector of individual

_{i}*i*;

*μ*=

_{i}*μ*1

_{ni}is an

*n*×1 overall mean vector;

_{i}*Γ*is a diagonal matrix with upper diagonal elements 1

_{q}(i.e., the model always contains all non-genetic covariates) and lower diagonal elements

*e*is an

_{i}*n*×1 vector of random errors with

_{i}*e*∼

_{i}*N*(0,

*σ*). To model the correlation among repeated measurements of the same individual, we partition the observed time interval by

^{2}I_{ni}*k*pre-specified grid points,

*t*=(

*t*

_{1}, ...,

*t*)

_{k}^{T}, and define

*ν*as a

_{i}*k*×1 vector of random effects at the grid time points with

*ν*∼

_{i}*N*(0,

*D*) where

*D*is a

*k*×

*k*covariance matrix. Let

*P*=

*diag*(

*p*

_{1}, ...,

*p*) where

_{n}*p*is defined as follows. If all subjects have

_{i}*k*observations measured exactly on the

*k*grid time points, then

*p*becomes an identity matrix. We apply an interpolation procedure (e.g., linear, polynomial, or spline) to any observation that does not fall on any of the

_{i}*k*grid time points. For simplicity, we choose a linear interpolation here. When the

*j*th measurement of individual

*i*falls at time

*t*, which is in between the grid points

*t*and

_{r}*t*

_{r+1}(

*t*≤

_{r}*t*≤

*t*

_{r+1}), we set

*t*=

*t*, we get

_{r}*p*as

_{ij}*p*=

_{ij}*a*

_{ij1}

*e*

_{1}+...+

*a*, where

_{ijk}e_{k}*a*is the

_{ijr}*r*th element of

*p*and

_{ij}*e*(1 ≤

_{r}*r*≤

*k*) is a 1×

*k*vector whose elements are all zero except the

*r*th component, which equals 1. Note that

*a*values can be non-zero due to the linear interpolation we employ here.

_{ijr}### Re-parameterized model

For Bayesian estimation of the mixed model (1), we factor *D*, the covariance matrix of the random effects, by employing the modified Cholesky decomposition of Chen and Dunson [26]. Let *L* denote a *k*×*k* lower triangular Cholesky decomposition matrix that has nonnegative diagonal elements, such that *D* = *LL ^{T}*. Let

*L*=

*ΔΨ*, where

*Δ*=diag(

*δ*

_{1}, ...,

*δ*) and

_{k}*Ψ*is a

*k*×

*k*matrix with the (

*l, m*)th element denoted by

*ψ*. To make

_{lm}*Δ*and

*Ψ*identifiable, we make the following assumptions:

*δ*≥ 0,

_{l}*ψ*=1 and

_{ll}*ψ*=0, for

_{lm}*l*=1, ...,

*k*;

*m*=l+1, ...,

*k*. These conditions make

*Δ*a nonnegative

*k*×

*k*diagonal matrix and

*Ψ*a lower triangular matrix with 1's in the diagonal elements. This leads to the decomposition

*D*=

*ΔΨΨ*, and thus we reparametrize model (1) as

^{T}Δwhere *b _{i}*=(

*b*

_{i1}, ...

*b*)

_{ik}^{T}such that

*b*∼

_{ij}*N*(0, 1) and

*j*=1, ...,

*k*. For later use, we define

*v*=

_{i}*p*=(

_{i}ΔΨ*v*

_{i1}, ...,

*v*)

_{ini}^{T}and

*v*=diag(

*v*

_{1}, ...,

*v*).

_{n}### Model identifiability

Model identifiability is a property that a model must satisfy for accurate inference to be possible. A model is identifiable if it is theoretically possible to estimate the true values of the underlying parameters of the model, while a model is non-identifiable or unidentifiable if two or more parametrizations are observationally equivalent [27]. The proposed Bayesian model has an identifiability issue associated with the covariance matrix of *PDP*^{T}+*σ ^{2}I_{N}* where

*D*=

*I*⊗

_{n}*D*. The condition is that

*r, s*)th element of

*AX*=0, where

*a*s and

_{ijr}Lemma 1 and Theorem 1 enable us to check whether a given model is identifiable. A toy example is provided below. Suppose there are 3 grid points that produce 2 time intervals. According to the theorem, the rank of *A* must be *A* is *PDP*^{T}+*σ ^{2}I_{N}* is non-identifiable. If we have one additional individual who has one phenotype measured not on any of the grid points, the model becomes identifiable since the rank of

*A*now increases to 7. If we do not have any additional individuals, we can avoid the identifiability issue simply by setting

*σ*

^{2}=0 and modeling

*D*directly.

### Prior specifications

For the random effects of the proposed Bayesian model, we employ the priors presented by Chen and Dunson [26]. Specifically, independent half normal priors are imposed on the diagonal elements of *Δ* and normal priors on the lower triangular elements of *Ψ*. For the fixed effects, we straightforwardly extend the priors presented in Yi et al. [17, 18].

#### Priors on *γ* and *λ*

Let *w _{a}* =

*P*(

*γ*= 1) be the inclusion probability of the

_{a}*a*th genetic effect. We assume that all inclusion probabilities are independent of each other and thus the prior of

*γ*is

*w*is pre-determined and can vary according to whether it corresponds to a main genetic effect, SNP-SNP interaction, or SNP-covariate interaction [17]. To specify a prior on

_{a}*λ*, we assume that the locations are again independent and uniformly distributed over all SNPs. For the number of SNPs (i.e.,

*p*), the prior distribution of genetic variant location

*λ*is therefore given by

#### Priors on b, *Δ*, and *Ψ*

In model (2), we let the distribution of each *b _{ij}* independently follow a standard normal distribution. Thus, the joint prior distribution of

*Δ*and

*Ψ*, we define two vectors

*δ*=(

*δ*∶

_{l}*l*=1, ...,

*k*)

^{T}and

*ψ*=(

*ψ*∶

_{ml}*m*=2, ...,

*k*;

*l*=1, ...,

*m*-1)

^{T}. The prior distribution for

*δ*is

*ψ*is

*ψ*

_{0}and

*R*

_{0}are pre-specified hyperparameters.

#### Priors on β, μ, and *σ*^{2}

The prior for the *a*th genetic effect is a normal distribution, *χ*^{2} distribution, *ν _{β}* controls the skewness of the prior for

*ν*=6) and the scale parameter

_{β}*V*be the total phenotypic variance and

*V*be the sample variance of the column of

_{a}*x*associated with

_{i}*β*. The heritability of the

_{a}*a*th genetic factor,

*h*, is therefore

_{a}*h*)= 0.1. The prior for the overall mean μ is given by

_{a}*σ*

^{2}is chosen as an scaled inverse

*χ*

^{2}distribution,

### Posterior calculation and MCMC algorithm

The joint posterior distribution is proportional to the product of the likelihood and the prior distributions of all unknown parameters, which can be expressed as

where *θ*=(*λ, β, b, δ, ψ, μ, σ*^{2})^{T}. To obtain MCMC samples of all parameters, we utilize the Metropolis-Hastings and Gibbs sampling algorithms, and alternately update each unknown parameter or set of unknown parameters conditional on all the other parameters and the observed data.

For *γ* and *λ*, we use the Metropolis-Hastings algorithm within Gibbs sampler since their conditional distributions have no known distributional forms. To update those parameters, we straightforwardly extend the Metropolis-Hastings algorithm proposed by Yi et al. [18] for our Bayesian model. These algorithms are described in the Supplementary Data 1. For the other parameters, we applied the Gibbs sampling algorithm. Specifically, since *b*, *δ*, and *ψ* have multivariate normal or half normal priors, the full conditional distributions are easy to derive by their conjugacy properties. The full conditional posterior distributions of *b*, *δ* and *ψ* are *θ*_{-f} represents all the elements of *θ* except *f*. The expressions for *b*^{*}, *ψ*^{*}, and *β*, *μ* and *σ*^{2} are *μ*^{*},

### Posterior analysis

The posterior samples can be used to approximate the posterior distribution of the parameters. MCMC samples from the initial iterations are discarded as “burn-in” and the subsequent samples are thinned by keeping every *c*th MCMC sample, where *c* is an integer, and discarding the rest. The posterior inclusion probability of each SNP can be calculated using its inclusion proportion in the MCMC samples as *κ _{l}* is

*l*th SNP position (

*l*=1, ...,

*h*) and

*T*is the total number of MCMC samples. With the prior

*l*th SNP (

*κ*) against exclusion of the

_{l}*l*th SNP as

The Bayes factor *BF*(*κ _{l}*) reflects how our belief in the importance of the

*l*th SNP changes as we move from prior knowledge to posterior information. Jeffreys [28] and Yandell et al. [29] suggest the following criteria for judging the significance of each SNP: weak support if

*BF*(

*κ*) falls between 3 and 10; moderate support if

_{l}*BF*(

*κ*)falls between 10 and 30; and strong support if

_{l}*BF*(

*κ*) is larger than 30.

_{l}### Choice of the number of grid points

A critical issue with the proposed Bayesian model is how to choose an optimal number of grid points, *k*. We achieve this goal by evaluating the goodness of the predictive distributions of our Bayesian models. Spiegelhalter et al. [30] proposed the DIC as *P _{D}*, is the effective number of parameters, which is defined as

*γ*and

*θ*. Since

*P*, and thus the predictive distribution from the DIC tends to overfit the data. To overcome the overfitting problem, Ando [32] developed the BPIC, which is defined as

_{D}### Implementation in gridbayes

The proposed grid-based Bayesian mixed models have been implemented in an R package named gridbayes [33], which is built on top of the R packages, qtl [34] and qtlbim [29]. The MCMC algorithm in C and the data manipulation procedure in R were modified for longitudinal analysis. The gridbayes package employs both DIC and simplified BPIC scores to select the optimal number of grid points. The software package and the source code are available for download at https://github.com/wonilchung/GridBayes.

## Results

### Simulation I

To evaluate the performance of the proposed method, we conducted the following simulations. We first used 400 individuals and 1, 000 SNPs from the 1000 Genome Project data. The number of measurements for each individual ranged from 3 to 7 and the total number of observations was set to 2, 000. Six different setups (Setups 1‒6) were considered. We simulated the datasets containing 10 causal SNPs, which are randomly selected (i.e., the proportion of causal SNPs = 1%) with only main effects (Setup 1). For individual *i*, the phenotype values were generated from the model: *x _{ia}* (a=1, ..., 10) were genotype values of the causal SNPs,

*c*

_{g1}is used to set trait-heritability to 40%,

*t*=(

_{i}*t*

_{i1}, ...,

*t*)

_{ini}^{T}were the time covariates generated from the uniform distribution

*U*[0, 1] and then standardized to have mean 0 and variance 1, and

*e*∼

_{i}*N*(0,

*σ*

^{2}

*I*). We set

_{ni}*σ*

^{2}=1. The true number of grid points was set to 3 (i.e., true

*k*=3), and

*p*was calculated from

_{i}*t*by the linear interpolation as we described in the Methods section. We set

_{i}*δ*=(

*δ*

_{1},

*δ*

_{2},

*δ*

_{3}) = (1, 1.2, 0.8) and

*ψ*=(

*ψ*

_{21},

*ψ*

_{31},

*ψ*

_{32})=(0.6, 0.4, 0.6). That is,

*ν*∼

_{i}*N*(0,

*D*) with

*diag*(

*D*)=(1, 1.96, 0.97) and the lower triangle elements (

*d*

_{21},

*d*

_{31},

*d*

_{32})=(0.72, 0.32, 0.81). The prior distributions for the elements in

*δ*were independent

*N*

^{+}(0, 30) and the prior distributions for the elements in

*ψ*were independent

*N*(0, 0.5). For each simulated dataset, the MCMC algorithm ran for 4 × 10

^{5}iterations after discarding the first 1, 000 burn-in iterations. The remaining samples were further thinned for every 40 iterations, yielding 10

^{4}MCMC samples for the posterior analysis.

To further investigate the Bayesian mixed model, we analyzed additional datasets containing two SNP-SNP interactions (Setup 2), five SNP-SNP interactions (Setup 3), two SNP-time interactions (Setup 4), five SNP-time interactions (Setup 5), or ten SNP-time interactions (Setup 6). Specifically, we simulated data according to the following models: *c _{gj}* (j = 1, ..., 6) were varied to ensure that trait-heritability to 40%. To display time-dependent SNP effects for Setups 4 and 5, we compared the time-dependent curves of averaged phenotype values for three different genotypes (0, 1, 2) at the first causal SNP (with no SNP-time interaction) and 10th one (with SNP-time interaction). Supplementary Fig. 1 clearly showed that the first causal SNP had only a main effect, but the 10th causal SNP interacted with time. We first conducted gridbayes [33] using all the data. For model comparisons, we then conducted qtlbim [29] in two ways: once on a subset of each simulated data, where only one measurement from each subject was randomly selected, and once with all the data by (incorrectly) assuming that all the measurements were independent. We named the two qtlbim analyses “qtlbim-sub” and “qtlbim-all, ” respectively.

The one-dimensional genome-wide profiles of 2*log*(*BF*) for the combined main, epistatic effects, and SNP-time interactions of each SNP under the six setups were presented in Figs. 1 and 2. The dashed vertical lines indicate the locations of the 10 causal SNPs. The gridbayes analysis of all the data and qtlbim-sub detected the causal SNPs reasonably well, but gridbayes clearly outperformed qtlbim-sub in general. The qtlbim-all method occasionally identified the true causal SNPs, but it produced far more false-positive findings than gridbayes and qtlbim-sub.

To evaluate the performance of our Bayesian model, we further calculated the receiver operating characteristic (ROC) curves. For each setup, we conducted 100 simulations. The ROC curves with a false-positive rate less than 0.2 are presented in Fig. 3. The solid lines represent the results of gridbayes, the dot-dashed lines correspond to qtlbim-sub and the results from qtlbim-all are summarized by the long-dashed lines. The ROC curves demonstrated that gridbayes with all measurements appeared to outperform the qtlbim analyses in terms of improved true positive rates.

To diagnose the convergence of the MCMC samples, we conducted 10 parallel chains with different, over-dispersed initial values with respect to the true posterior distribution. Using 10^{4} iterations, Geweke's Z-scores [35] for each chain based on the first 10% and last 50% of the samples indicated good convergence of all parameters. Based on 10 chains, Gelman and Rubin's potential scale reduction factors [36] were calculated, and the upper limits were less than 1.01 for all parameters. Supplementary Fig. 2 presents the trace plots of *σ*^{2}, *δ*_{1}, *δ*_{2}, *δ*_{3}, *ψ*_{21}, *ψ*_{31} and *ψ*_{32} for each setup, showing that all chains moved around the true values for all parameters, indicating good convergence. We plotted the marginal posterior and prior densities of all parameters based on 10, 000 random draws (Supplementary Fig. 3). It appeared that the random draws were approximately normal, with means close to the simulated values. Supplementary Fig. 4 displays the 95% highest posterior density (HPD) intervals for *σ*^{2}, *δ*_{1}, *δ*_{2}, *δ*_{3}, *ψ*_{21}, *ψ*_{31} and *ψ*_{32} for each setup. Most of the 95% HPD intervals contained the corresponding true values. Table 1 summarizes the posterior estimates of all parameters. The posterior means and medians were close to the true values and all the 95% HPD intervals contained the true values, demonstrating the good performance of our algorithm.

### Simulation II

We conducted another simulation to estimate the number of true grid points using the DIC [30] and simplified BPIC [32,37]. The settings were almost the same as those in the previous simulations, except that the true number of grid points now varied from 2 to 4 (i.e., true *k*=2, 3, 4). We simulated 100 datasets with 400 individuals and 1, 000 SNPs containing 10 causal SNPs (i.e., the proportion of causal SNPs = 1%) with only main effects. The causal SNPs were randomly assigned. The trait-heritability was set to 40%. The phenotype values were generated from the model: *x _{ia}* (

*a*=1, ..., 10) are genotypes of the causal SNPs and

*t*=(

_{i}*t*

_{i1}, ...,

*t*)

_{ini}^{T}are the time points of the

*i*th individual. We set (

*δ*

_{1},

*δ*

_{2},

*δ*

_{3},

*δ*

_{4})=(1, 1.2, 0.8, 0.7) and (

*ψ*

_{21},

*ψ*

_{31},

*ψ*

_{32},

*ψ*

_{41},

*ψ*

_{42},

*ψ*

_{43})=(0.6, 0.4, 0.6, 0.2, 0.4, 0.6). Table 2 shows the average DIC, simplified BPIC scores over 100 simulations, and the proportion of times that the number of true grid points was correctly selected. All average DIC and average BPIC scores achieved the minimums at the true grid point number, and the percentages correctly selecting the true number of true grid points were 79%, 91%, and 100% for setups with 2, 3, and 4 true grid points using the DIC, and 94%, 98%, and 93% using the simplified BPIC. This illustrated the usefulness of the DIC and simplified BPIC in selecting the true number of grid points.

### Simulation III

For a more detailed evaluation of our Bayesian method, we conducted the following simulations with 100 replications for each scenario. We first considered 400 individuals with three to seven time points, resulting in 2,000 observations, and decreased the sample size from 400 to 100 to assess the effect of sample size in ROC curves (Fig. 4A). The simulation data contained 1,000 SNPs with 1% causal SNPs (i.e., 10 causal SNPs) with only main effects. The trait values were generated from the model: *x _{ia}* are genotypes of the causal SNPs and

*t*=(

_{i}*t*

_{i1}, ..., t_(i

*n*))

_{i}^{T}are the time points of the

*i*th individual. As in the previous simulations, we set the number of grid points to

*k*= 3 and

*σ*

^{2}= 1,

*δ*=(

*δ*

_{1},

*δ*

_{2},

*δ*

_{3}) = (1, 1.2, 0.8),

*ψ*=(

*ψ*

_{21},

*ψ*

_{31},

*ψ*

_{32})=(0.6, 0.4, 0.6). The trait-heritability was set to 40%. As the sample size decreased from 400 to 100, the true positive rates decreased in ROC curves, indicating that including more samples increased the true positive rates with fixed false positive rates. Next, we evaluated the effect of the number of SNPs (Fig. 4B). The simulation data were generated with 400 individuals, 1% causal SNPs, and 40% trait-heritability. As the number of SNPs increased from 1,000 to 5,000 (i.e., the corresponding number of causal SNPs increased from 10 to 50), the true positive rates decreased, meaning that the inclusion of more SNPs deceased the true positive rates. We then examined the effect of the proportion of causal SNPs (Fig. 4C). The sample size and number of SNPs were fixed to 400 and 1,000, and the trait-heritability was set to 40%. The true positive rates decreased as the proportion of causal SNPs increased from 1% to 5% (i.e., the corresponding number of causal SNPs increased from 10 to 50) because per-SNP heritability—or the average proportion of phenotypic variation explained by a single SNP—decreased as the proportion of causal SNPs increased while keeping trait-heritability constant. Lastly, to demonstrate the effect of trait-heritability, we considered a setting where the sample size, number of SNPs, and proportion of causal SNPs were 400, 1,000, and 1%, respectively. We then changed trait-heritability from 40% to 10% in Fig. 4D. The true-positive rates decreased as trait-heritability decreased, showing that larger heritability increased the true positive rates. Supplementary Tables 1, 2, 3, and 4 summarize the posterior means, medians, standard deviations and 95% HPD intervals of all parameters in the simulations for sample size, number of SNPs, proportion of causal SNPs, and heritability, respectively. The posterior means and medians were close to the true values, and all the 95% HPD intervals contained the true values, indicating that our Bayesian method performed well. Supplementary Table 5 showed the average DIC and simplified BPIC scores over 100 replications for all simulations. Table 3 summarizes the simulation settings for all simulation setups based on genetic effect terms, the number of grid points, sample size, number of observations, number of SNPs, number of causal SNPs, and trait-heritability.

## Discussion

We developed a grid-based Bayesian mixed model for longitudinal genetic data with a built-in variable selection feature. The proposed Bayesian method modeled multiple candidate SNPs simultaneously and allowed SNP-SNP and SNP-time interactions, which enabled us to identify SNPs with time-varying effects. Such SNPs are of great scientific and medical interest. In addition, we proposed a new grid-based method to model the covariance structure nonparametrically. Not only is the proposed method parsimonious in estimating the covariance matrix, but also by employing a reasonable number of grid-points, it can flexibly approximate any type of covariance structure. The number of grid points was pre-set, but DIC and simplified BPIC can be used to select the optimal number.

The simulation studies showed that the proposed Bayesian method using all time points outperformed the ordinary Bayesian method with one or all time points included. As expected, the proposed method that utilized the full data was more powerful than the corresponding univariate analysis method that only used a subset of the data. Furthermore, the proposed Bayesian method performed better than the ordinary Bayesian method because our method modeled the within-subject correlation. Further simulation studies showed that statistical power increased as the data had more samples, a smaller number of SNPs, a lower proportion of causal SNPs, and larger trait-heritability. For our simulation studies, we utilized data from the 1000 Genome Project. With only 400 independent samples of EAS ancestry, we restricted out analysis with up to 5, 000 SNPs. With a sufficient sample size, our method can be applied to all available SNPs. We are currently developing a parallel computing algorithm based on the message passing interface to execute multiple groups of SNPs simultaneously. This will make it feasible to apply our method to large-sample GWAS data.

Another important issue to mention is Bayesian model identifiability. In the Bayesian community, there is a wide diversity of views on the identifiability issue. Lindley [38] remarked that non-identifiability causes no real difficulty in Bayesian approaches. Poirier [39] and Eberly and Carlin [40] argued that a Bayesian analysis of a non-identifiable model is always possible if priors on all of the parameters are proper, since proper priors yield proper posterior distributions, and hence every parameter can be well-estimated. However, if the priors imposed on any non-identifiable model are not proper, or too close to being improper, ill-behaved posterior distributions may be generated such that the trajectory of the parameters can drift to extreme values, as demonstrated by Gelfand and Sahu [41]. In this paper, we investigated the identifiability of our Bayesian model, which motivated us to utilize only proper priors (see the Methods section). Non-identifiability occurred when the number of the grid points equaled the number of observed time points (see Supplementary Data 1), but we found that the posterior distribution behaved well due to the proper priors employed.

## Notes

**Authors’ Contribution**

Conceptualization: WC. Data curation: WC, YC. Formal analysis: WC, YC. Funding acquisition: WC. Methodology: WC. Writing - original draft: WC, YC. Writing - review & editing: WC.

**Conflicts of Interest**

No potential conflict of interest relevant to this article was reported.

## Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (2020R1C1C1A01012657) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1A6A1A10044154). This work was supported by the Soongsil University Research Fund.

## Supplementary Materials

Supplementary data can be found with this article online at http://www.genominfo.org.