Predicting individual traits and diseases from genetic variants is critical to fulfilling the promise of personalized medicine. The genetic variants from genome-wide association studies (GWAS), including variants well below GWAS significance, can be aggregated into highly significant predictions across a wide range of complex traits and diseases. The recent arrival of large-sample public biobanks enables highly accurate polygenic predictions based on genetic variants across the whole genome. Various statistical methodologies and diverse computational tools have been introduced and developed to compute the polygenic risk score (PRS) more accurately. However, many researchers utilize PRS tools without a thorough understanding of the underlying model and how to specify the parameters for the best performance. It is advantageous to study the statistical models implemented in computational tools for PRS estimation and the formulas of parameters to be specified. Here, we review a variety of recent statistical methodologies and computational tools for PRS computation.

Accurately predicting complex traits and diseases (e.g., type 2 diabetes, cancer, and asthma) based on an individual’s genetic variants is crucial for effective disease prevention and personalized treatment [

For accurate PRS estimation, various statistical methodologies have been proposed and diverse computational tools have been developed, such as PLINK (

Here, we review various statistical methodologies and computational tools for PRS computation. First, we review summary-based PRS methods with a few published SNPs or whole SNPs from large-sample GWAS using PLINK [

A study of schizophrenia showed that the PRS achieved significantly better prediction in validation samples than random scores, and was far more accurate than scores based on the individual GWAS loci discovered in the study [

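Let x_{ij} denote the genotype (the number of effect alleles) of individual i at SNP j, and let \hat{\beta}_{j} denote the effect size of SNP j estimated by GWAS. The PRS of individual i is the weighted sum PRS_{i} = \sum_{j} \hat{\beta}_{j} x_{ij} over the published SNPs. For a quantitative trait, the PRS is evaluated by the prediction R^{2}, which is defined as the square of the correlation between the true and predicted phenotypic values. For a case-control trait (e.g., type 2 diabetes or cancer), the PRS is evaluated by the area under the curve (AUC) and by R^{2} on the liability scale. The prediction R^{2} is bounded by the heritability explained by GWAS-significant SNPs.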
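The scoring and evaluation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular PRS tool: the function names (`polygenic_score`, `prediction_r2`, `auc`) and the toy genotypes and effect sizes are hypothetical, and covariate adjustment and the liability-scale transformation are left out.

```python
import numpy as np

def polygenic_score(genotypes, betas):
    """PRS_i = sum_j beta_j * x_ij for every individual i."""
    return np.asarray(genotypes, dtype=float) @ np.asarray(betas, dtype=float)

def prediction_r2(phenotype, score):
    """Squared Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(phenotype, score)[0, 1] ** 2)

def auc(is_case, score):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    is_case = np.asarray(is_case, dtype=bool)
    score = np.asarray(score, dtype=float)
    diff = score[is_case][:, None] - score[~is_case][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical toy data: 4 individuals, 3 SNPs coded as 0/1/2 allele counts.
X = np.array([[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 0]])
beta_hat = np.array([0.5, -0.2, 0.1])
prs = polygenic_score(X, beta_hat)
print(prs)
```

The pairwise AUC above is quadratic in the sample size, which is fine for illustration; a rank-based computation would be preferable for biobank-scale data.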

The prediction R^{2} of published SNPs depends on the genetic architecture of the phenotypes. Under an infinitesimal genetic architecture, all SNPs are causal with relatively small effect sizes, and thus the associated SNPs identified by GWAS explain a small amount of genetic variance, achieving poor prediction R^{2}. For example, the narrow-sense heritability (h^{2}) for BMI is h^{2} = 0.4-0.6, but the heritability explained by GWAS-significant SNPs with >300K samples is 0.027 at most. Under a non-infinitesimal genetic architecture, in which a smaller number of SNPs have relatively large effects, GWAS-significant SNPs explain a larger share of the genetic variance, achieving a higher prediction R^{2}. For example, the narrow-sense heritability of type 1 diabetes is estimated as h^{2} = 0.9, but the heritability explained by GWAS-significant SNPs is 0.6 at most.

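Polygenic risk prediction can be performed using all SNPs from GWAS, not only GWAS-significant SNPs. The PRS can be estimated as PRS_{i} = \sum_{j=1}^{M} \hat{\beta}_{j} x_{ij}, where \hat{\beta}_{j} is the GWAS effect-size estimate of SNP j, x_{ij} is the genotype of individual i at SNP j, and the sum runs over all M SNPs. In this case, the prediction R^{2} is bounded by the heritability explained by the genotyped SNPs.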

In order to utilize all SNPs to compute the PRS, there are two main considerations: (1) the non-infinitesimal genetic architecture of the phenotype, and (2) the LD structure of the genotype data. The standard heuristic approach to account for a non-infinitesimal architecture is p-value thresholding, which only considers SNPs with a p-value (P) less than a threshold (P_{T}). The best P_{T} is the threshold that achieves the highest prediction accuracy in validation samples. In the absence of an independent validation sample, the data can be divided into training and validation sets, and the threshold selection process is repeated over different partitions of the samples by k-fold cross-validation. The standard heuristic approach to account for LD structure is LD pruning or LD clumping. LD pruning removes one SNP of each pair of linked SNPs based only on the genotypic correlation (r^{2}), without regard to the phenotype, while LD clumping removes the SNP with the less significant p-value for the phenotype from each pair of linked SNPs. Both approaches also require optimization of the r^{2} threshold in validation samples. P-value thresholding and LD pruning are widely used for PRS computation, but these heuristics do not achieve maximum prediction accuracy.
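The clumping and thresholding heuristics can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: the clumping is a greedy pass over the full r^{2} matrix with no physical-distance window, all function names and toy data are hypothetical, and it does not reproduce the exact behavior of any existing tool.

```python
import numpy as np

def clump(genotypes, pvalues, r2_threshold=0.1):
    """Greedy LD clumping: visit SNPs from most to least significant and keep a
    SNP only if its r^2 with every SNP kept so far is below the threshold."""
    r2 = np.corrcoef(np.asarray(genotypes, dtype=float), rowvar=False) ** 2
    kept = []
    for j in np.argsort(pvalues):
        if all(r2[j, k] < r2_threshold for k in kept):
            kept.append(int(j))
    return sorted(kept)

def ct_score(genotypes, betas, pvalues, p_threshold, r2_threshold=0.1):
    """Score that sums beta_j * x_ij over clumped SNPs with P < P_T."""
    genotypes = np.asarray(genotypes, dtype=float)
    betas = np.asarray(betas, dtype=float)
    selected = [j for j in clump(genotypes, pvalues, r2_threshold)
                if pvalues[j] < p_threshold]
    return genotypes[:, selected] @ betas[selected]

def select_p_threshold(genotypes, betas, pvalues, phenotype, thresholds):
    """Return the P_T whose score has the highest R^2 in the validation sample."""
    def r2(score):
        if np.std(score) == 0:      # no SNP passed the threshold
            return 0.0
        return np.corrcoef(phenotype, score)[0, 1] ** 2
    return max(thresholds, key=lambda pt: r2(ct_score(genotypes, betas, pvalues, pt)))

# Hypothetical validation data: SNP 1 duplicates SNP 0, SNP 2 is independent.
snp0 = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
snp2 = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
G = np.column_stack([snp0, snp0, snp2])
beta_hat = np.array([0.3, 0.3, 0.1])
pvals = np.array([1e-8, 1e-3, 0.04])
y_val = 0.3 * snp0 + 0.1 * snp2
best = select_p_threshold(G, beta_hat, pvals, y_val, [1e-5, 0.01, 0.05])
print(clump(G, pvals), best)
```

In this toy example the duplicated SNP is removed by clumping, and the most lenient threshold is selected because it admits the second independent SNP. With real data the same selection would be repeated across cross-validation folds.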

Popular genetic tools, such as PLINK, implement the clumping and thresholding (C+T) approach. For LD clumping, linked SNPs are removed based on a given r^{2} threshold (i.e., r^{2} < 0.1). For p-value thresholding, the SNPs with p-values less than a provided threshold (P_{T}) are extracted, and candidate PRSs corresponding to the thresholds are then created with the PLINK options and compared by their prediction R^{2}. To automate the C+T approach in PLINK, we can utilize PRSice and PRSice-2, which are run from R to compute and evaluate the PRS. PRSice and PRSice-2 are popular, efficient, and scalable tools for automating and simplifying PRS computation on large-scale GWAS data. They handle imputed as well as genotyped data and can evaluate a large number of continuous and binary phenotypes simultaneously. Similar to PLINK, they require summary data as well as phenotype, covariate, and genotype data for the target samples, and they automate the standard C+T procedure, which relies on PLINK options for PRS analysis.

A critical issue in estimating the PRS is the LD structure between SNPs, which has been heuristically addressed by LD pruning and LD clumping. Recently, LDpred was developed as a more sophisticated method that also utilizes summary statistics [

LDpred is a Bayesian method for polygenic prediction based on an LD matrix and GWAS summary statistics, and it is a popular tool for deriving the PRS.

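In the special case of no LD between SNPs, the posterior mean can be computed analytically. Under a Gaussian infinitesimal prior, the posterior mean effect size of each SNP is its marginal GWAS estimate multiplied by a single shrinkage factor that depends on the heritability, the number of SNPs, and the GWAS sample size.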
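As a hedged illustration of the no-LD case, the sketch below uses the shrinkage factor h^{2} / (h^{2} + M/N) that is commonly quoted for the infinitesimal model, where M is the number of SNPs and N is the GWAS sample size. The symbols, the function name `shrink_no_ld`, and the toy numbers are assumptions introduced here, and the effect sizes are assumed to be on the standardized genotype and phenotype scale.

```python
import numpy as np

def shrink_no_ld(beta_hat, h2, n_samples):
    """Posterior mean under a Gaussian infinitesimal prior without LD:
    every marginal estimate is scaled by h2 / (h2 + M / N)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    m = beta_hat.size                      # M: number of SNPs
    return h2 / (h2 + m / n_samples) * beta_hat

# Hypothetical example: M = 4 SNPs, N = 8 samples, h2 = 0.5 gives a factor of 0.5.
beta_hat = np.array([0.02, -0.01, 0.0, 0.03])
shrunk = shrink_no_ld(beta_hat, h2=0.5, n_samples=8)
print(shrunk)
```

The factor approaches 1 as N grows relative to M, so large GWAS leave the marginal estimates almost unchanged, while small GWAS shrink them strongly toward zero.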

In the case of LD between SNPs, the posterior means can be computed analytically only with an infinitesimal prior. Under a Gaussian infinitesimal prior, the posterior mean effect sizes are derived jointly from the marginal GWAS estimates and the LD matrix of the SNPs.
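A minimal sketch of the LD-aware infinitesimal case follows, assuming the commonly quoted approximation in which the marginal estimates are multiplied by the inverse of the LD matrix plus a ridge term M / (N h^{2}). The function name `shrink_with_ld`, the `n_snps` argument, and the toy LD matrix are hypothetical, and a single LD block is used here rather than sliding windows along the genome.

```python
import numpy as np

def shrink_with_ld(beta_hat, ld, h2, n_samples, n_snps=None):
    """Solve (LD + M / (N * h2) * I) b = beta_hat for the joint effect sizes b.
    n_snps is the genome-wide SNP count M (defaults to the block size)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    ld = np.asarray(ld, dtype=float)
    m = beta_hat.size if n_snps is None else n_snps
    ridge = m / (n_samples * h2)
    return np.linalg.solve(ld + ridge * np.eye(beta_hat.size), beta_hat)

# Hypothetical block of two SNPs in strong LD (r = 0.8).
ld = np.array([[1.0, 0.8], [0.8, 1.0]])
beta_hat = np.array([0.05, 0.04])
joint = shrink_with_ld(beta_hat, ld, h2=0.5, n_samples=1000)
print(joint)
```

With an identity LD matrix the solution reduces to the per-SNP factor h^{2} / (h^{2} + M/N), which is a convenient sanity check on the formula.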

The procedure for computing the PRS using LDpred consists of three steps: (1) synchronizing the genotype and summary data, (2) generating LDpred SNP weights, and (3) generating the individual PRS. The first step synchronizes genotype and summary statistics and then generates the coordinated genotype data with the ‘

The construction of a genome-wide PRS using LDpred requires summary statistics from existing large-scale GWAS studies (e.g., the UK Biobank [

Recently, LDpred-2, a new version of LDpred, was developed to improve predictive performance compared to LDpred [

An alternative to summary-based approaches is to fit the effect sizes of all SNPs simultaneously using best linear unbiased prediction (BLUP) models, which is a more traditional approach for computing the PRS. When individual-level data are available, fitting all SNPs simultaneously can be more appropriate than summary-based approaches and can produce more accurate predictors.

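Genomic BLUP (GBLUP) methods utilize individual-level GWAS data, not summary statistics, to estimate SNP effects using linear mixed models (LMMs). The GBLUP model treats the total genetic value of each individual as a random effect whose covariance is proportional to the genomic relationship matrix (GRM), and the genetic values of new individuals are predicted from their genomic relationships with the training samples.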
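The GBLUP prediction can be sketched with the textbook form of the model, y = g + e with g ~ N(0, A \sigma^{2}_{g}) and e ~ N(0, I \sigma^{2}_{e}), where A is the GRM. The sketch assumes centered phenotypes, genotypes that are already standardized, and a known heritability h2 (in practice the variance components would be estimated first); the function names and simulated data are hypothetical.

```python
import numpy as np

def grm(x_a, x_b):
    """Genomic relationships between two sets of individuals (standardized genotypes)."""
    return x_a @ x_b.T / x_a.shape[1]

def gblup_predict(x_train, y_train, x_new, h2):
    """BLUP of genetic values for new individuals:
    g_new = A_new,train (A_train + lam * I)^-1 y, with lam = sigma_e^2 / sigma_g^2."""
    lam = (1.0 - h2) / h2
    a_train = grm(x_train, x_train)
    alpha = np.linalg.solve(a_train + lam * np.eye(len(y_train)), y_train)
    return grm(x_new, x_train) @ alpha

# Hypothetical simulated data: 30 training and 5 new individuals, 10 SNPs.
rng = np.random.default_rng(0)
x_train = rng.standard_normal((30, 10))
x_new = rng.standard_normal((5, 10))
y_train = x_train @ rng.normal(0, 0.3, 10) + rng.standard_normal(30)
y_train = y_train - y_train.mean()
pred = gblup_predict(x_train, y_train, x_new, h2=0.5)
print(pred)
```

The same predictions are obtained from ridge regression on the SNPs with penalty M * lam, which is why GBLUP and ridge regression BLUP are usually treated as equivalent formulations.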

The GBLUP models require individual-level genotype and phenotype data for training, but such data are not always available. Instead, summary-statistics BLUP (SBLUP) models can be utilized by approximating individual-level genotype and phenotype data using summary statistics and a reference panel.

GCTA software was initially designed to estimate SNP-based heritability and has been extended for many other genetic analyses including GBLUP and SBLUP. For GBLUP analysis, the GRM (

BMR methods extend the standard LMM by including an alternative prior for SNP effects, further improving prediction accuracy [

The BMR model, BayesR [_{1},…,_{C}

The BayesR model with individual-level data was extended to utilize summary statistics from GWAS in SBayesR.

GCTB (Genome-wide Complex Trait Bayesian Analysis,

GBLUP-based methods implicitly assume an infinitesimal genetic architecture, whereas in reality complex traits or diseases are estimated to have roughly only a few thousand causal SNPs in the genome [

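Penalized regression methods such as the lasso, ridge regression, and the elastic net estimate the effects of all SNPs jointly under a penalty on the effect sizes. The lasso penalizes the L_{1} norm of the effect-size vector \beta, ridge regression penalizes the squared L_{2} norm of \beta, and the elastic net combines both penalties.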
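To make the penalties concrete, the following sketch fits an elastic net by cyclic coordinate descent on individual-level data. The objective is written in the parameterization (1/2n)||y - X\beta||^{2} + \lambda(\alpha||\beta||_{1} + (1-\alpha)/2 ||\beta||_{2}^{2}), which is an assumption chosen for illustration; the function names and simulated data are hypothetical, and no intercept or convergence check is included.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it to zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net(x, y, lam, alpha=1.0, n_sweeps=200):
    """Cyclic coordinate descent; alpha=1 is the lasso, alpha=0 is ridge."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    b = np.zeros(p)
    col_ss = (x ** 2).sum(axis=0) / n
    resid = np.asarray(y, dtype=float).copy()
    for _ in range(n_sweeps):
        for j in range(p):
            resid += x[:, j] * b[j]            # remove SNP j from the current fit
            rho = x[:, j] @ resid / n
            b[j] = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1 - alpha))
            resid -= x[:, j] * b[j]            # add the updated SNP j back
    return b

# Hypothetical simulated data: 40 individuals, 5 SNPs, two of them causal.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
y = X @ np.array([0.5, 0.0, 0.0, -0.3, 0.0]) + 0.1 * rng.standard_normal(40)
b_ridge = elastic_net(X, y, lam=0.1, alpha=0.0)
b_lasso = elastic_net(X, y, lam=0.1, alpha=1.0)
print(b_ridge, b_lasso)
```

The ridge case has a closed-form solution that the iteration converges to, whereas the lasso case sets small coefficients exactly to zero, which is the source of the sparse solutions mentioned above.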

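Lassosum is a method for computing lasso or elastic net estimates using GWAS summary statistics and an LD reference panel. It replaces the individual-level quantities X^{T}y and X^{T}X in the penalized regression objective with the SNP-wise correlations obtained from GWAS summary statistics and the LD matrix estimated from the reference genotypes X_{r}, respectively.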
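A lassosum-style estimate can be sketched as coordinate descent on a summary-statistic objective. The objective assumed here is \beta^{T}A\beta - 2\beta^{T}r + 2\lambda||\beta||_{1} with A = (1-s)R + sI, where r is the vector of SNP-phenotype correlations, R is the reference LD matrix, and s is a shrinkage parameter. The function name `lassosum_like` and the toy inputs are hypothetical, and this is not the code of the lassosum package.

```python
import numpy as np

def lassosum_like(r, ld, lam, s=0.5, n_sweeps=200):
    """Coordinate descent for b'Ab - 2 b'r + 2 * lam * ||b||_1,
    with A = (1 - s) * ld + s * I built from a reference LD matrix."""
    r = np.asarray(r, dtype=float)
    a = (1.0 - s) * np.asarray(ld, dtype=float) + s * np.eye(r.size)
    b = np.zeros(r.size)
    for _ in range(n_sweeps):
        for j in range(r.size):
            z = r[j] - a[j] @ b + a[j, j] * b[j]   # excludes SNP j's own term
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / a[j, j]
    return b

# Hypothetical summary data: three SNPs with no LD in the reference panel.
r = np.array([0.3, -0.05, 0.12])
b_indep = lassosum_like(r, np.eye(3), lam=0.1)

# Two SNPs in LD (r = 0.5) with the penalty switched off.
ld2 = np.array([[1.0, 0.5], [0.5, 1.0]])
b_joint = lassosum_like([0.2, 0.1], ld2, lam=0.0, s=0.2)
print(b_indep, b_joint)
```

Without LD the estimate is simply the soft-thresholded correlation vector, and without the L_{1} penalty it is the solution of the regularized linear system A b = r, so the two toy calls bracket the behavior of the full method.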

The most popular tool for lasso, ridge, and elastic net regression is ‘glmnet’ in R (

Recent studies have shown that GWAS of related phenotypes further improve the accuracy of polygenic predictions [

In order to utilize multiple traits to improve prediction accuracy, the ridge regression BLUP (RRBLUP) and GBLUP methods are extended to a bivariate ridge regression method.

The individual BLUP in a validation sample _{new}_{new}

The wMT-SBLUP [

To utilize multiple traits for PRS computation, the cross-trait penalized regression (CTPR) method was developed. It combines two penalty terms: the first, with tuning parameter \lambda_{1}, uses the lasso or the minimax concave penalty to induce a sparse solution, and the second, with tuning parameter \lambda_{2}, incorporates shared genetic effects across multiple traits for large-sample GWAS data. The second penalty induces smoothness of the coefficients and can incorporate prior knowledge on the similarity of a pair of traits at a given SNP via adjacency coefficients. The method also incorporates multiple secondary traits based on individual-level genotypes and/or summary statistics. The PRS in target samples is then computed from the target genotypes and the estimated effect sizes.

Genetic risk prediction in diverse populations currently lags far behind risk prediction in European samples [

We have reviewed statistical models and computational tools for PRS computation. We have demonstrated a variety of statistical models for genomic risk prediction using individual-level data and/or summary statistics and showed how to improve prediction accuracy with multiple traits and multiple populations. Furthermore, we have introduced recent computational tools to conduct PRS analyses based on the statistical models, and explained how to specify the parameters and how to execute the software in detail. We also summarized which statistical models and software are best for specific situations based on data type (GWAS summary statistics or individual-level GWAS data), sample size, the LD reference panel, the number of traits, and the number of ethnicities, as shown in

Despite the existence of various PRS methods, there are some areas in which further research on PRS is required. To improve prediction accuracy, we need novel statistical models and software that leverage information from multiple disease outcomes and multiple ethnicities based on individual-level genotype data and/or summary statistics from large-scale biobanks. It is also necessary to develop methods with the ability to predict diverse disease traits, such as cardiovascular disease and type 2 diabetes, with sufficient accuracy (to the extent allowable by disease heritability), and then these models need to be extended to utilize multiple ethnicities by incorporating information on LD to further improve prediction accuracy.

Moreover, with advances in high-throughput molecular assays (e.g., RNA-seq and ChIP-seq), it has been shown that disease risk SNPs are enriched in a broad array of functional regions, including regulatory features that are often tissue-specific, providing a novel source of information for improved prediction accuracy. It has been further shown that these molecular features can be predicted from genetic variants, enabling the prediction of gene expression in GWAS cohorts to perform transcriptome-wide association studies and to identify putative susceptibility genes. The accurate prediction of individual molecular features is now an emerging tool for discovering novel disease loci and characterizing biological mechanisms at the thousands of GWAS loci that have already been published. Data collection efforts of an unprecedented scale are now being seen in the areas of functional genomics and disease genetics. Such datasets can help to prioritize causal features and further improve prediction accuracy.

We conclude by emphasizing the importance of creating accurate PRSs for a wide range of complex traits and diseases. The PRS provides an estimate of genetic predisposition (also called genetic susceptibility) for a complex trait or disease at the individual level, which refers to the likelihood of developing a particular trait or disease based on a genotype profile. The goal of PRS analysis is to identify individuals at an elevated risk of diseases on the basis of genetic variants in combination with clinical covariates. Therefore, the more accurate the PRS, the better we can identify disease risk and provide treatment and prevention strategies. Personalized medicine based on accurate PRSs will have a considerable impact on the treatment process and quality of life in the near future.

Conflicts of Interest

No potential conflict of interest relevant to this article was reported.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (NRF-2020R1C1C1A01012657) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1A6A1A10044154). This work was supported by Soongsil University Research Fund.

Best statistical models and software based on data type, sample size, LD reference panel, and the number of traits and ethnicities. BLUP, best linear unbiased prediction; CTPR, cross-trait penalized regression; GBLUP, genomic BLUP; GWAS, genome-wide association studies; LD, linkage disequilibrium; MTGBLUP, multi-trait GBLUP; SBLUP, summary-statistics BLUP; wMT-SBLUP, weighted multi-trait SBLUP.

List of PRS methods, underlying statistical models, computational tools, and required data

Trait/Ethnicity | Method | Statistical model | Computational tool | Required data |
---|---|---|---|---|
Single trait, single ethnicity | PRS | Linear model | PLINK, PRSice, PRSice-2 | Summary data |
Single trait, single ethnicity | LDpred | Bayesian model | LDpred, LDpred-2 | Summary data |
Single trait, single ethnicity | GBLUP | LMM | GCTA | Individual data |
Single trait, single ethnicity | SBLUP | LMM | GCTA | Summary data |
Single trait, single ethnicity | BayesR | Bayesian model | GCTB | Individual data |
Single trait, single ethnicity | SBayesR | Bayesian model | GCTB | Summary data |
Single trait, single ethnicity | Penalized regression | Penalized regression | glmnet | Individual data |
Single trait, single ethnicity | Lassosum | Penalized regression | lassosum | Summary data |
Multiple traits, single ethnicity | MTGBLUP | Multivariate LMM | MTG | Individual data |
Multiple traits, single ethnicity | wMT-SBLUP | Multivariate LMM | wMT-SBLUP | Summary data |
Multiple traits, single ethnicity | CTPR | Multivariate penalized regression | CTPR | Individual data |
Single trait, multiple ethnicities | XP-BLUP | Two-component LMM | XP-BLUP | Individual data |
Single trait, multiple ethnicities | Multi-ethnic PRS | Linear mixture approaches | multi-ethnic PRS | Summary data |
Single trait, multiple ethnicities | Multi-ancestry PRS | Linear mixture approaches | multi-ancestry PRS | Summary data |

PRS, polygenic risk score; BLUP, best linear unbiased prediction; GBLUP, genomic BLUP; LMM, linear mixed model; GCTA, Genome-wide Complex Trait Analysis; SBLUP, summary-statistics BLUP; GCTB, Genome-wide Complex Trait Bayesian Analysis; MTGBLUP, multi-trait GBLUP; wMT-SBLUP, weighted multi-trait SBLUP; CTPR, cross-trait penalized regression.