### Introduction

### Polygenic Risk Prediction

### Use of a few published SNPs

Let $\hat{\beta}_j$ denote the estimated effect size of SNP $j$ (i.e., a GWAS-significant SNP from previous GWAS), $x_{ij}$ denote the genotype for SNP $j$ of individual $i$, and $y_i$ denote the phenotype of individual $i$. The predicted phenotype for individual $i$ can be simply computed as

$$\hat{y}_i = \sum_{j} x_{ij}\hat{\beta}_j.$$

For a quantitative trait, the PRS is evaluated by the prediction $R^2$, which is defined as the square of the correlation between the true and predicted phenotypic values. For a case-control trait (e.g., type 2 diabetes or cancer), the PRS is evaluated by the area under the curve (AUC) [29], pseudo-$R^2$, and $R^2$ on the liability scale [30]. The prediction $R^2$ with GWAS-significant SNPs is bounded by the heritability explained by those SNPs [31,32].

The prediction $R^2$ of published SNPs depends on the genetic architecture of the phenotypes. Under an infinitesimal genetic architecture, all SNPs are causal with relatively small effect sizes, and thus the associated SNPs identified by GWAS explain a small amount of genetic variance, achieving poor prediction $R^2$. For example, the narrow-sense heritability ($h^2$) for BMI is $h^2 = 0.4$ to $0.6$, but the GWAS-significant SNPs identified with >300K samples yield an $R^2$ of 0.027 at most [2,33]. In contrast, under a non-infinitesimal genetic architecture, only a subset of SNPs has moderate to large effects whereas most SNPs have zero effects; thus, the associated SNPs identified by GWAS explain more genetic variance, yielding higher prediction $R^2$. For example, the narrow-sense heritability of type 1 diabetes is estimated as $h^2 = 0.9$, and the GWAS-significant SNPs yield an $R^2$ of 0.6 at most [2,34].
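The weighted-sum score and its evaluation by the squared correlation can be sketched in a few lines. This is a minimal illustration only: the genotypes, effect sizes, and phenotypes below are made-up toy values, and the function names `polygenic_score` and `prediction_r2` are hypothetical.

```python
def polygenic_score(genotypes, betas):
    """Return one score per individual: the sum of genotype times effect size."""
    return [sum(x * b for x, b in zip(row, betas)) for row in genotypes]


def prediction_r2(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Toy data (hypothetical): 4 individuals, 3 SNPs coded as 0/1/2 allele counts.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
betas = [0.5, -0.2, 0.1]             # assumed published effect sizes
phenotypes = [0.1, 0.2, 1.2, -0.3]   # assumed observed trait values

scores = polygenic_score(genotypes, betas)
r2 = prediction_r2(phenotypes, scores)
```

For a case-control trait the same scores would instead be passed to an AUC or pseudo-$R^2$ computation, which is not shown here.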

### Use of all SNPs from GWAS studies

The prediction $R^2$ is bounded by the heritability explained by the genotyped SNPs, and the expected $R^2$ depends on $M$, the total number of SNPs, and $N$, the number of individuals [31,32]. A standard approach is p-value thresholding ($P_T$), which only considers SNPs with a p-value ($P$) less than the threshold ($P_T$). The best $P_T$ is the threshold that achieves the best prediction accuracy in validation samples. In the absence of an independent validation sample, the data can be divided into training and validation sets, and the threshold selection process is repeated over different partitions of the samples by k-fold cross-validation.

The standard heuristic approaches to account for LD structure are LD pruning and LD clumping. LD pruning randomly removes one of each pair of linked SNPs based on the genotypic correlation ($r^2$), while LD clumping removes the SNP with the less significant p-value for the phenotype among pairs of linked SNPs. Both approaches also require optimization of the $r^2$ threshold in validation samples. P-value thresholding and LD pruning are widely used for PRS computation, but these approaches do not achieve maximum prediction accuracy.
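As a rough illustration of how clumping and thresholding interact, the sketch below applies a p-value threshold and then a greedy clumping pass. It is a simplification under stated assumptions: SNPs are compared by the squared Pearson correlation of their genotype columns, no physical distance window is applied, and the function names and toy data are hypothetical.

```python
def pearson_r2(a, b):
    """Squared Pearson correlation of two genotype columns (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov * cov / (var_a * var_b)


def clump_and_threshold(snp_columns, pvalues, p_threshold, r2_threshold):
    """Keep SNPs with P < p_threshold, visiting them from most to least
    significant and dropping any SNP whose r2 with an already kept SNP
    reaches r2_threshold."""
    candidates = sorted(
        (j for j, p in enumerate(pvalues) if p < p_threshold),
        key=lambda j: pvalues[j],
    )
    kept = []
    for j in candidates:
        if all(pearson_r2(snp_columns[j], snp_columns[k]) < r2_threshold
               for k in kept):
            kept.append(j)
    return kept


# Toy data (hypothetical): one list of allele counts per SNP, 6 individuals.
snp_columns = [
    [0, 1, 2, 1, 0, 2],  # SNP 0: most significant
    [0, 1, 2, 1, 0, 1],  # SNP 1: strongly correlated with SNP 0
    [1, 0, 1, 2, 1, 1],  # SNP 2: uncorrelated with SNP 0
    [1, 1, 0, 2, 0, 1],  # SNP 3: not significant
]
pvalues = [1e-8, 1e-5, 1e-3, 0.5]

kept_lenient = clump_and_threshold(snp_columns, pvalues, 0.05, 0.1)
kept_strict = clump_and_threshold(snp_columns, pvalues, 1e-4, 0.1)
```

In practice both thresholds would be tuned in validation samples, as described above.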

### PRS tools

In PLINK, LD clumping is performed with options (e.g., *--clump-r2 0.1 --clump-kb 250*) that form clumps of all SNPs that are within a certain distance (in kilobases [kb]) of the index SNP (e.g., 250 kb) and that are in LD with the index SNP based on the $r^2$ threshold (i.e., $r^2 > 0.1$). For p-value thresholding, sets of SNPs with p-values less than a provided threshold ($P_T$) are generated, and then candidate PRSs corresponding to the thresholds are created with the PLINK options (e.g., *--score*, *--q-score-range*). The best PRS is selected among the candidate PRSs computed at a range of p-value thresholds based on the prediction $R^2$.

The clumping and thresholding (C+T) approach in PLINK can be automated with PRSice and PRSice-2, which are run through R scripts for computing and evaluating the PRS. PRSice and PRSice-2 are popular PRS tools and constitute efficient and scalable software for automating and simplifying PRS computation on large-scale GWAS data. They handle imputed data as well as genotyped data and simultaneously evaluate a large number of continuous and binary phenotypes. Similar to PLINK, they require summary data as well as phenotype, covariate, and genotype data for the target samples. They automate the procedure of the standard C+T method, which utilizes PLINK options for PRS analysis.
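The selection of the best PRS over a range of p-value thresholds can be sketched as follows. This is not the PLINK or PRSice implementation; it is a small pure-Python illustration that assumes clumping has already been done, and the function names and toy data are hypothetical.

```python
def squared_correlation(observed, predicted):
    """Squared Pearson correlation (0.0 if either vector is constant)."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return 0.0
    return cov * cov / (var_o * var_p)


def best_threshold(genotypes, betas, pvalues, phenotypes, thresholds):
    """Score the validation samples at each threshold and return the
    threshold with the highest prediction R2, plus all R2 values."""
    r2_by_threshold = {}
    for t in thresholds:
        selected = [j for j, p in enumerate(pvalues) if p < t]
        if not selected:
            continue  # no SNP passes this threshold
        scores = [sum(row[j] * betas[j] for j in selected) for row in genotypes]
        r2_by_threshold[t] = squared_correlation(phenotypes, scores)
    best = max(r2_by_threshold, key=r2_by_threshold.get)
    return best, r2_by_threshold


# Toy validation data (hypothetical): 5 individuals, 3 SNPs.
genotypes = [[0, 1, 2], [1, 0, 1], [2, 1, 0], [1, 2, 2], [0, 0, 1]]
betas = [1.0, 0.5, -0.8]
pvalues = [1e-8, 1e-4, 0.3]
# The phenotype is built from the first two SNPs only, so the third adds noise.
phenotypes = [0.5, 1.0, 2.5, 2.0, 0.0]

best, r2_by_threshold = best_threshold(
    genotypes, betas, pvalues, phenotypes, thresholds=[5e-8, 1e-3, 1.0]
)
```

In this toy example the intermediate threshold is selected because it includes both informative SNPs and excludes the uninformative one.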

### LD-Based Prediction

### LDpred

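LDpred assumes the linear model $y_i = \sum_{j=1}^{M} x_{ij}\beta_j + \varepsilon_i$, where $y_i$ is the phenotype for sample $i$, $\beta_j$ is the effect size for SNP $j$, $x_{ij}$ is the genotype for sample $i$ and SNP $j$, and $\tilde{\beta}_j$ denotes the marginal effect size estimate from the GWAS summary statistics for SNP $j$.

LDpred places a point-normal prior on the effect sizes: for SNP $j$, $\beta_j \sim N\left(0, \frac{h_g^2}{Mp}\right)$ with probability $p$ and $\beta_j = 0$ with probability $1-p$, where $p$ is the proportion of causal SNPs and $h_g^2$ is the heritability explained by the genotyped SNPs. In the absence of LD, the posterior mean effect size is estimated as

$$E(\beta_j \mid \tilde{\beta}_j) = \frac{h_g^2}{h_g^2 + Mp/N}\,\bar{p}_j\,\tilde{\beta}_j,$$

where $\bar{p}_j$ is the posterior probability that the $j$th SNP is causal, which can be interpreted as non-uniform shrinking of the estimated effect size. Under a Gaussian infinitesimal prior ($p = 1$) with LD, the posterior mean can be derived analytically as

$$E(\beta \mid \tilde{\beta}, D) \approx \left(\frac{M}{N h_g^2} I + D\right)^{-1}\tilde{\beta},$$

where $D$ is an LD matrix ($M \times M$) that needs to be estimated from the LD in a reference panel (LDpred-inf). LDpred-inf is a natural extension of GBLUP to summary statistics. Under a Gaussian non-infinitesimal prior, posterior means cannot be computed analytically, but they can be approximated with a Markov chain Monte Carlo (MCMC) Gibbs sampler. First, the $\beta_j$ values are initialized based on the infinitesimal prior with LD (LDpred-inf). At each iteration, $\beta_j$ is resampled from its conditional posterior distribution, in which the prior density $f(\beta_j)$ reflects the point-normal prior.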
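The idea of non-uniform shrinkage under a point-normal prior can be illustrated for the simplest case of unlinked SNPs. The sketch below is not the LDpred software. It assumes standardized genotypes and phenotypes, so that a marginal estimate has sampling variance of about $1/N$, and a prior variance of $h^2/(Mp)$ for causal SNPs; the function names and example values are hypothetical.

```python
import math


def normal_pdf(x, variance):
    """Density of a zero-mean normal distribution with the given variance."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def posterior_mean_no_ld(beta_marginal, n, m, h2, p):
    """Posterior mean of one SNP effect under a point-normal prior, no LD.

    n: GWAS sample size, m: number of SNPs, h2: SNP heritability,
    p: assumed proportion of causal SNPs.
    """
    causal_var = h2 / (m * p)   # prior variance of a causal effect
    noise_var = 1.0 / n         # sampling variance of the marginal estimate
    like_causal = p * normal_pdf(beta_marginal, causal_var + noise_var)
    like_null = (1.0 - p) * normal_pdf(beta_marginal, noise_var)
    prob_causal = like_causal / (like_causal + like_null)
    shrink = h2 / (h2 + m * p / n)
    return shrink * prob_causal * beta_marginal


# With p = 1 every SNP is causal and the shrinkage is uniform.
uniform = posterior_mean_no_ld(0.05, n=10000, m=1000, h2=0.5, p=1.0)

# With p = 0.01 small estimates are shrunk far more strongly than large ones.
ratio_small = posterior_mean_no_ld(0.005, 10000, 1000, 0.5, 0.01) / 0.005
ratio_large = posterior_mean_no_ld(0.1, 10000, 1000, 0.5, 0.01) / 0.1
```

With LD, the same prior no longer gives a closed form, which is why a Gibbs sampler is used.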

### LDpred software

LDpred is run in three steps. The first step coordinates the genotype data and the summary statistics with the ‘*ldpred coord*’ command. It requires one genotype file (LD reference) with at least 1,000 individuals of the same ancestry as the individuals used for the summary statistics. The second step generates an LD information file with a pre-specified LD radius and re-weights the SNP effects with the ‘*ldpred gibbs*’ command. One LD information file is created with a pre-specified LD radius, but several SNP weight files are generated corresponding to the different values of $p$ (the proportion of causal SNPs). The third step computes the PRS for individuals in the target dataset with the ‘*ldpred score*’ command. Separate PRS files are generated corresponding to the different values of $p$. Additionally, LDpred provides a pruning and thresholding option as an alternative method with the ‘*ldpred p+t*’ command. This option often yields better prediction results than the original LDpred when the sample size of the LD reference panel is not large enough.

LDpred-2 provides two new options: (1) the ‘*sparse*’ option, which can make SNP effects exactly 0; and (2) the ‘*auto*’ option, which learns the tuning parameter $p$ (the proportion of causal SNPs) directly from the dataset. LDpred-2 is implemented in the R package ‘*bigsnpr*’.

### BLUP-Based Prediction

### GBLUP

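The GBLUP model is $y = X\beta + g + e$, where $y$ is a vector of phenotypes ($N \times 1$), $X$ is a matrix of covariates excluding the SNPs ($N \times C$), $\beta$ is a vector of covariate effects ($C \times 1$), $g$ is a vector of random genetic effects for all individuals with $g \sim N(0, A\sigma_g^2)$ ($N \times 1$), where $A$ is an $N \times N$ genetic relationship matrix (GRM), and $e$ is a vector of random errors with $e \sim N(0, I\sigma_e^2)$ ($N \times 1$). The genetic values (i.e., individual BLUP) are estimated as $\hat{g} = \sigma_g^2 A V^{-1}(y - X\hat{\beta})$, where $V = A\sigma_g^2 + I\sigma_e^2$ is an $N \times N$ matrix.

A GBLUP model can be transformed to a ridge regression BLUP model (RRBLUP) [43,44], which is $y = X\beta + Wu + e$, where $W$ is a matrix of standardized genotypes ($N \times M$) and $u$ is a vector of random SNP effects with $u \sim N(0, I\sigma_u^2)$ ($M \times 1$). The SNP effects (i.e., SNP BLUP) are estimated as $\hat{u} = W^{T}A^{-1}\hat{g}/M$ given the GRM $A$ and the individual BLUP $\hat{g}$. The PRS in target samples is computed as $W_{new}\hat{u}$, where $W_{new}$ is a matrix of standardized genotypes in the target dataset.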
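The correspondence between GBLUP and RRBLUP can be checked numerically on a toy example. The sketch below makes several assumptions: there are no covariates, the phenotype is centered, the GRM is taken as $A = WW^{T}/M$, the variance components are treated as known, and $\sigma_u^2 = \sigma_g^2/M$. The SNP effects are obtained through $V^{-1}$ rather than $A^{-1}$, because the GRM of centered genotypes is singular. All names and values are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Toy data (hypothetical): 4 individuals, 3 column-centered SNPs.
W = [[-1.0, 0.0, 1.0], [1.0, -1.0, 0.0], [0.0, 1.0, -1.0], [0.0, 0.0, 0.0]]
y = [0.5, -0.2, 0.1, -0.4]
sigma_g2, sigma_e2 = 0.6, 0.4   # assumed known variance components
n, m = len(W), len(W[0])
Wt = transpose(W)

# GBLUP: individual BLUP from the GRM, then SNP effects via V^{-1}.
A = [[v / m for v in row] for row in matmul(W, Wt)]
V = [[sigma_g2 * A[i][j] + (sigma_e2 if i == j else 0.0) for j in range(n)]
     for i in range(n)]
v_inv_y = solve(V, y)
g_hat = [sigma_g2 * v for v in matvec(A, v_inv_y)]
u_gblup = [sigma_g2 / m * v for v in matvec(Wt, v_inv_y)]

# RRBLUP: ridge regression with penalty sigma_e2 / sigma_u2.
lam = m * sigma_e2 / sigma_g2
WtW = matmul(Wt, W)
ridge = [[WtW[i][j] + (lam if i == j else 0.0) for j in range(m)] for i in range(m)]
u_ridge = solve(ridge, matvec(Wt, y))
```

Both routes give the same SNP effects, and multiplying the genotypes by those effects recovers the individual BLUP.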

### SBLUP

SBLUP obtains the SNP BLUP from GWAS summary statistics, with the LD among SNPs estimated from a reference panel. Here, $V$ is a matrix of standardized genotypes from the reference panel, and $n_t$ and $n_r$ are the sample sizes for the training and reference samples, respectively. This assumes the similarity of allele frequencies and LD structure between the training and reference samples.

### GCTA software

For GBLUP analysis in GCTA, the GRM ($A$) is first estimated from the training genotype data with the ‘*--make-grm*’ option, and then the individual BLUP ($\hat{g}$) is estimated with the ‘*--reml-pred-rand*’ option. The SNP BLUP ($\hat{u}$) is estimated with the ‘*--blup-snp*’ option and used to predict the PRS of individuals in independent validation data with the PLINK option ‘*--score*’. For SBLUP analysis, the SNP BLUP ($\hat{u}$) is estimated with the ‘*--bfile*’, ‘*--cojo-file*’ and ‘*--cojo-sblup*’ options. The PRS of individuals in validation data is computed using PLINK, in the same way as for GBLUP models.

### BMR-Based Prediction

### BayesR

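BayesR assumes the linear model $y = X\beta + e$, where $y$ is a vector of centered phenotypes ($N \times 1$), $X$ is a matrix of standardized genotypes ($N \times M$), $\beta$ is a vector of SNP effects ($M \times 1$), and $e$ is a vector of random errors with $e \sim N(0, I\sigma_e^2)$ ($N \times 1$). It also assumes the SNP effects result from a finite normal mixture of $C$ components, so that the prior for $\beta_j$ becomes

$$\beta_j \mid \pi \sim \sum_{c=1}^{C} \pi_c\, N(0, \sigma_c^2),$$

where $\pi = (\pi_1, \ldots, \pi_C)$ are the mixing proportions and $\sigma_c^2$ is the variance of component $c$. The posterior distribution of $\beta$ has no closed form, so $\beta$ is sampled using the Gibbs sampling scheme. The posterior mean for SNP effects ($\hat{\beta}$) is used to compute the PRS.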

### SBayesR

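SBayesR relates the joint SNP effects ($\beta$) to the estimates of regression coefficients from $M$ simple linear regressions ($b$) by multiplying $y = X\beta + e$ by $D^{-1}X^{T}$, where $D = \mathrm{diag}(X^{T}X)$:

$$D^{-1}X^{T}y = D^{-1}X^{T}X\beta + D^{-1}X^{T}e.$$

Noting that $b = D^{-1}X^{T}y$ is the vector of the least squares marginal regression effect estimates and that $B = D^{-1/2}X^{T}XD^{-1/2}$ is the LD correlation matrix between SNPs, the model becomes $b = D^{-1/2}BD^{1/2}\beta + D^{-1}X^{T}e$. In practice, $D$ is replaced by estimates obtained from the summary statistics and $B$ is replaced by an LD correlation matrix estimated from a reference panel. The prior for $\beta_j$ is

$$\beta_j \mid \pi, \sigma_\beta^2 \sim \sum_{c=1}^{C} \pi_c\, N(0, \gamma_c\sigma_\beta^2),$$

where $C$ denotes the pre-specified maximum number of components in the finite mixture model, $\pi = (\pi_1, \ldots, \pi_C)$ and $\gamma = (\gamma_1, \ldots, \gamma_C)$. A component with $\gamma_c = 0$ corresponds to a point mass at zero. The default values are $C = 4$ and $\gamma = (0, 0.01, 0.1, 1)$. The posterior for $\beta$ has no closed form; the parameters, including $\beta$, are sampled using the Gibbs sampling approach, and the posterior mean for the SNP effects ($\hat{\beta}$) is used to compute the PRS.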
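The step of multiplying the individual-level model by $D^{-1}X^{T}$ can be verified numerically. The sketch below assumes $D = \mathrm{diag}(X^{T}X)$ and $B = D^{-1/2}X^{T}XD^{-1/2}$ and checks that the marginal estimates $b$ equal $D^{-1/2}BD^{1/2}\beta + D^{-1}X^{T}e$. The data are made-up toy values and the variable names are hypothetical.

```python
import math

# Toy individual-level data (hypothetical): 4 individuals, 3 SNPs.
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
beta = [0.3, -0.1, 0.2]             # assumed joint SNP effects
e = [0.05, -0.02, 0.01, -0.04]      # assumed residuals
n, m = len(X), len(X[0])

# Phenotype generated from the joint model y = X beta + e.
y = [sum(X[i][j] * beta[j] for j in range(m)) + e[i] for i in range(n)]

XtX = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(m)]
       for j in range(m)]
D = [XtX[j][j] for j in range(m)]   # diagonal of X'X
Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(m)]
Xte = [sum(X[i][j] * e[i] for i in range(n)) for j in range(m)]

# Marginal least squares estimates: one simple regression per SNP.
b = [Xty[j] / D[j] for j in range(m)]

# LD correlation matrix B = D^{-1/2} X'X D^{-1/2}.
B = [[XtX[j][k] / math.sqrt(D[j] * D[k]) for k in range(m)] for j in range(m)]

# Right-hand side of the summary-level model.
rhs = [
    sum(B[j][k] * math.sqrt(D[k]) * beta[k] for k in range(m)) / math.sqrt(D[j])
    + Xte[j] / D[j]
    for j in range(m)
]
```

The identity holds for any genotype matrix; in a summary-statistics setting only `b`, an estimate of `D`, and a reference-panel estimate of `B` would be available.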

### GCTB software

In GCTB, BayesR models with individual-level data are specified with the ‘*--bayes R*’ option. The options ‘*--pi 0.05*’ (a starting value for sampling $\pi$) and ‘*--hsq 0.5*’ (a starting value for sampling the heritability) can also be provided. SBayesR models with summary statistics are specified with the ‘*--sbayes R*’ option. The full chromosome-wide LD matrices are estimated using multiple CPUs with the ‘*--make-full-ldm*’ option, and shrunk LD matrices are built with the ‘*--make-shrunk-ldm*’ option. SBayesR models are run with the options ‘*--pi 0.95,0.02,0.02,0.01*’, ‘*--ldm*’ (an LD matrix), ‘*--gamma 0,0.01,0.1,1*’ (a prespecified hyperparameter $\gamma$), and ‘*--gwas-summary*’ (an input file for GWAS summary statistics).

### Penalized Regression-Based Prediction

### Lasso and elastic net

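Penalized regression assumes the linear model $y = X\beta + e$, where $y$ is a vector of phenotypic values ($N \times 1$), $X$ is a matrix of genotypes ($N \times M$), $\beta$ is a vector of SNP effects ($M \times 1$), and $e$ is random error with $e \sim N(0, I\sigma_e^2)$ ($N \times 1$). The elastic net regression obtains the estimates of $\beta$ by minimizing the following objective function:

$$f(\beta) = \lVert y - X\beta \rVert_2^2 + \lambda\left[\alpha\lVert\beta\rVert_1 + (1-\alpha)\lVert\beta\rVert_2^2\right],$$

where $\lVert\beta\rVert_1$ is the $L_1$ norm of $\beta$, $\lVert\beta\rVert_2$ is the $L_2$ norm of $\beta$, and $\lambda$ and $\alpha$ are tuning parameters to be estimated. When $\alpha = 1$, $f(\beta)$ becomes the objective function for lasso regression, and when $\alpha = 0$, it becomes the objective function for ridge regression. The PRSs in target samples are constructed with the estimated SNP effects from the lasso or elastic net and genotype data from the target dataset.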
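A coordinate descent sketch shows how $\lambda$ and $\alpha$ move the solution between least squares, ridge, and lasso. It assumes one particular parameterization of the objective, $\lVert y - X\beta\rVert_2^2 + \lambda[\alpha\lVert\beta\rVert_1 + (1-\alpha)\lVert\beta\rVert_2^2]$, which may differ by constant factors from the one used in a given software package. The function names and toy data are hypothetical, and no SNP column is assumed to be all zeros.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning 0.0 inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=200):
    """Cyclic coordinate descent for the assumed elastic net objective."""
    n, m = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(m)]
    beta = [0.0] * m
    resid = list(y)  # equals y - X beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(m):
            xj = cols[j]
            # Inner product of SNP j with the residual that excludes SNP j.
            rho = sum(x * (r + x * beta[j]) for x, r in zip(xj, resid))
            denom = sum(x * x for x in xj) + lam * (1.0 - alpha)
            new = soft_threshold(rho, lam * alpha / 2.0) / denom
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - x * delta for x, r in zip(xj, resid)]
            beta[j] = new
    return beta


# Toy data (hypothetical): noiseless phenotype with true effects (2, -1).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [2.0, -1.0, 1.0, 3.0]

ols = elastic_net(X, y, lam=0.0, alpha=1.0)       # no penalty
ridge = elastic_net(X, y, lam=3.0, alpha=0.0)     # pure L2 penalty
lasso = elastic_net(X, y, lam=2.0, alpha=1.0)     # pure L1 penalty
sparse = elastic_net(X, y, lam=100.0, alpha=1.0)  # strong L1 penalty
```

In a real analysis $\lambda$ and $\alpha$ would be chosen by cross-validation rather than fixed by hand.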

### Lassosum

The lasso objective function can be written as $f(\beta) = y^{T}y - 2\beta^{T}r + \beta^{T}R\beta + 2\lambda\lVert\beta\rVert_1$, where $r = X^{T}y$ is the SNP-wise correlation between the SNPs and the phenotype and $R = X^{T}X$ is the LD matrix, a matrix of correlations between SNPs. Lassosum approximates $R$ by $R_s = (1-s)X_r^{T}X_r + sI$ for $0 < s < 1$, where $X_r$ is a matrix of genotypes from a reference panel, and approximates $r$ from publicly available summary statistics. Lassosum thus constructs PRSs using summary statistics and a reference panel in a penalized regression setting.
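The summary-statistics version of the lasso can be sketched with coordinate descent that uses only $r$ and a reference panel. This is not the lassosum package; it assumes the objective $-2\beta^{T}r + \beta^{T}[(1-s)X_r^{T}X_r + sI]\beta + 2\lambda\lVert\beta\rVert_1$ and reference genotype columns scaled so that $X_r^{T}X_r$ is a correlation matrix. The function names and toy inputs are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning 0.0 inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_from_summary(r, X_ref, lam, s, n_iter=200):
    """Coordinate descent using SNP-phenotype correlations r and a
    reference genotype matrix X_ref (rows: individuals, columns: SNPs)."""
    m = len(r)
    cols = [[row[j] for row in X_ref] for j in range(m)]
    # Reference-panel LD matrix X_ref' X_ref.
    G = [[sum(a * b for a, b in zip(cols[j], cols[k])) for k in range(m)]
         for j in range(m)]
    beta = [0.0] * m
    for _ in range(n_iter):
        for j in range(m):
            # LD-weighted contribution of all other SNPs.
            ld_term = sum(G[j][k] * beta[k] for k in range(m) if k != j)
            z = r[j] - (1.0 - s) * ld_term
            beta[j] = soft_threshold(z, lam) / ((1.0 - s) * G[j][j] + s)
    return beta


# Toy inputs (hypothetical): two SNPs with no LD in the reference panel.
X_ref = [[1.0, 0.0], [0.0, 1.0]]
r = [0.3, -0.05]

beta_hat = lasso_from_summary(r, X_ref, lam=0.1, s=0.5)
```

With uncorrelated SNPs the estimate reduces to soft-thresholding of $r$, so the weakly associated SNP is set exactly to zero; with LD, the shrinkage parameter $s$ keeps the approximated LD matrix well conditioned.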

### R packages

### Multi-Trait Approaches

### MTGBLUP

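MTGBLUP extends the GBLUP model to multiple traits ($T$ traits): $y_i = X_i\beta_i + g_i + e_i = X_i\beta_i + W_iu_i + e_i$ (i.e., $g_i = W_iu_i$), where $i = 1, \ldots, T$. The individual BLUP and SNP BLUP for the $T$ traits are estimated jointly from this model. The PRS in target samples is computed as $W_{new}\hat{u}_i$, where $W_{new}$ is a matrix of standardized genotypes in the target dataset and $\hat{u}_i$ is the SNP BLUP for trait $i$.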

### wMT-SBLUP

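The multi-trait BLUP for $T$ traits can be re-written in terms of GWAS summary statistics and $L$, an $M \times M$ scaled LD correlation matrix estimated from a reference panel, so that the SNP effects for the $T$ traits can be approximately computed without individual-level training data.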

### CTPR

CTPR minimizes an objective function that combines three terms: $RSS(\beta)$, the residual sum of squares; a sparsity penalty with tuning parameter $\lambda_1$, using lasso or the minimax concave penalty to induce a sparse solution; and a cross-trait penalty with tuning parameter $\lambda_2$, to incorporate shared genetic effects across multiple traits for large-sample GWAS data. The cross-trait penalty induces smoothness of the coefficients and can incorporate prior knowledge on the similarity of a pair of traits at a given SNP via adjacency coefficients. CTPR also incorporates multiple secondary traits based on individual-level genotypes and/or summary statistics. The PRS in target samples is computed as $X_t\hat{\beta}$, where $X_t$ is a matrix of standardized genotypes in the target dataset and $\hat{\beta}$ is the vector of estimated SNP effects.