### Introduction

### Traditional Survival Methods

### Basic functions

Let *T* be a non-negative random variable representing the survival time, and let *f(t)* and *F(t)* be the probability density function and cumulative distribution function of *T*, respectively. Then, the survival function, *S(t)*, the hazard function, *h(t)*, and the cumulative hazard function, *H(t)*, are specified as:

*S(t)* = Pr(*T* > *t*) = 1 − *F(t)*,  *h(t)* = *f(t)*/*S(t)*,  *H(t)* = ∫₀^*t* *h(u)*d*u* = −log *S(t)*.

*h(t)* is more useful than *f(t)* in estimating *S(t)*, because it is the instantaneous rate of experiencing the event, conditional on having survived up to that time. It also uses the information in censored observations, since they are known to have survived up to their censoring times.
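These definitional relations can be checked numerically. The sketch below assumes an exponential survival time purely for illustration; the rate value and function names are arbitrary choices, not from the text:

```python
import math

def survival(t, rate=0.5):
    # S(t) = P(T > t) = 1 - F(t); for T ~ Exp(rate) this is exp(-rate * t)
    return math.exp(-rate * t)

def hazard(t, rate=0.5):
    # h(t) = f(t) / S(t); for the exponential this is the constant rate
    f = rate * math.exp(-rate * t)
    return f / survival(t, rate)

def cumulative_hazard(t, rate=0.5):
    # H(t) = -log S(t), equivalently the integral of h(u) from 0 to t
    return -math.log(survival(t, rate))
```

The identity *S(t)* = exp(−*H(t)*) ties the three functions together, whatever the underlying distribution.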

### KM estimator

Let *T* and *C* be the survival and censoring times, respectively. Then, the observed time is defined as *T̃* = min(*T*, *C*), and the censoring indicator as *δ* = *I*(*T* ≤ *C*). The survival data are represented as the pairs (*T̃_i*, *δ_i*), *i* = 1, …, *n*. Let *d_i* be the number of deaths at time *t_i*, and let *Y_i* be the number of subjects at risk at time *t_i*−, which counts only the subjects still surviving just prior to time *t_i*. Suppose that *p_i* denotes the conditional probability that a death occurs at time *t_i*, given those still alive just prior to *t_i*. Then the estimate of *p_i* is given as *p̂_i* = *d_i*/*Y_i*, and the KM estimate of the survival function at time *t* can be represented as follows:

*Ŝ(t)* = ∏_{t_i ≤ t} (1 − *d_i*/*Y_i*).

Many results about the properties of the KM estimator have been studied, relating to its asymptotic distribution, self-consistency, and efficiency [14,15].
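The estimator above can be sketched in a few lines; the function name and toy data are illustrative only:

```python
from collections import Counter

def kaplan_meier(times, events):
    # events[i] = 1 for a death, 0 for a censored observation
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    s, curve = 1.0, []
    for t in sorted(deaths):
        # Y_i: subjects at risk just prior to t (observed time >= t)
        at_risk = sum(1 for u in times if u >= t)
        s *= 1.0 - deaths[t] / at_risk  # product of (1 - d_i / Y_i)
        curve.append((t, s))
    return curve

# Toy data: deaths at times 1, 2, 3 and one censored observation at 2
# kaplan_meier([1, 2, 2, 3], [1, 1, 0, 1])
```

Censored subjects leave the risk set without contributing a factor to the product, which is how the estimator uses their partial information.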

### Log-rank test

Suppose we compare the survival functions of two groups at the distinct death times *t_i* (Table 1). Let *d_i* and *Y_i* denote the total number of deaths and individuals at risk at time *t_i*, respectively, and let *d_{ji}* and *Y_{ji}* (*j* = 1, 2) denote the number of deaths and individuals at risk in the corresponding group. Then, the conditional distribution of *d_{1i}*, given *d_i* and *Y_i*, is hypergeometric, under the null hypothesis of equal survival functions. The log-rank test is then given as follows:

*Z* = [∑ᵢ (*d_{1i}* − *Y_{1i}d_i*/*Y_i*)]² / ∑ᵢ *Y_{1i}*(*Y_i* − *Y_{1i}*)*d_i*(*Y_i* − *d_i*)/[*Y_i*²(*Y_i* − 1)],

where *d_i* is the total number of distinct deaths from the two groups. Assuming the independence of *d_{1i}* across times, it is known that the log-rank test has an asymptotic chi-square distribution, with one degree of freedom, under the null hypothesis. As shown in the equation above, the log-rank test is powerful when the two hazard rates are proportional across times, since it takes the sum of the differences between the observed and expected numbers of events. However, if the two hazard rates cross or are not proportional, the log-rank test yields lower power, and other tests, such as Kolmogorov-Smirnov and Cramér-von Mises type tests, or median tests, are preferred [18-20].
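The statistic can be transcribed directly from the formula above; the function name and toy data below are illustrative, not from any particular package:

```python
def log_rank(times, events, groups):
    # Two-sample log-rank statistic; groups[i] is 1 or 2
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    num = var = 0.0
    for t in death_times:
        Y = sum(1 for u in times if u >= t)                              # Y_i
        Y1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)  # Y_1i
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)   # d_i
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e == 1 and g == 1)                        # d_1i
        num += d1 - Y1 * d / Y        # observed minus expected under H0
        if Y > 1:
            # hypergeometric variance of d_1i given the margins
            var += Y1 * (Y - Y1) * d * (Y - d) / (Y * Y * (Y - 1))
    return num * num / var            # asymptotically chi-square, 1 df
```

The numerator accumulates observed-minus-expected deaths in group 1, which is why the test is most sensitive when one group's hazard is consistently higher.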

### Cox regression model

The Cox model specifies the hazard function as *h(t|X)* = *h₀(t)*exp(*Xβ*), where *X* represents the vector of risk predictors, exp(*Xβ*) is the relative risk associated with *X*, and *h₀(t)* is the baseline hazard function with *X* = 0. By dividing both sides by the baseline hazard function and taking the logarithm, the Cox model can be rewritten as the following linear regression model:

log{*h(t|X)*/*h₀(t)*} = *Xβ*.

The regression coefficient, *β*, can be interpreted as the log relative hazard rate between two individuals differing by one unit of *X*. For example, if *X* = 1 for the treatment group and *X* = 0 for the placebo group, the hazard rate of those who take the treatment is *e^β* times that of those in the placebo group. For *β* < 0, the treatment is considered beneficial, whereas for *β* > 0, it is considered deleterious.

To estimate *β*, the partial likelihood function of the Cox model was proposed [21], in which only *β* is involved in both the score function and the Fisher information, while the unspecified baseline hazard function drops out. In other words, statistical inference for *β* is made on the basis of the partial likelihood, regardless of the baseline hazard function. Only when one is interested in estimating the survival function from a Cox model should the baseline hazard function be estimated, which is done with Breslow's estimator [16].
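As a concrete illustration of how the baseline hazard drops out, the following sketch evaluates the log partial likelihood for a single covariate, assuming no tied event times (function name and data are illustrative):

```python
import math

def log_partial_likelihood(beta, times, events, x):
    # Cox log partial likelihood for one covariate, no tied event times;
    # note that the baseline hazard h_0(t) never appears
    ll = 0.0
    for i, (t, e) in enumerate(zip(times, events)):
        if e == 1:  # only uncensored subjects contribute a factor
            risk = [j for j, u in enumerate(times) if u >= t]
            ll += beta * x[i] - math.log(
                sum(math.exp(beta * x[j]) for j in risk))
    return ll
```

Maximizing this in *β* (e.g., by Newton's method) yields the estimate; at *β* = 0, each uncensored term reduces to minus the log of the risk-set size.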

The Cox model is semiparametric, since its hazard function contains the parametric factor *e^{Xβ}*, in which no specific distribution is assumed for *h₀(t)*. The effect of a regression coefficient in a Cox model is interpreted as the relative hazard rate of the corresponding risk factor, whereas the effect of a regression coefficient in an accelerated failure time (AFT) model, which regresses the logarithm of the survival time directly on the covariates, is interpreted as an acceleration factor of the survival time. Other types of regression models include Aalen's additive model [22] and partly parametric additive risk models [23].

### Regularization Methods for Analyzing Genomic Data

### Penalized Cox models

The penalized Cox models estimate *β* by maximizing the penalized log partial likelihood, pl(*β*) − *λP(β)*, where pl(*β*) = ∑ᵢ *δ_i*{*x_iβ* − log ∑_{j: t_j ≥ t_i} exp(*x_jβ*)}, *δ_i* is an indicator for the uncensored observation, and *λ* is called a "tuning parameter" that controls the degree of regularization. When *λ* = 0, there is no regularization, whereas as *λ* → ∞, the coefficients become more heavily regularized. As shown above, the lasso imposes an *L₁* penalty on the regression coefficients, *P(β)* = ∑ₖ|*β_k*|; the ridge imposes an *L₂* penalty, *P(β)* = ∑ₖ*β_k*²; and the elastic-net model combines the two penalties. In general, the lasso performs well in selecting significant genes among many thousands, but tends to select only one gene from any specific group of genes, without regard to which one is selected, when pairwise correlations between genes are very high. Furthermore, in the case of *p* ≫ *n*, the lasso selects at most *n* variables, due to the nature of the convex optimization problem. On the other hand, the ridge method, originally proposed to solve multicollinearity between predictors, is not appropriate for the variable selection problem. Thus, when the correlation between genes is of more interest than variable selection, the ridge penalty is more appropriate. The elastic-net method takes a weighted combination of the lasso and ridge penalties and performs better than the other two methods, in the sense that it can select more than *n* variables, even in *p* ≫ *n* cases, and accounts for correlations between genes. For example, an analysis of prostate cancer patient data showed that the elastic-net model had a smaller test error, with the same number of variables as the lasso [12]. Subsequently, various other modifications relating to regularization have been proposed, such as the adaptive lasso-Cox [25], the fused lasso [26], and least angle regression elastic net [27].
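The penalties themselves are straightforward to write down. The sketch below uses the elastic-net weighting convention of glmnet-style software, and includes the soft-thresholding operator, the standard proximal step behind the lasso's exact zeros (function names are illustrative, not from any particular package):

```python
def penalty(beta, lam, alpha):
    # Elastic-net penalty lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2);
    # alpha = 1 recovers the lasso, alpha = 0 a (halved) ridge penalty
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)

def soft_threshold(z, gamma):
    # Proximal operator of the L1 penalty: shrinks z toward zero and
    # sets it exactly to zero inside [-gamma, gamma] (variable selection)
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0
```

The hard zero produced by `soft_threshold` is what lets the lasso and elastic net drop genes entirely, while the quadratic ridge term only shrinks coefficients smoothly.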

### ML Methods for Analyzing Censored Survival Data

### Survival trees

### Support vector machines

### Ensemble methods

### Bagging survival trees

Bagging survival trees predicts conditional survival probabilities by aggregating over *B* bootstrap survival trees [40]. Although the predicted survival probabilities aggregated from multiple survival trees are not easily interpreted, they are based on similar observations, classified by repeated learning samples, in the aggregated set. We also note that bagging survival trees depends on both the number of bootstrap samples and the size of the multiple trees. As is usual for ensemble methods, bagging survival trees yields conditional survival probability predictions that are better than those of a single survival tree, in terms of the mean integrated squared error, even when the censoring proportion is 50%. The software for bagging survival trees is available at https://CRAN.R-project.org/web/packages/ipred/.
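The bootstrap-and-average structure can be sketched as follows. As a simplification, a Kaplan-Meier estimate on each bootstrap sample stands in for the survival tree that bagging survival trees would actually fit, so only the aggregation logic is illustrated (names and parameters are arbitrary):

```python
import random

def bagged_survival(times, events, t_eval, n_boot=200, seed=0):
    # Average the survival estimate at time t_eval over bootstrap resamples;
    # in real bagging, each base learner would be a survival tree
    rng = random.Random(seed)
    n = len(times)
    preds = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        bt = [times[i] for i in idx]
        be = [events[i] for i in idx]
        s = 1.0
        for t in sorted({u for u, e in zip(bt, be) if e == 1 and u <= t_eval}):
            at_risk = sum(1 for u in bt if u >= t)
            d = sum(1 for u, e in zip(bt, be) if u == t and e == 1)
            s *= 1.0 - d / at_risk
        preds.append(s)
    return sum(preds) / n_boot   # aggregated prediction
```

Increasing `n_boot` stabilizes the aggregated estimate, which mirrors the dependence on the number of bootstrap samples noted above.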

### Random survival forests

### Cox boosting

Two boosting implementations for the Cox model are the R packages *mboost* and *CoxBoost* [44]. Both *mboost* and *CoxBoost* are based on gradient boosting, but they differ in that *mboost* is an adaptation of model-based boosting, whereas *CoxBoost* adapts likelihood-based boosting. The *mboost* algorithm computes the direction in which the slope of the partial log-likelihood is steepest, and then updates the parameter estimate by minimizing the residual sum of squares of the multivariate regression model, with shrinkage of the penalized parameter. This procedure is performed iteratively until the stopping criterion is met. On the other hand, the *CoxBoost* algorithm uses a negative *L₂*-norm penalized partial log-likelihood, and updates the parameter estimates by maximizing this penalized partial log-likelihood, with a tuning penalty.
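A loose sketch of componentwise boosting for the Cox model, in the spirit of the algorithms above but a faithful reimplementation of neither *mboost* nor *CoxBoost* (step size, stopping rule, and data are illustrative assumptions):

```python
import math

def boost_cox(times, events, X, steps=50, nu=0.1):
    # Componentwise boosting sketch: at each step, update only the
    # coefficient with the steepest partial log-likelihood gradient,
    # shrunk by the step size nu
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p  # score vector of the log partial likelihood
        for i, (t, e) in enumerate(zip(times, events)):
            if e != 1:
                continue
            risk = [j for j, u in enumerate(times) if u >= t]
            w = [math.exp(sum(b * x for b, x in zip(beta, X[j]))) for j in risk]
            tot = sum(w)
            for k in range(p):
                # observed covariate minus its risk-set weighted mean
                xbar = sum(wj * X[j][k] for wj, j in zip(w, risk)) / tot
                grad[k] += X[i][k] - xbar
        k_best = max(range(p), key=lambda k: abs(grad[k]))
        beta[k_best] += nu * grad[k_best]  # shrunken update of one coordinate
    return beta
```

The small step size plays the role of the shrinkage or tuning penalty in the packaged algorithms, and the fixed step count stands in for a data-driven stopping criterion.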

Other boosting approaches include *L₂* boosting [46], using inverse probability of censoring weighting, and the boosted AFT model [47]. Recently, a new boosting method for nonparametric hazard estimation was proposed for settings in which time-dependent covariates are present [48].