^{1}

^{*}

^{2}

Survival analysis mainly deals with the time to event, including death, onset of disease, and bankruptcy. The common characteristic of survival analysis is that it contains “censored” data, in which the time to event cannot be completely observed, but instead represents the lower bound of the time to event. Only the occurrence of either time to event or censoring time is observed. Many traditional statistical methods have been effectively used for analyzing survival data with censored observations. However, with the development of high-throughput technologies for producing “omics” data, more advanced statistical methods, such as regularization, should be required to construct the predictive survival model with high-dimensional genomic data. Furthermore, machine learning approaches have been adapted for survival analysis, to fit nonlinear and complex interaction effects between predictors, and achieve more accurate prediction of individual survival probability. Presently, since most clinicians and medical researchers can easily assess statistical programs for analyzing survival data, a review article is helpful for understanding statistical methods used in survival analysis. We review traditional survival methods and regularization methods, with various penalty functions, for the analysis of high-dimensional genomics, and describe machine learning techniques that have been adapted to survival analysis.

Survival analysis arises in many applied fields such as medicine, biology, engineering, epidemiology and economics. Survival time is defined as a time to event of interest such as death, relapse of disease, unemployment and completion of a task. The main characteristic of survival time is that it is censored due to the end of study or withdrawal during the period of study because we cannot follow-up the exact survival time for those who are still alive at the end of the study, or who are lost to follow-up during the period of study. However, it is known that the survival time of censored individuals is at least longer than the censoring time. There are many different types of censoring such as right-, left-, and interval-censoring. The most popular censoring mechanisms are right censoring, in which the lower limit of the exact survival time is observed, while the upper limit is observed for left censoring, and both lower and upper limits are used for interval-censoring. More details about censoring mechanism are described [

Many statistical methods have been developed for estimating survival functions, comparing survival curves between two groups, and modeling the survival data by regression, for association with risk factors, such as demographic and clinical predictors. In survival analysis, nonparametric statistical inference is more extensively used to estimate the survival function, and compare survival curves between two or more groups. For example, both the Kaplan-Meier (KM) estimator [

In the early 21st century, DNA microarrays for characterizing gene expression patterns have been used to more distinctly classify diseases, including cancer subtypes [

The human genome was determined to possess a sequence of 3 billion nucleotides, as determined by Human Genome Project in 2003 [

Since many clinicians can easily assess statistical programs such as SAS, SPSS, and R for analyzing survival data, a review article is helpful for understanding statistical methods used in survival analysis. We will briefly review the basic concepts and theories in survival analysis by explaining a KM estimator for the survival function, a log-rank test for comparing two-sample survival curves, and a Cox model for studying association with risk factors. Then the most widely used regularization methods will be described for high-dimensional genomic data analysis. Finally, a comprehensive review of omics ML methods will be given, with a short conclusion.

Three functions are used to characterize the distribution of survival time, namely, the survival function, hazard function, and probability density function [

In addition, the following relationship is easily derived, and if any one of these functions is known, then the other functions can be uniquely determined.

In survival analysis,

For estimating the survival function, the KM [

Let _{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}

From this, the KM estimator is given as _{i}

For clinical trials, it is common to assess the efficacy of a new drug or treatment compared to a placebo group. If the response variable is completely observed, either a t-test or Wilcoxon test is most suitable to solve this two-sample testing problem. However, neither of these is suitable for censored survival data. A log-rank test was originally proposed for one-sample problems [_{i}

Here, _{i}_{i}_{i}_{ji}_{ji}, (j_{1i}, given _{i}

where _{1i} across times, it is known that the log-rank test has an asymptotic chi-square distribution, with one degree of freedom, under the null hypothesis. As shown in the equation above, the log-rank test is powerful when the two hazard rates are proportional across times, since it takes the sum of the differences between the observed and expected number of events. However, if the two hazard rates cross or are not proportional, the log-rank test yields lower power and other tests, such as Kolmogorov-Smirnov, Cramer-von Mises type tests, or median tests, are preferred [

Most statistical methods of survival analysis have focused on finding risk factors among many possible demographic, environmental, and clinical variables, and predicting the survival probability of the patient with a certain disease. For censored survival data, a Cox regression model [

Here, _{0}(t)

Then, the regression coefficient,

For the estimation of

Since the log ratio of two hazard rates does not depend on time, as shown by the equation above, this derivation is known as a proportional hazards (PH) model. This proportionality between hazard functions is a strong assumption in real-life situations and requires evaluation by a goodness-of-fit test. However, since the Cox model is a fundamental basis for association studies in survival analysis, it has been further generalized to the stratified Cox model and the time-dependent Cox model, in which the proportionality assumption is not valid.

On the other hand, an AFT regression model is also widely used to fit the relationship between survival time and risk factors, in which the log survival time is specified as linear combinations of risk factors, with error random variable. According to the distribution of the error random variable, the parameters of the AFT model are estimated by maximum likelihood methods [^{Xβ}_{0}

Since microarray data is used in association studies with survival time, a number of studies have been published regarding solutions to high-dimensional problems, in which there are too few observations for too many variables. To identify significant disease-associated genes, single-gene approaches were applied to circumvent the high-dimensional problem with adjustment of multiple testing problem. However, there remain limitations to the single-gene approach, because it is too simple to explain complex associations between genes, environments, and diseases.

For processing high-dimensional genomic data, one solution is to regularize selection of significant variables, via penalized models. Such regularization may rely on an assumption of sparsity, i.e., that only a few genes have significant effects on diseases, among thousands of genes [

Here _{i}_{1}- penalty on the regression coefficients, the ridge imposes a _{2}- penalty, and the elastic-net model combines the two penalties. In general, the lasso performs well in selecting significant genes, among many thousands, but tends to select only one gene from any specific group of genes, and does not care which one is selected when pairwise correlations, between genes, are very high. Furthermore, for the case of

Recently, ML techniques have been rapidly adapted to a variety of fields, for automatically analyzing huge amounts of data. The basic concept of ML is to make the computer “learn” from repeated input data, and recognize hard-to-discern patterns from large, noisy, or complex data. This ML approach is well-suited to construct a predicted model when there are both nonlinear and complex interactions, among several features. Thus, ML has been widely applied to cancer prognosis and prediction, for medical applications. Predicted survival rates are particularly interesting, as they are part of a growing trend toward personalized medicine.

Although ML techniques have been developed and applied to artificial intelligence and data mining [

Many types of ML systems exist, depending on whether they are trained with human supervision, such as supervised, unsupervised, semi-supervised, and reinforcement learning. Among those, the most interesting one is supervised learning, whose main task is classification, and predicting a target variable such as survival time. In this review, we will focus on the following ML techniques that are adapted to survival data: survival trees, support vector machines (SVMs), and ensemble methods such as bagging survival trees, random survival forests, Cox boosting, and artificial neural networks (ANNs).

Decision trees have been useful for the classification and prediction of a wide range of applications, because it requires few statistical assumptions, readily handles various data structures, and provides easy and meaningful interpretation. Several studies on the practical and theoretical aspects of tree-based methods were developed, and the classification and regression tree (CART) software program has made tree-based methods popular, applied statistical tools [

Survival trees were first proposed by adapting most of the CART paradigm for analyzing censored survival data by minimizing the within-node variabilities in survival time. Alternatively, the other approach for survival tree construction has been developed by maximizing the difference in survival between “child nodes,” as measured by two-sample test statistics, such as a log-rank [

The components of the survival tree algorithm consist of rules for growing the tree, pruning the tree, and choosing a tree of the appropriate size. The most common rule for growing and pruning a tree is a log-rank test, which tests for dissimilarity in survival between two groups. More properties for measures of splitting have been studied in detail [

Recently, survival trees have been constructed for analyzing multivariate survival time data, when the subjects under study are either naturally clustered or experience multiple events (namely, recurrent times) [

SVMs are powerful ML methods, capable of performing linear or nonlinear classification, regression, and outlier detection. SVMs were first proposed for binary classification problems, and then subsequently extended to regression, clustering, and survival analysis. The main idea of SVMs is to maximize the margin between two classes and find a separating hyperplane that minimizes misclassification. The separating SVM hyperplane not only separates the two group classes, but also stays as far away from the close observations possible. The observations located on the edge of the separating hyperplane are known as the support vectors that fully determine the classification.

Although linear SVM classifiers are efficient and perform well in many cases, high-dimensional datasets are often not separated by a linear SVM classifier. To handle both nonlinear and high-dimensional datasets, the SVM classifier uses a high-dimensional kernel function to make the original dataset linearly separable. SVMs were subsequently extended to regression and censored survival data. By considering the penalty for censored observations, the SVM method for regression of censored data (namely, SVCR) was proposed [

The support vector regression for censored data (SVRc) was subsequently proposed to take into account an asymmetric penalty (or loss function) for censored and non-censored data. In terms of the concordance index and the hazard ratio, SVRc performed better than the Cox PH model in five real-life survival analysis datasets [

Ensemble methods are based on the wisdom of “the crowd,” i.e., a new classifier produced by aggregating or voting from a group of classifiers. For example, a group of decision tree classifiers can be produced from different random subsets of a training set. To make a prediction, we may obtain predictions of all the individual decision tree classifiers, and then predict the class with the most votes. Multiple classifiers often predict better than individual classifiers, and appropriately weigh several classifiers, to improve predictability. In this section, we briefly review three ensemble methods: bagging survival trees, random survival forests, and Cox boosting.

The terminology of bagging stands for “bootstrap aggregating,” and the random sampling from a training set is performed repeatedly, with replacement known as B bootstrap samples. We then obtain a set of survival trees, based on B bootstrap samples, and define a new predictor by aggregating all predictors from a set of survival trees. In general, the new predictor can be a statistical mode for classification or an average for regression problem. It is known that the ensemble predictor reduces both bias and variance, compared to a single predictor.

For censored survival data, the averaged point predictor, such as the mean or median survival time, is of minor interest, compared to the predicted conditional survival probability of a new observation. Based on the bagging survival trees, one single KM curve is calculated from the observations identified by the “leaves” of

Like bagging survival trees, the random survival forest is based on random bootstrap samples from a training set, but also allows extra randomness when growing trees. Instead of searching for the same set of variables when splitting a node, random survival forests search for the best variables among a random subset of variables, and these variables are used to split the node by maximizing the log-rank statistic. Similar to the classification and regression problems, random survival forests are an ensemble learner formed by averaging a number of base learners. In survival settings, the base-learner is a survival tree, and the ensemble is a cumulative hazard function formed by averaging individual tree’s Nelson-Aalen’s cumulative hazard function [

When implementing random survival forests, a primary interest is how to select a random subset of variables as candidates for splitting a node. The traditional variable selection in random forests is based on the variable importance (VIMP), a measurement of the increase (or decrease) in prediction error, for the forest ensemble, when a variable is randomly ‘noised up’. However, VIMP is based on the prediction error, and varies considerably, depending on the data with a high-dimensional problem. Alternatively, as described in Ishwaran et al.’s study [

Boosting was originally based on combining multiple weak learners into one strong learner, as proposed in the ML community, especially for classification problems [

The main idea of boosting is to update the predictors sequentially, which at each iteration, fit a weak predictor of the previous version of the data, as updated by minimizing a pre-specified loss function. The obtained value provides a small contribution used to update a new predictor, and all contributions result in a final predictor. Unlike bagging and random survival forests, boosting is a sequential learning technique, implying that it cannot be parallelized. Historically, the AdaBoost method was first proposed [

In survival analysis, most boosting methods have focused on the Cox model, by using gradient boosting, with a loss function derived from the Cox partial likelihood function, as used in the popular R-packages _{2}-norm penalized partial log-likelihood, and updates the estimates of the parameter by maximizing this penalized partial log-likelihood, with a tuning penalty.

Furthermore, to improve the Cox model’s prediction, an offset-based boosting approach was adapted to allow for a flexible penalty structure, including unpenalized mandatory variables, when clinical covariates should be included with high-dimensional omics data [

There are two main parameters to be considered in a boosting procedure: the first controls the weakness of the estimators, known as a penalty or boosting step. The other parameter specifies how many boosting iterations should be performed, which is related to avoiding overfitting, and in component-wise boosting, controls the sparsity of the model. Beside these, other approaches to survival analysis include _{2} boosting [

For specific detection and prediction of breast cancer risk, several ANN models have been developed over the last few decades [

In review of the literature, the ANN method was first applied to survival analysis [

Recently, a neural network extension of the Cox regression model, “Cox-nnet,” was proposed to predict patient prognosis from high-throughput transcriptomics data [

On the other hand, neural network techniques have been developed to overcome the PH assumption of the Cox model, to allow more general relationships between survival time and high-dimensional omics data. For example, the multi-task logistic regression model was developed [

In order to illustrate three different types of methods reviewed, we applied these methods to a real dataset from The Cancer Genome Atlas (TCGA) Genome Data Commons (GDC) portal (

We applied the preprocessing procedure to RNA-seq data of 56,716 genes annotated. The relative log expression (RLE) normalization method was adopted to control the gene length bias. The RLE method was implemented in R package (v3.5) “DESeq2” (v1.22.2) [

First, we applied the traditional method by fitting a Cox model with eleven clinical variables, in which five clinical variables were selected by 3-fold cross validation as shown in

As shown in

In this article, we reviewed statistical methods for survival analysis, focusing on the adaption of traditional methods, regularizations, and ML. We introduced how a KM estimator can determine a survival function, how the log-rank test can compare two survival curves, and use of the Cox regression model. Though there are many other estimators, tests, and models available for survival analysis, the abovementioned methods are the most popularly applied, and the Cox model in particular, is widely adapted for regularization and ML techniques.

With development of high-throughput biological data acquisition, a variety of high-dimensional omics data has rapidly accumulated, allowing for more personalized information used for detection, and prediction of survival probability. Presently, many ML techniques have increasingly been combined with traditional survival methods, and regularization approaches, to provide more accurate diagnosis and prognostic decision-making, in clinical practice (i.e., “precision medicine”). We also briefly reviewed more useful and applicable ML techniques such as survival trees, SVMs, bagging, random survival forests, Cox boosting, and ANNs. Especially, ANNs are more powerful when the relationship between covariates and the outcomes are not linearly correlated and do not restrict any specific functional relationship between covariates and outcomes. By allowing more hidden layers and a variety of flexible open-source deep learning techniques, ANNs are increasingly used for detection and prediction, in biomedical fields.

As described in this article, the classical statistical methods in survival analysis have been well adapted to the ML techniques, to improve survival predictability. This trend will continue to be substantially expanded, in conjunction with deep learning techniques, which are now explosively utilized in many domains (including artificial intelligence).

Conceptualization: SL. Data curation: HL. Formal analysis: HL. Funding acquisition: SL. Methodology: SL. Writing - original draft: SL. Writing - review & editing: SL.

No potential conflict of interest relevant to this article was reported.

This research was supported by the National Research Foundation of Korea (NRF) grant funded by Korea government (MIST) (NRF-2019R1F1A1062005).

Clinical variable selection scheme.

2 × 2 table at time _{i}

Group | Dead | Alive | Risk set |
---|---|---|---|

Treatment | _{1i} |
_{1i}-_{1i} |
_{1i} |

Placebo | _{2i} |
_{2i}-_{2i} |
_{2i} |

Total | _{i} |
_{i}-_{i} |
_{i} |

Result of univariate Cox model for clinical variables

Clinical variable | p-value^{a} |
||
---|---|---|---|

Age at initial pathologic diagnosis | 0.103 | ||

Maximum tumor dimension | 0.001 | ||

Sex | 0.312 | ||

Anatomic neoplasm subdivision | 0.017 | ||

Surgery performed type | <0.001 | ||

Residual tumor | 0.108 | ||

T stage | 0.096 | ||

N stage | 0.053 | ||

Radiation therapy | 0.004 | ||

Postoperative rx tx | <0.001 | ||

Person neoplasm cancer status | <0.001 |

p-value from a univariate Cox model.

Comparison of C-index of using clinical and lasso gene variables

C-index | |||
---|---|---|---|

Method | Clinical | Lasso genes | Clinical + Genes |

Cox model | 0.75 ± 0.06 | 0.60 ± 0.14 | 0.84 ± 0.03 |

SVM | 0.65 ± 0.12 | 0.50 ± 0.03 | 0.74 ± 0.07 |

RSF | 0.73 ± 0.11 | 0.56 ± 0.08 | 0.78 ± 0.08 |

Cox boosting | 0.75 ± 0.06 | 0.60 ± 0.13 | 0.84 ± 0.03 |

Values are presented as mean ± standard deviation.

SVM, support vector machine; RSF, random survival forest.

Comparison of C-index using clinical and elastic net gene variables

Method | C-index | |||||||
---|---|---|---|---|---|---|---|---|

Clinical | E-N gene | Clinical + Genes | ||||||

Cox model | 0.75 ± 0.06 | 0.64 ± 0.09 | 0.79 ± 0.07 | |||||

SVM | 0.65 ± 0.12 | 0.54 ± 0.14 | 0.64 ± 0.16 | |||||

RSF | 0.73 ± 0.11 | 0.61 ± 0.07 | 0.77 ± 0.08 | |||||

Cox boosting | 0.75 ± 0.06 | 0.64 ± 0.09 | 0.80 ± 0.05 |

Values are presented as mean ± standard deviation.

SVM, support vector machine; RSF, random survival forest.