A survival prediction model has recently been developed to evaluate the prognosis of resected nonmetastatic pancreatic ductal adenocarcinoma based on a Cox model using two nationwide databases: Surveillance, Epidemiology and End Results (SEER) and Korea Tumor Registry System-Biliary Pancreas (KOTUS-BP). In this study, we applied two machine learning methods—random survival forests (RSF) and support vector machines (SVM)—for survival analysis and compared their prediction performance using the SEER and KOTUS-BP datasets. Three schemes were used for model development and evaluation. First, we utilized data from SEER for model development and used data from KOTUS-BP for external evaluation. Second, these two datasets were swapped by taking data from KOTUS-BP for model development and data from SEER for external evaluation. Finally, we mixed these two datasets half and half and utilized the mixed datasets for model development and validation. We used 9,624 patients from SEER and 3,281 patients from KOTUS-BP to construct a prediction model with seven covariates: age, sex, histologic differentiation, adjuvant treatment, resection margin status, and the American Joint Committee on Cancer 8th edition T-stage and N-stage. Comparing the three schemes, the performance of the Cox model, RSF, and SVM was better when using the mixed datasets than when using the unmixed datasets. When using the mixed datasets, the C-index, 1-year, 2-year, and 3-year time-dependent areas under the curve for the Cox model were 0.644, 0.698, 0.680, and 0.687, respectively. The Cox model performed slightly better than RSF and SVM.
Pancreatic cancer is one of the most lethal cancers worldwide, with a 5-year overall survival rate of 12.6% as of 2020, whereas some other cancers have 5-year overall survival rates of over 80%. The survival rate strongly depends on the stage of cancer and disease severity. For example, in patients with stage I pancreatic cancer, the 5-year postoperative survival rate is 70.32%, while in patients with stage IV cancer, it is only 3.52%. Therefore, early diagnosis and prediction have been considered promising ways to improve the survival rate of pancreatic cancer.
A survival prediction model for resected pancreatic ductal adenocarcinoma (PDAC) was recently developed with data from the Surveillance, Epidemiology and End Results (SEER) database from the United States and external validation using the nationwide Korea Tumor Registry System-Biliary Pancreas (KOTUS-BP) dataset [
In this study, three different schemes were used for model development and external evaluation. First, we utilized data from SEER for model development and data from KOTUS-BP for external evaluation. Second, these two datasets were used in reverse by taking data from KOTUS-BP for model development and data from SEER for external evaluation. Finally, we mixed these two datasets half and half and utilized the mixed datasets for model development and external validation.
For each of the three different schemes, we developed prediction models using a Cox proportional hazards model, RSF, and SVM. We compared their performance in terms of the C-index and 1-year, 2-year, and 3-year time-dependent areas under the curve (AUCs).
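To make the main performance measure concrete, the following is a minimal sketch of Harrell's C-index in plain Python. It is not the implementation used in the study (the analyses were run in R); the function name `harrell_c_index` and its arguments are hypothetical, and ties in survival time are simply skipped.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk score orders correctly.

    times       -- observed follow-up times
    events      -- 1/True if death was observed, 0/False if censored
    risk_scores -- higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a comparable pair must be anchored on an observed event
        for j in range(n):
            if times[j] > times[i]:  # subject j was still at risk when i died
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported values of roughly 0.62 to 0.66 should be read.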
This study utilized two nationwide databases: the SEER database from the United States and the KOTUS-BP database from Korea. The datasets were pre-processed as described elsewhere [
The SEER database, which has been maintained by the National Cancer Institute in the United States since 1975, is one of the largest and highest-quality cohort studies, whereas the KOTUS-BP database was launched by the Korean Association of Hepato-Biliary-Pancreatic Surgery in 2014 and has been prospectively registered and regularly managed by pancreatobiliary surgeons at specialized centers in Korea. To unify the study period, patients who underwent upfront curative-intent pancreatectomy between 2004 and 2016 were included.
Three schemes for model development and external validation were conducted. First, we utilized data from SEER for model development and data from KOTUS-BP for validation. Second, we swapped the roles of these two datasets for model development and evaluation. Finally, we mixed these two datasets half and half, and utilized the mixed datasets for model development and external validation.
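The three schemes can be sketched as follows. This is only an illustration under a stated assumption: "half and half" is interpreted here as randomly splitting each cohort in two and pooling one half of each for development and the other half of each for validation. The text does not spell out the exact procedure, and the function name `build_schemes` is hypothetical.

```python
import random


def build_schemes(seer, kotus, seed=0):
    """Return {scheme name: (development set, evaluation set)}.

    ASSUMPTION: the mixed scheme splits each cohort randomly in half and
    pools one half of each cohort per side; the paper may have mixed differently.
    """
    rng = random.Random(seed)

    def halves(rows):
        rows = list(rows)
        rng.shuffle(rows)
        mid = len(rows) // 2
        return rows[:mid], rows[mid:]

    seer_a, seer_b = halves(seer)
    kotus_a, kotus_b = halves(kotus)
    return {
        "seer_to_kotus": (list(seer), list(kotus)),   # scheme 1
        "kotus_to_seer": (list(kotus), list(seer)),   # scheme 2
        "mixed": (seer_a + kotus_a, seer_b + kotus_b),  # scheme 3
    }
```

Under this reading, both sides of the mixed scheme contain patients from both countries, which is consistent with the "Training (SEER + KOTUS)" and "Test (SEER + KOTUS)" labels used in the results table.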
The Cox proportional hazards (Cox-PH) model and the two ML models had different schemes for the model development process, as shown in
In prospective cohort studies, survival analysis has been useful for investigating the prognostic factors associated with survival time and for predicting disease processes. In traditional survival analysis, a survival prediction model is constructed on the basis of demographic and clinicopathologic information. In recent years, there has been considerable interest in applying ML methods to predict the survival of cancer patients using large amounts of genomic information in addition to traditional clinical covariates. An advantage of ML methods over classical Cox regression models is their ability to model complicated associations between the survival time and risk factors, which can lead to better prediction. Unlike in regression and classification settings, however, standard ML methods cannot be directly applied to censored survival data. To account for the censoring mechanism, several ML methods have been extended to survival data, such as bagging survival trees [
The RSF method extends Breiman’s random forest method to right-censored survival data by using a forest of survival trees for prediction. As in regression and classification settings, RSF is an ensemble learner formed by averaging tree base-learners. In survival settings, the base-learner is a binary survival tree, and the ensemble is formed by averaging each tree’s Nelson-Aalen cumulative hazard function.
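As an illustration of the averaging step described above, the sketch below computes a Nelson-Aalen cumulative hazard from the patients in one terminal node and averages such step functions across trees. It is a simplified stand-in written for this explanation, not the routine used by any RSF package; the names `nelson_aalen`, `evaluate`, and `ensemble_cumhaz` are hypothetical.

```python
def nelson_aalen(times, events):
    """Nelson-Aalen estimate as a list of (event time, cumulative hazard).

    At each distinct event time t, the hazard increment is d / n, where d is
    the number of deaths at t and n is the number still at risk just before t.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    cumhaz, steps = 0.0, []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        cumhaz += deaths / at_risk
        steps.append((t, cumhaz))
    return steps


def evaluate(steps, t):
    """Value of the step function at time t (0 before the first event)."""
    value = 0.0
    for s, h in steps:
        if s > t:
            break
        value = h
    return value


def ensemble_cumhaz(per_tree_steps, t):
    """Average the terminal-node cumulative hazards of all trees at time t."""
    return sum(evaluate(steps, t) for steps in per_tree_steps) / len(per_tree_steps)
```

Here each element of `per_tree_steps` is assumed to be the estimate from the terminal node into which a given patient falls in one tree.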
There are four main steps in RSF: (1) Draw
SVM is a supervised learning method that has been very successful, mainly in classification, and was later extended to regression problems. The main idea of SVM is to minimize the
To take into account censored survival data, SVM for regression on the censored data (SVCR) has been proposed by imposing constraints on the SVM formulation for two comparable cases [
In the SVM model, overall survival time, y, is explained by the clinical variables x as y=φ(x)+ϵ, where φ(∙) is called the feature map. Since the feature map usually implies a higher-dimensional space, it is unusual to calculate the feature map itself. Instead, the feature map is directly calculated by kernel k(
Although SVCR and RankSVMs share the same framework to a certain extent, they differ in terms of how they utilize information for their ultimate objective. SVCR is designed to directly predict the survival time and to minimize the absolute error between predicted and observed survival times. In contrast, RankSVMs focuses on predicting the correct ranking of survival times rather than predicting the actual survival time. In this respect, SVCR extends the standard support vector regression to censored data by penalizing incorrect predictions of censored observations [
All statistical analyses were done using R version 3.6.2 (The R Foundation for Statistical Computing, Vienna, Austria). The only continuous covariate, age, was reported as the mean ± standard deviation, and the categorical variables were reported as frequencies with percentages, as shown in
Two Kaplan-Meier survival curves were compared using the log-rank test, as shown in
For the implementation of RSF, the number of binary decision trees, the maximum variables for splitting in each node, and the splitting rules for measuring survival differences are shown in
The number of trees was 50, 100, 200, 500, or 1,000, and the maximum number of variables considered for splitting ranged from 1 to 10. Although there were seven variables, three of them (histologic differentiation, AJCC 8th edition T-stage, and N-stage) each contributed one additional variable after one-hot encoding, giving 10 variables in total. Three different split rules were applied: log-rank splitting [
To implement SVM, 80 models were considered from combinations of various hyperparameters: two SVM models (SVCR and RankSVMs), two types of kernels (linear and clinical kernels), two ways of computing distance between data points (makediff1 and makediff3), and 10 values of the regularization parameter γ as shown in
Based on three survival predictive models, we investigated personalized treatment policies using the survival rate over time. It is well known that the Cox model assumes a proportional HR over time, which implies that the HRs between different individuals are constant over time. However, the two ML models used in this study reflect more complex interactions between covariates and yield non-constant HRs between different individuals over time. For personalized treatment, it would be more desirable to predict the survival rate over time using the ML models than using the Cox model.
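The constant-HR property follows directly from the form of the Cox model. As a brief reminder in standard textbook notation (the symbols $h_0$, $S_0$, $\beta$, and $x_i$ are not defined elsewhere in this article and are introduced here only for illustration):

```latex
h_i(t) = h_0(t)\,\exp\!\left(\beta^{\top} x_i\right), \qquad
\frac{h_i(t)}{h_j(t)} = \exp\!\left\{\beta^{\top}(x_i - x_j)\right\}, \qquad
S_i(t) = S_0(t)^{\exp\left(\beta^{\top} x_i\right)}
```

The baseline hazard $h_0(t)$ cancels in the ratio, so the HR between patients $i$ and $j$ does not depend on $t$, and their predicted survival curves cannot cross. RSF and SVM impose no such constraint.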
Through 10-fold CV of 150 RSF models, the model with the best validation Harrell C-index was chosen as the final RSF model. The final RSF model consisted of 100 decision trees, used a maximum of two variables in splitting nodes, and used the log-rank test to measure survival differences when two datasets were mixed half and half. The C-index and 1-year, 2-year, and 3-year time-dependent AUCs were 0.6337, 0.6824, 0.6681, and 0.6781, respectively.
Similarly, the model with the best validation Harrell C-index was chosen through 10-fold CV of 80 SVM models. The final SVM model was the SVCR model based on an additive clinical kernel, regularization constant (γ) of 0.1, and the makediff3 method to calculate the distance between data points when the two datasets were mixed half and half. The C-index and 1-year, 2-year and 3-year time-dependent AUCs were 0.6233, 0.6849, 0.6352, and 0.6264, respectively.
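The counts of 150 RSF and 80 SVM candidate models follow from the hyperparameter grids. The sketch below enumerates both grids and shows the selection rule in outline; the labels are taken from the hyperparameter tables, while `select_best` and its `cv_score` argument are hypothetical placeholders for the 10-fold CV routine, which is not reproduced here.

```python
from itertools import product

# RSF grid: 5 tree counts x 10 split-variable limits x 3 split rules = 150
rsf_grid = list(product(
    [50, 100, 200, 500, 1000],                    # number of trees
    range(1, 11),                                 # max. variables used in a split
    ["log-rank", "bs.gradient", "logrankscore"],  # splitting rule
))

# SVM grid: 2 model types x 2 kernels x 2 distance methods x 10 constants = 80
svm_grid = list(product(
    ["SVCR", "RankSVMs"],
    ["linear", "clinical"],
    ["makediff1", "makediff3"],
    [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10],  # regularization constant
))


def select_best(grid, cv_score):
    """Return the configuration with the highest validation score.

    cv_score is a hypothetical callable mapping one configuration to its
    mean validation C-index over the CV folds.
    """
    return max(grid, key=cv_score)
```

Because every configuration is refitted in each fold, the grid size multiplies directly into training cost, which is one reason the ML models take longer to develop than the Cox model.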
The C-index and 1-year, 2-year, and 3-year time-dependent AUCs of the Cox model were 0.6436, 0.6976, 0.6795, and 0.6873, respectively. On the mixed datasets, the Cox model performed slightly better than the RSF and SVM models on all four measures. The Cox model also yielded slightly better test results when the two datasets were mixed half and half than when they were not mixed.
In order to consider personalized treatment policies, we compared the predictive survival curves of three different patients using the fitted Cox model and the final RSF model described above. Suppose that the three chosen patients (A, B, and C) are all 50-year-old women, have a tumor in the body or tail of the pancreas, and have not received chemotherapy. Patient A has a well differentiated tumor staged T1 and N0 according to the AJCC 8th edition staging system. Patient B has a moderately differentiated tumor staged T2 and N1, whereas patient C has a poorly differentiated tumor staged T3 and N2. We plotted two predicted survival curves from both the Cox model and RSF model for these three patients over time, as shown in
In light of the development of a predictive survival model for PDAC [
Compared with the Cox model, the performance of the ML survival models was not significantly improved, and RSF performed similarly to the Cox model. However, the performance of SVM differed substantially according to how the survival information was used. The performance of SVCR was comparable to those of the Cox model and RSF, since SVCR utilizes the survival time in the regression model considering the censoring mechanism. In contrast, the performance of RankSVMs was not good because this method only uses the ranking information of the survival times.
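The difference between the two SVM variants can be illustrated with simplified loss functions. This is a sketch of the general idea only, assuming a one-sided penalty for censored cases and a pairwise hinge penalty for ranking; it is not the exact optimization problem solved by the software used in this study, and the function names are hypothetical.

```python
def svcr_style_loss(pred, time, event):
    """Regression-type loss on the survival time itself.

    Observed deaths are penalised by absolute error. For a censored patient
    the true time is only known to exceed `time`, so only predictions that
    fall below the censoring time are penalised.
    """
    if event:
        return abs(pred - time)
    return max(0.0, time - pred)


def ranking_hinge_loss(preds, times, events, margin=1.0):
    """Ranking-type loss: uses only the order of survival times.

    For every comparable pair (i died before j was last seen), the predicted
    value for j should exceed that for i by at least `margin`.
    """
    loss = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if times[j] > times[i]:
                loss += max(0.0, margin - (preds[j] - preds[i]))
    return loss
```

Shifting all predictions by a constant leaves the ranking loss unchanged but changes the regression-type loss, which shows that the ranking formulation discards the magnitude of the survival times.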
The RSF and SVM showed no substantial improvements in performance compared to the Cox model. In this study, only seven clinical variables were shared between the SEER and KOTUS-BP datasets, which might have been too few to maximize the usefulness of ML methods. ML methods are useful to analyze more complex and nonlinear associations among high-dimensional variables such as genetic information. It was also noted that the Harrell C-index of all models, both in the training set and in the test set, was less than 0.70, except for one or two cases.
Although it takes more time to develop ML survival models than a Cox model and there is no substantial performance improvement, these ML survival models have the advantage of allowing nonlinear risk to be predicted over time. As shown in
Conceptualization: SL, HK, JJ, TP. Data curation: JJ, HK. Formal analysis: HK. Funding acquisition: SL, TP. Methodology: SL, HK, TP. Writing - original draft: HK, SL. Writing - review & editing: TP, JJ.
Taesung Park serves as an editor of Genomics and Informatics but had no role in the decision to publish this article. All remaining authors have declared no conflicts of interest.
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2019R1F1A1062005) and a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI16C2037).
Flowchart of model development and external validation process for the Cox, random survival forests, and support vector machines models. SEER, Surveillance, Epidemiology and End Results; Cox PH, Cox proportional hazards; KOTUS-BP, Korea Tumor Registry System-Biliary Pancreas; CV, cross-validation.
Kaplan-Meier survival curves with 5-year overall survival (OS) rates and median survival times for the Surveillance, Epidemiology and End Results (SEER) and Korea Tumor Registry System-Biliary Pancreas (KOTUS-BP) datasets.
Hazard ratios and 95% confidence intervals of seven variables in the Surveillance, Epidemiology and End Results (SEER) and Korea Tumor Registry System-Biliary Pancreas (KOTUS-BP) datasets. AJCC, American Joint Committee on Cancer.
Overlaid predicted survival curves of the Cox model and random survival forests (RSF) method for three patients. Cox PH, Cox proportional hazards.
Basic statistics and 5-year overall survival rates for seven variables in the SEER and KOTUS-BP databases
Variable | SEER (n=9,624): Patients | SEER: 5-Year OS (%) | SEER: p-value^{a} | KOTUS-BP (n=3,281): Patients | KOTUS-BP: 5-Year OS (%) | KOTUS-BP: p-value^{a} |
---|---|---|---|---|---|---|
Age (yr) | 65.6±10.4 | 20.1 | | 63.8±10.1 | 32.2 | |
Female | 4,755 (49.4) | 21.3 | | 1,381 (42.1) | 36.2 | |
Male | 4,869 (50.6) | 18.9 | 0.006 | 1,900 (57.9) | 29.2 | 0.146 |
Head | 8,079 (83.9) | 19.2 | | 2,046 (62.4) | 28.5 | |
Body/Tail | 1,545 (16.1) | 25.0 | 0.002 | 1,235 (37.6) | 37.8 | <0.001 |
No adjuvant treatment | 2,948 (30.6) | 17.3 | | 2,006 (61.1) | 29.5 | |
Adjuvant treatment | 6,676 (69.4) | 21.3 | <0.001 | 1,275 (38.9) | 36.1 | <0.001 |
Well differentiated | 1,013 (10.5) | 37.4 | | 376 (11.5) | 44.9 | |
Moderately differentiated | 5,055 (52.5) | 20.5 | <0.001 | 2,362 (72.0) | 32.9 | <0.001 |
Poorly differentiated | 3,556 (37.0) | 14.6 | <0.001 | 543 (16.5) | 20.8 | <0.001 |
T1 | 1,603 (16.7) | 32.7 | | 672 (20.5) | 45.3 | |
T2 | 5,830 (60.6) | 18.8 | <0.001 | 2,007 (61.2) | 29.7 | <0.001 |
T3 | 2,191 (22.7) | 14.3 | <0.001 | 602 (18.3) | 24.5 | <0.001 |
N0 | 3,155 (32.8) | 32.4 | | 1,313 (40.0) | 42.6 | |
N1 | 4,030 (41.9) | 20.5 | <0.001 | 1,347 (41.1) | 28.5 | <0.001 |
N2 | 2,439 (25.3) | 14.6 | <0.001 | 621 (18.9) | 16.4 | <0.001 |
Values are presented as mean±SD or number (%).
SEER, Surveillance, Epidemiology and End Results; KOTUS-BP, Korea Tumor Registry System-Biliary Pancreas; OS, overall survival.
^{a}Log-rank test.
Hyperparameters for random survival forests
Hyperparameter | Value |
---|---|
No. of trees | 50, 100, 200, 500, 1,000 |
Max. variables used in split | 1‒10 |
Splitting rule | log-rank/bs.gradient/logrankscore |
One-hot encoded variables: differentiation, AJCC 8th edition T and N staging.
AJCC, American Joint Committee on Cancer.
Hyperparameters for support vector machines for survival analysis
Hyperparameter | Value |
---|---|
SVM type | SVCR, RankSVMs |
Kernel | Linear, clinical |
Distance matrix | Makediff1, makediff3 |
Regularization constant | 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10 |
SVM, support vector machines; SVCR, support vector regression for censored data.
C-index and 1-year, 2-year, 3-year time-dependent AUCs for the Cox, RSF, and SVM models according to three schemes
Model | Training C-index | Training Td1 AUC | Training Td2 AUC | Training Td3 AUC | Test C-index | Test Td1 AUC | Test Td2 AUC | Test Td3 AUC |
---|---|---|---|---|---|---|---|---|
Training: SEER; test: KOTUS | | | | | | | | |
Cox | 0.65417 | 0.72545 | 0.68776 | 0.68765 | 0.62792 | 0.65489 | 0.66759 | 0.68153 |
RSF | 0.66520 | 0.72960 | 0.70807 | 0.71722 | 0.63344 | 0.66660 | 0.67675 | 0.69104 |
SVM | 0.64218 | 0.72258 | 0.65812 | 0.64074 | 0.59956 | 0.61514 | 0.62619 | 0.63458 |
Training: KOTUS; test: SEER | | | | | | | | |
Cox | 0.65074 | 0.69346 | 0.69524 | 0.70095 | 0.62932 | 0.68365 | 0.67008 | 0.67426 |
RSF | 0.66293 | 0.70624 | 0.71295 | 0.71676 | 0.62189 | 0.67445 | 0.65885 | 0.66058 |
SVM | 0.62668 | 0.66973 | 0.66769 | 0.66072 | 0.60061 | 0.64794 | 0.63057 | 0.62372 |
Training: SEER + KOTUS; test: SEER + KOTUS | | | | | | | | |
Cox | 0.64890 | 0.70718 | 0.69108 | 0.69327 | 0.64361 | 0.69764 | 0.67953 | 0.68726 |
RSF | 0.66396 | 0.71328 | 0.72110 | 0.73110 | 0.63363 | 0.68239 | 0.66810 | 0.67806 |
SVM | 0.62538 | 0.69700 | 0.64029 | 0.61994 | 0.62333 | 0.68489 | 0.63515 | 0.62643 |
AUC, area under the receiver operating characteristic curve; RSF, random survival forests; SVM, support vector machines; C-index, Harrell’s concordance index; Td1, 1-year time-dependent; Td2, 2-year time-dependent; Td3, 3-year time-dependent; KOTUS, Korea Tumor Registry System-Biliary Pancreas; SEER, Surveillance, Epidemiology and End Results.