Ovarian Cancer Prognostic Prediction Model Using RNA Sequencing Data
Article information
Abstract
Ovarian cancer is one of the leading causes of cancer-related deaths in gynecological malignancies. Over 70% of ovarian cancer cases are high-grade serous ovarian cancers and have high death rates due to their resistance to chemotherapy. Despite advances in surgical and pharmaceutical therapies, overall survival rates are not good, and making an accurate prediction of the prognosis is not easy because of the highly heterogeneous nature of ovarian cancer. To improve the patient’s prognosis through proper treatment, we present a prognostic prediction model by integrating high-dimensional RNA sequencing data with their clinical data through the following steps: gene filtration, pre-screening, gene marker selection, integrated study of selected gene markers and prediction model building. These steps of the prognostic prediction model can be applied to other types of cancer besides ovarian cancer.
Introduction
The genetic background of patients with complex diseases, such as cancer, has been continually studied. Traditional attempts mainly focus on finding unique differentially expressed genes (DEGs) for diagnosis or survival times using microarray techniques [1, 2]. Genetic markers can be indicators of the activity state of a pathway of therapy, showing their potential as prognostic predictors for specific treatments. Ovarian cancer, especially high-grade serous ovarian cancer (HGSC), is one of the most lethal gynecological malignancies in women. Vague symptoms and a lack of robust biomarkers for detection are the main causes of difficulties in making an early diagnosis. Major factors associated with the poor prognosis and poor management of ovarian cancers include the late discovery of the disease, chemotherapy resistance, and a lack of clinical variables that are crucial for accurate prognostic predictions [3]. Thus, there is a need to identify new biomarkers that can be used to improve the treatment of ovarian cancer patients [4].
Previous studies have mainly focused on identifying novel molecular markers or subtyping through molecular markers of HGSC [5,6]. In this study, we aim to identify clinical and genetic markers of the prognosis of HGSC. Through analyzing high-dimensional genetic information, we can gain a better understanding of HGSC and its biological mechanism. For cancers with a poor prognosis, genetic information can be effective in improving the prognosis of patients based only on clinical information [3].
Next-generation sequencing (NGS) technology has made it possible to generate mRNA expression data for the tens of thousands of genes from only a few hundred samples [7]. These RNA sequencing (RNA-Seq) data have the advantage of knowing the sequence count directly, without being affected by the background noise, unlike microarray data. RNA-Seq data also contain more genes to discover than microarray data. High-throughput NGS-based tumor genome profiling has significantly advanced our understanding of the molecular variability exhibited by tumors.
However, the nature of RNA-Seq data, consisting of non-negative count data, requires a new predictive model that differs from that of microarray data. For RNA-Seq data, it is difficult to apply a traditional statistical approach because of its relatively small number of samples compared with the large number of genes. Therefore, we present a process for finding genes that play important roles in the outcome and specific procedures for designing and evaluating a model, based on the characteristics of RNA-Seq data.
The purpose of this study is to find gene markers that can significantly affect the prognosis, when analyzing RNA-Seq data, and to present a protocol that can improve the predictive performance using transcription data by integrating clinical information. Specifically, we present a protocol to build a prediction model for the prognosis of HGSC, including (1) selection of markers and clinical variables and (2) model construction and evaluation.
Methods
The data used in this study were clinical information and RNA-Seq data from The Cancer Genome Atlas (TCGA) Genomic Data Commons (GDC) portal (https://portal.gdc.cancer.gov). The National Institutes of Health (NIH)’s TCGA webpage provides clinical information, biomedical slide images, molecular information, CpG methylation information, DNA copy number, mRNA expression, and miRNA expression. RNA-Seq data were used, because large amounts of gene expression could be explored collectively from samples in the study. We downloaded genomic alteration data and expression data in patients with high-grade serous ovarian cancer and the corresponding clinicopathological profiles from the Genomic Data Commons (https://portal.gdc.cancer.gov), Firebrowse (http://firebrowse.org), and cBioPortal for Cancer Genomics (http://www.cbioportal.org) web portals. This study complied with TCGA publication guidelines and policies (http://cancergenome.nih.gov/publications/publicationguidelines). The Institutional Review Board of Seoul National University Hospital ruled that no formal ethical approval was required in this study.
Survival time refers to the period from the diagnosis of HGSC in patients to death. Clinical variables were selected, based on their influence on the outcome. Among 365 HGSC cases, the median survival time for the samples was 43.9 months, and the censoring proportion of the data was 55%.
Pre-processing
HGSC patient data collected from the GDC portal can be separated into two categories. Since RNA-seq data from the GDC portal consist of raw count data, normalization was required to control for the sample bias and gene length bias. Two common normalization methods were applied. The first method is called relative long expression (RLE) normalization, implemented in the R package “Deseq2” [8], which is also implemented in the R package “edgeR” [9]. The RLE method utilizes the geometric mean of the read count, whereas the TMM method estimates normalizing factors after extreme expressions are removed to get more a robust estimation. A simple simulation study for DEGs has been performed for comparison [10]. Both methods were used to give more specific comparison results in the model building.
Since the clinical data had many missing values, we imputed missing clinical variables with the R package “mice,” which performs chained equations to find estimates for missing values using Gipps sampling [11]. There were no tendencies in the missing values; so, the missing-at-random assumption was applied [11]. After imputation, significant clinical variables were chosen using a Cox regression model via stepwise selection methods.
Clinical data and RNA-Seq data were integrated and divided into training and test sets. To avoid unwanted effects of censoring, balanced sampling in terms of censoring status was used to make the training and test sets have the same censoring rate.
Analysis plan
In this study, we considered a prediction model and process in terms of performance and interpretability. The main steps of our analysis were as follows: (1) gene filtration, (2) pre-screening, (3) gene marker selection, and (4) integrated study of selected gene markers and the final prediction model.
Gene filtration
Gene filtration for low expression is important because it can reduce false positives [12]. Sampling noise that is added during the sequencing process can also be removed by gene filtration. In this study, a 20% zero proportion threshold was selected, as this value is commonly chosen in previous studies [13]. Gene filtration was performed only on the training set.
Pre-screening
Pre-screening of gene markers was used after gene filtration. This process is necessary for actual computation and controlling false positives. Unsupervised filtering using median absolute deviation (MAD) was used. MAD is widely used in microarray data and single-cell RNA-Seq data as a pre-screening method. This measure can be used to select more variable genes. Compared with standard deviation, MAD is a more robust measure [14]. Since the raw count data consist of sparse and non-negative numbers, MAD was used to represent variability. The genes with MAD were selected for model building.
Another supervised pre-screening method was used, based on univariate Cox regression with adjustment of clinical variables. This Cox regression procedure was performed only for the training set to avoid overfitting. Next, genes were selected, based on individual p-values. Both methods were jointly applied to select candidate gene markers.
Gene selection
Although the pre-screening process greatly reduced the number of genes, there remained the problem of an ill-posed model, since the sample size was very small compared to the number of features. Therefore, sequential variable selection using a multiple Cox regression model was not applicable. To solve such a problem, as presented below, multiple Cox regression using penalized partial likelihood was used for variable selection and shrinkage estimation.
where
Modified partial likelihood was also used to select clinical variables more favorably, since some clinical variables have already been reported to be significant.
Thus, the penalty factor γ is used to reduce the penalty of clinical variables.
The partial likelihood estimation process explained above is implemented in the R package “glmnet,” also known as “coxnet” [15].
Prediction model building and detailed study of selected genes
Along with selected genetic markers and clinical variables we fitted a multiple Cox regression model. For further integrated study of the selected gene markers, we grouped patients into two groups in terms of fitted hazard ratios as high- and low-risk groups. Then, log rank test was performed to test for homogeneity of survival rates between the two groups. Also, a functional study for the selected gene markers using a pathway database was provided for biological understanding [16].
Results
To build and validate the prediction models, the whole dataset was divided into training and test datasets at a 2:1 ratio.
Clinical results
Descriptive statistics of the clinical information are summarized in Table 1. First, all categorical variables were represented by dummy variables. Note that each variable has less than 30% of observations missing. When we put all variables together, however, they made up more than 40.55% of observations with at least one variable missing. To include cases with missing variables, imputation was performed using the “mice” package.
Variable selection of clinical variables was performed for Cox regression by using stepwise selection procedures. The results are summarized in Table 2. A total of 6 clinical variables were chosen out of 26 variables. Among the 6 variables, platinum status and primal therapy outcome were significant for prognosis at a 5% significance level.
Gene marker selection
After applying a 20% zero proportion threshold for filtration, a total of 42,747 genes were selected. For ease of interpretation, only protein-coding genes and microRNA genes, which numbered up to 18,415 genes, were used. We performed two pre-screening processes. The first one used MAD.
Genes with MAD under the 10% quantile were filtered out to control false positives. The second filtering was performed using p-values of the gene markers. After that, genes with p-values larger than 0.3 were removed. After the prescreening, 6,161 genes remained. We then fit the modified penalized Cox regression model as explained in the Methods section. Among the many values, 0.2 was chosen as the gamma parameter to be effective for both marker selection and prediction in this study. Finally, three genes—REN, LEFTY1, and AP1S2—were selected, along with 6 clinical variables. These selected markers are summarized in Table 3.
Integrated study of selected gene markers
Along with the selected genetic markers and clinical variables, we fitted a multiple Cox regression model. The genetic markers chosen for the final model were REN, LEFTY1, and AP1S2. From the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database [16], we found out that these genes belong to the following pathways, respectively: KEGG RENIN ANGIOTENSIN SYSTEM, KEGG TGF BETA SIGNALING PATHWAY, and KEGG LYSOSOME. The false discovery rate q-values from the univariate Cox regression with a single gene were 0.693, 0.687, and 0.138, respectively. These q-values mean that those genetic markers cannot be selected through the false discovery rate procedure with a level of 0.1 via univariate Cox regression.
For further integrated study of the selected gene markers, we grouped patients into two groups—high- and low-risk— using the fitted hazard ratios. As shown in Fig. 1, the two groups were well separated, with a log-rank test p-value of 8.02×10−6. We could see that our model predicts the prognosis of patients quite well. The three selected genes have been reported to have an association with ovarian cancer in previous studies. The renin-angiotensin system works as an angiogenic factor through type 1 angiotensin receptors. These angiogenic factors have a correlation with ovarian cancer patient survival [17]. Lefty1 is an activator of the TCEA3 gene, the expression of which is related to cell death in ovarian cancer [18]. Lastly, AP1S2 has been reported as a prognostic marker of ovarian cancer, and its expression level is differentially changed in drug-resistant cell lines [19, 20].
Prediction model results
Two measures of assessment were considered. One was Harrell’s concordance index, and the other was the 2-year time-dependent area under the receiver operating characteristic curve (AUC) [21, 22]. For purposes of comparison, we considered three models: a multiple Cox model with only clinical variables (M1), penalized Cox regression (M2), and our proposed method (M3). Table 4 shows the C-index and time-dependent AUC of each model. Among the three models, penalized Cox regression M2 showed even worse results than M1 alone with the clinical variables, whereas our method, M3, showed the best performance.
The time-dependent receiver operating characteristic curves for the training sets and test sets are shown in Fig. 2. In most of the range of Fig. 2, the curve of M3 is higher than the other curves on in the upper left corner, showing better performance with regard to prognostic prediction.
Discussion
In this study, we presented the process of building prediction models, based on clinical variables and high-dimensional RNA-Seq data for HSGC patients provided by the TCGA GDC portal. The specific steps were as follows: gene filtration, pre-screening, gene marker selection, and integrated study of selected gene markers with the final prediction model.
To put more emphasis on the clinical variables, we applied modified penalized Cox regression. We first selected clinical variables that had major impacts on the survival outcome of HGSC, and then, we found genes using these clinical variables with higher weights than genes in the penalized Cox regression. As shown in Table 5, six clinical variables and three genes were chosen as the markers for prognostic prediction. While these three genes were reported to be associated with HGSC, no significant results were found in the single-gene analysis.
When analyzing data, a single marker alone may not be sufficient to explain the outcome. For example, the RENIN gene itself cannot distinguish survival patterns of ovarian cancer patients well. However, a model that includes additional gene markers can contribute to improving the predictability by further explaining the areas that cannot be explained by clinical variables [20].
Furthermore, it is important to keep in mind that the issue of false positives can arise, since the analysis is conducted with a relatively small number of samples compared to the number of genes. Therefore, it is important to determine which genes to use in the Cox model by considering the appropriate gene filtering criteria.
We proposed procedures for building prediction models for predicting survival outcomes by integrating RNA-Seq data and clinical information for HGSC. The approach in this study is general, in the sense that it may not be highly dependent on the characteristics of the cancer type or other types of biomarkers. Therefore, this predictive model development process could be easily applied to other types of cancer.
In the future, we hope to apply our approach to other types of genomic data, such as DNA methylation and copy number alterations. In addition, we want to build our integrative prediction model to predict survival times more accurately for other cancer patients.
Acknowledgments
This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI16C2037).
Notes
Conflicts of Interest
No potential conflict of interest relevant to this article was reported.
Authors’ contribution
Conceptualization: TA, YSS, TP
Data curation: SJ, LM, SIK
Formal analysis: SJ, LM, TA
Funding acquisition: YSS, TP
Methodology: SJ, LM, TA, TP
Writing – original draft: SJ, LM, TP
Writing – review & editing: TA, YSS, TP