Single-cell RNA sequencing (scRNA-seq) has been widely applied to provide insights into the cell-by-cell expression difference in a given bulk sample. Accordingly, numerous analysis methods have been developed. As it involves simultaneous analyses of many cell and genes, efficiency of the methods is crucial. The conventional cell type annotation method is laborious and subjective. Here we propose a semi-automatic method that calculates a normalized score for each cell type based on user-supplied cell type–specific marker gene list. The method was applied to a publicly available scRNA-seq data of mouse cardiac non-myocyte cell pool. Annotating the 35 t-stochastic neighbor embedding clusters into 12 cell types was straightforward, and its accuracy was evaluated by constructing co-expression network for each cell type. Gene Ontology analysis was congruent with the annotated cell type and the corollary regulatory network analysis showed upstream transcription factors that have well supported literature evidences. The source code is available as an R script upon request.
Next-generation sequencing technology has transformed transcriptomics by allowing simultaneous identification and quantification of the expressed RNA molecules. However, this RNA-sequencing applied to bulk samples provides averaged expression profiles, not single-cell level variations. On the other hand, single-cell RNA-seq (scRNA-seq) isolates single cells from a given bulk sample, and measures the expression profile of a number of RNA species from each cell, offering the potential of cell-type characteristics as well as cell-type profiles of the bulk sample [
scRNA-seq involves simultaneous analyses of many cell and genes, efficiency is crucial. One of the time-consuming steps is cell type annotation. The conventional method is laborious and subjective. To address this problem, some annotation methods have been developed by using additional transcriptomic data as reference coupled with machine-learning technique [
We downloaded raw scRNA-seq datasets from the publicly available ArrayExpress DB (E-MTAB-6173) and Gene Expression Omnibus (GSE92332). A 10× Genomics [
Our scRNA-seq analysis pipeline is based on a well-established practice of processing 10× Genomics data. The sequencing data were processed with Cellranger to obtain an expression matrix of RNAs for each cell. Subsequent processing was performed with Seurat for various quality control steps involving cell filtering, normalization, and removal of technical variation, followed by preliminary analyses such as dimension reduction, clustering, cell type annotation, and differential expression analysis [
STAR was used to map FASTQ reads to mm10-3.0.0 mouse reference genome [
Seurat 3.1.0 was used for principal component analysis, t-stochastic neighbor embedding (t-SNE), clustering, and cell-type annotation of the feature-barcode matrix that had been generated by Cellranger. With the clustered result, each cluster is annotated with a cell type by identifying the dominant expression of a list of cell type–specific markers. The conventional method requires manual examination of a violin plot for each marker of a cell type, followed by confirmation of its high expression in the feature plot. This is a laborious step, often producing imprecise results especially for the cases having a high number of clusters or many similar cell types. In order to reduce human errors, we automated the cell type annotation step using the algorithm that has been well developed for gene-set analysis [
The inputs to CTA comprise a feature-barcode matrix, the cluster membership of each cell, and a list of markers for each cell type of the user’s choice. Suppose the following: there are
where
where
WGCNA 1.68 was used to construct cell type–specific co-expression networks. The Seurat output was parsed to be used as an input to WGCNA. The resulting modules were further analyzed for the enrichment of Gene Ontology terms.
For the co-expressed modules, the potential upstream TFs were inferred using iRegulon 1.3 available as an application of Cytoscape. For the TF binding motif search, the 20 kb upstream region of transcription start site of each gene was used.
The FASTQ files of mouse cardiac cell pools (ArrayExpress E-MTAB-6173) were processed with Cellranger, resulting in the feature-barcode matrix of 11,701 cells. Seurat was used to analyze the single-cell level heterogeneity. The cells having mitochondrial RNAs more than 25% of the total expressed RNAs were removed. In order to remove cells having unrealistic RNA varieties, only the cells with unique feature counts within the range between 200 and 5,000 were kept, resulting in a total of 11,587 cells and 17,432 genes. For the t-SNE clustering, 24 dimensional components as inferred from the principal component analysis and the resolution value of 2.0 were used, resulting in a total of 35 clusters. In order to annotate an appropriate cell type to each cluster we employed the CTA method (see Methods) using 12 cell types having 10 unique marker genes per cell type (
The performance of our CTA annotation method was evaluated with the other scRNA-seq dataset of mouse small intestinal epithelium [
In order to assess the quality of our CTA annotation method, we performed co-expression and co-regulatory network analyses with the clustering and annotation results of the mouse cardiac cell pools [
As shown in
Here we propose an efficient semi-automatic processing pipeline of scRNA-seq data, called CTA method. Its quality was assessed by constructing co-expression and co-regulatory networks using the cell type annotation results. Our results were qualitatively congruent with the literature information. In scRNA-seq, many really expressed RNA species are missed. If this drop-out event can be complemented by imputation, much richer information can be retrieved. However, the current implementation of the imputation such as BISCUIT is very slow and not attempted in this work [
The strategy demonstrated in this work may find useful applications in inferring regulators of various cell types. For example, for the cell types whose critical differentiation regulators are elusive, co-regulatory network construction of progenitor and differentiated cells may elucidate key modules for the differentiation.
Conceptualization: SMY, WK, SK. Data curation: SMY, WK. Formal analysis: SMY, WK. Funding acquisition: SK. Methodology: SMY, WK. Writing – original draft: SMY, WK, SK. Writing – review & editing: WK, SMY, SK.
No potential conflict of interest relevant to this article was reported.
This work is based the Master's degree thesis of SMY at Soongsil University, Seoul, Korea. We acknowledge the financial support from the Soongsil University Research Fund. The computational resources were kindly provided by Korea Institute of Science and Technology Information (GSDC & KREONET).
Supplementary data can be found with this article online at
These plots show cumulative normal distribution of Cell Type Activity (CTA) scores for cardiac non-myocytes. Red lines show threshold for determining cell type of clusters. Dots indicate the clusters (A–F).
These plots show cumulative normal distribution of Cell Type Activity (CTA) scores for small intestine epithelial cells. Red lines show threshold for determining cell type of clusters. Dots indicate the clusters (A–K).
This graph displays final cell type annotation based on Cell Type Activity score of small intestine epithelial cell clusters. Principal component analysis was used for dimension reduction and t-stochastic neighbor embedding (t-SNE) is used for visualization.
Unique marker genes of 12 cell types for mouse cardiac cells
Cell Type Activity score matrix of male and female mouse cardiac cell clusters
Unique marker genes of 11 cell types for small intestinal epithelial cells
Cell Type Activity score matrix of male and female mouse small intestinal epithelial cell clusters.
Soft threshold values for network modularization
Gene Ontology term full text of a dendritic cell exemplary module
Analysis workflow diagram. This diagram illustrates overall workflow used in our study. FASTQ files were processed for alignment, cell filtering, UMI count and feature (genes) count by using CellRanger. CellRanger is a popular pipeline that processes Chromium single-cell 3′ RNA-sequencing data. Next, Seurat was used to generate clusters of cells. Seurat is an R package offering functions for secondary analysis such as cell QC, dimension reduction, clustering and differential expressed gene analysis. After clustering, we used our Cell Type Activity estimation technique for cell type annotation. To identify modules from each cell type, WGCNA was used. WGCNA is an R package that calculates correlation weighted network by using gene expression data and creates clusters of genes. Enrichment analysis study was conducted on each modules and iRegulon was used to identify potential transcription factors co-regulating the gene sets.
t-distributed stochastic neighbor embedding (t-SNE) result of cardiac non-myocytes. This plot displays final cell type annotation based on Cell Type Activity (CTA) scores. Feature-barcode matrix from single-cell RNA sequencing data was used for dimension reduction by using principal component analysis. After dimension reduction, t-SNE was used to visualize the clusters. The cell type for each cluster was inferred from the CTA score.
Dendritic cell exemplary module regulatory network. Co-expression network was conducted to find modules of dendritic cell. Gene list in the module that highly enriched immune functions was analyzed to identify co-regulating transcription factors. Network visualization illustrates interactions between transcription factors (TFs; green) and genes (pink) from the module. The analysis revealed that two TF (
Annotation summary table
Cell | Gene | |
---|---|---|
Fibroblast 1 | 6386 | 16840 |
Macrophage | 1427 | 15773 |
Endothelial cell | 920 | 14594 |
Smooth muscle cell | 712 | 14616 |
Fibroblast 2 | 661 | 15324 |
Pericyte | 425 | 12968 |
B cell | 331 | 12646 |
Dendritic cell | 208 | 13007 |
Schwann cell | 179 | 12416 |
Granulocyte | 134 | 9392 |
T cell | 121 | 11282 |
Natural killer cell | 56 | 10187 |
This chart displays summary of cell types after annotation based on Cell Type Activity score. The amount of cells of each cell type is written at cell column. The number of genes that were expressed in the cell type is showed at gene column.
Co-expression module summary table
Cell type | All modules | GO assigned modules |
---|---|---|
Dendritic cell | 98 | 13 |
T cell | 86 | 5 |
Schwann cell | 81 | 7 |
Pericyte | 81 | 2 |
Granulocyte | 76 | 6 |
Endothelial cell | 67 | 13 |
Smooth muscle cell | 57 | 11 |
B cell | 50 | 7 |
Natural killer cell | 46 | 7 |
Macrophage | 37 | 14 |
Fibroblast 2 | 23 | 5 |
Fibroblast 1 | 12 | 8 |
Each column represents the number of modules generated by co-expression network and the number of modules that were enriched Gene Ontology (GO) term, respectively.
Dendritic cell exemplary module GO term result
Enrichment P | Term ID | Term name |
---|---|---|
5.94E-24 | GO:0035456 | Response to interferon-beta |
2.15E-20 | GO:0006952 | Defense response |
3.99E-20 | GO:0045087 | Innate immune response |
1.61E-19 | GO:0035458 | Cellular response to interferon-beta |
4.47E-19 | GO:0043207 | Response to external biotic stimulus |
4.47E-19 | GO:0051707 | Response to other organism |
1.96E-15 | GO:0098542 | Defense response to other organism |
Enrichment analysis revealed the module that is highly associated with immune Gene Ontology (GO) term. This table represents the list of GO terms from tan module.
Dendritic cell inferred TF summary table
Inferred TF | Target gene count |
---|---|
Stat1 | 156 |
Cebpb | 109 |
Ar | 19 |
Sox9 | 32 |
Atf2 | 21 |
Yy1 | 14 |
Cebpe | 8 |
Mybl2 | 10 |
Snai2 | 9 |
Tbx15 | 27 |
Bdp1 | 12 |
The gene set from the exemplary module of dendritic cell was used for regulatory network analysis, which enables to infer co regulation transcription factor (TF). This chart shows revealed TF names and related gene from the module.