Functional annotation of lung cancer‒associated genetic variants by cell type‒specific epigenome and long-range chromatin interactome
Article information
Abstract
Functional interpretation of noncoding genetic variants associated with complex human diseases and traits remains a challenge. In an effort to enhance our understanding of common germline variants associated with lung cancer, we categorize regulatory elements based on eight major cell types of human lung tissue. Our results show that 21.68% of lung cancer‒associated risk variants are linked to noncoding regulatory elements, nearly half of which are cell type‒specific. Integrative analysis of high-resolution long-range chromatin interactome maps and single-cell RNA-sequencing data of lung tumors uncovers number of putative target genes of these variants and functionally relevant cell types, which display a potential biological link to cancer susceptibility. The present study greatly expands the scope of functional annotation of lung cancer‒associated genetic risk factors and dictates probable cell types involved in lung carcinogenesis.
Introduction
Gene regulation is a critical biological process that determines cellular identity and function. Systematic investigation of chromatin architecture has shown that the spatiotemporal gene regulation process is tightly controlled by cis-regulatory elements (cREs), which modulate the activities of spatially distant promoters [1-4]. Previous studies have reported that cREs are highly dynamic genomic entities whose dysregulation is associated with various human disorders, including genetic and complex diseases [5-7]. This is well described in the 2019 professional release of the human gene mutation database (HGMD), where more than 4,500 disease-associated mutations are in regulatory sequences [8]. The distal target genes of cREs harboring genetic mutations have been identified as causal elements in human diseases. The representative example is polydactyly syndrome, a congenital limb malformation that results from point mutations in a Shh regulatory element [9]. However, the identification of such long-range regulation is difficult since cREs may regulate genes located beyond large genomic distances. In light of this, Hi-C is a novel technique that enables the investigation of genome-wide, all-to-all chromatin interactions and has substantially improved our view of cREs on long-range gene expression control through 3D chromatin organization [10-14].
The pathogenesis of complex diseases is an outcome of a heterogeneous cell population and various causal genetic variants. The causal genetic variants and the functional cell type in which these disease-associated variants may be active are often unclear. Although the recent development of single-cell RNA sequencing (scRNA-seq) technology has allowed the assessment of the cell type-specific transcriptome, gene regulation that underlies the functional properties of complex diseases is not well understood. Therefore, we performed a comprehensive analysis of cell type-specific gene regulatory mechanisms by integrating publicly available data on cell type-specific epigenome, single-cell transcriptome, and 3D chromatin interactome surrounding human lung cancer. We explored the dynamics of epigenomic profiles of eight major cell types that exist in the tumor environment and the relationships of cell type‒specific regulatory elements with genetic risk variants. By integrating high-resolution chromatin contact maps and scRNA-seq data, we characterize cell type dependent expression profile of putative interaction target genes of lung cancer‒associated variant-harboring cREs, and identify a list of potential candidate genes associated with lung carcinoma and their relevant cell type.
Methods
Epigenome profiling of regulatory genome elements
Nineteen H3K27ac chromatin immunoprecipitation followed by high-throughput DNA sequencing (ChIP-seq) datasets were downloaded from the Encyclopedia of DNA Elements (ENCODE) database, representing seven cell populations: epithelial (2 primary epithelial cells from mammary gland, 1 primary epithelial cell from prostate), fibroblast (2 primary fibroblast cells from lung and IMR90 cell line), endothelial (2 primary cells from umbilical vein, 1 primary cell from brain microvasculature), T lymphocyte (1 primary T cell, 1 naïve thymus-derived CD4-positive primary cell, and 1 CD8-positive primary cell), natural killer (NK) cell (1 natural killer primary cell), B lymphocyte (2 primary B cells, and GM12989 cell line), myeloid (3 CD14-positive primary monocyte cells) [2,3]. H3K27ac ChIP-seq data for 3 lung cancer cell lines (A549, A427, and H322) were downloaded from the DBTSS database [15]. H3K27ac ChIP-seq data for two alveolar lung epithelial cells representing squamous type 1 (AT1) cells and cuboidal type 2 (AT2) cells were downloaded from NCBI Gene Expression Omnibus (GEO) database (accession code: GSE84273) [16]. Detailed sample information, biosample ID, and library ID for raw ChIP-seq data are described in Supplementary Table 1. The ChIP-seq reads were aligned against the human genome (hg19) using BWA-mem with default parameters. Non-uniquely mapped, low-quality (MAPQ < 10), and PCR duplicate reads were removed. Peak calling of individual ChIP-seq experiments was performed with MACS2 callpeak with a p-value threshold of 1e-5 and with a respective input control used as the background [17]. cREs were obtained by selecting H3K27ac peaks located distal to promoters (>2.5 kb from transcription start site [TSS]), merging the peaks across the samples, and stitching peaks within a 3 kb distance. For the quantification of cRE activity, reads per million (RPM) values were calculated and quantile normalized across the samples for comparative analysis.
Collection of genome-wide association study‒single nucleotide polymorphisms associated with lung cancer
A total of 286 genome-wide association study (GWAS)‒single nucleotide polymorphism (SNPs) related to lung cancer were obtained from the NHGRI-EBI GWAS catalog (downloaded on August 20, 2019) [18], targeting nine traits as follows: familial lung adenocarcinoma, familial lung cancer, familial squamous cell lung carcinoma, non‒small cell lung cancer, non‒small cell lung cancer (recurrence rate), non‒small cell lung cancer (survival), small cell lung cancer (survival), small cell lung carcinoma, and squamous cell lung carcinoma. We expanded these tag SNPs by using linkage disequilibrium (LD) information (r2 > 0.8). The LD scores were calculated using PLINK for five ethnic populations obtained from 1000 Genomes Phase 3 data. Tight LD associations (r2 > 0.8) recurrent in at least three ethnic groups were used for LD expansion. The number of total LD-expanded SNPs was 2,128, and these SNPs were stored in a manner that each of them was traceable back to its parental tag SNP.
Mapping 3D long-range chromatin interactions
To obtain high-resolution chromatin contact maps in lung tissue, we downloaded in situ Hi-C data for A549 (lung carcinoma cell line), IMR90 (lung fibroblast cell line), GM12878 (lymphoblastoid cell line), and HMEC (human mammary epithelial cell line) cells from the ENCODE database [2,3]. Detailed sample information, biosample ID, and library ID for the raw in situ HiC data are described in Supplementary Table 2. Raw Hi-C sequenced reads were mapped to the human reference genome (hg19) using BWA-mem (-M option). An in-house script was used to remove low-quality reads (MAPQ < 10), the reads that span ligation sites, chimeric reads, and self-interacting reads in which two fragments are located within 15 kb. The read pairs were merged together as paired-end aligned BAM files, and PCR duplicates were removed with Picard. Statistically significant contacts in Hi-C data were identified at a 5 kb resolution using Fit-Hi-C [19]. We used the default Fit-Hi-C code to calculate the Q-value for each bin pair within a 1 Mb genomic window. A Q-value threshold (Q < 0.01) was used to define significant chromatin contacts.
Identification of cell type-specific gene expression using scRNA-seq data
The raw UMI count matrix of scRNA-seq data for lung adenocarcinoma was downloaded from NCBI GEO with accession code GSE131907 [20]. Data covering 11 tumors and 11 distant normal lungs were selected and processed by using the Seurat R package v3.2.2 [21]. We discarded cells that expressed < 200 genes. To exclude low-quality cells from our data, we filtered out the cells that expressed mitochondrial genes in > 20% of their total gene expression. In each cell, the gene expression was normalized on the basis of the total read count and log transformed. To align the cells originating from different samples, 5,000 highly variable genes from each sample were identified by the vst method. We found anchors and aligned the samples based on the top 10 canonical correlation vectors. The aligned samples were scaled, and principal component analysis was conducted. Then, the cells were clustered by unsupervised clustering (0.5 resolution) and visualized by tSNE using the top 40 principal components (PCs). Known marker genes were used to assign each subcluster to a corresponding cell population: EPCAM and KRT19 for epithelial cells, DCN and COL1A1 for fibroblasts, PECAM1 and CLDN5 for endothelial cells, CD3D and TRAC for T lymphocytes, NKG7 and GNLY for NK cells, CD79A and IGHM for B lymphocytes, LYZ and CD68 for myeloid cells, and KIT and MS4A2 for mast cells.
Results
Identification of cell type‒specific cREs associated with human lung tissue
To characterize distal regulatory elements surrounding human lung tissue, we obtained 24 H3K27ac ChIP-seq datasets from the ENCODE, DBTSS, and GEO databases, representing eight major cell types (myeloid cells, T cells, B cells, NK cells, endothelial cells, epithelial cells, lung cancer cells, and fibroblasts) [2,3,15] and identified 86,312 distal cREs. The quantification of cRE activities in RPM indicated that the samples clustered according to cell type of origin (Fig. 1A), largely into two groups consisting of stromal (epithelial cells, endothelial cells, and fibroblasts) and immune cells (myeloid, NK, T, and B cells). The cRE profiles of lung cancer cell lines (A549, A427, and H322) were highly correlated with the epithelial cell type, reminiscent of the cell type of their origin. We identified cell type enriched cREs using a quasi-likelihood F test (false discovery rate [FDR] < 0.05) in Bioconductor package EdgeR by contrasting each cell type to the other cell types [22], which resulted in 45,706 cell type‒specific cREs (9,707 for lung cancer cells, 14,135 for epithelial cells, 6,952 for myeloid cells, 5,486 for endothelial cells, 4,891 for B cells, 4,447 for fibroblasts, 2,194 for T cells, and 193 for NK cells) (Fig. 1B). This large portion of cell type‒specific cREs (52.95%) suggests a dynamic role of regulatory genomic elements in determining cellular identity.
Characterization of lung cancer‒associated genetic variations with cell type-specific cREs
To assess the association of common genetic risk variants for lung cancer with cREs with cell type annotation, we collected 286 tag SNPs from the NHGRI-EBI GWAS catalog (downloaded on 2019.08.20) [18] covering nine lung cancer-related traits. Our LD-based association analysis (r2 > 0.8) showed that 21.33% (62 of 286) of lung cancer SNPs were linked to cREs (Fig. 1C). Among them, 28 tag SNPs were associated with cell type-specific cREs (a full list with the corresponding cell type is provided in Table 1), and 34 tag SNPs were linked to constitutive cREs (Supplementary Table 3). SNP-harboring cREs exhibited cell type‒specific activities, and myeloid (n = 11) and epithelial (n = 10) cell types were recognized by having the most SNP-harboring cREs, highlighting a potential role of these cell types in lung carcinogenesis (Fig. 1D).
Identification of distal target genes of SNP-harboring cREs by high-resolution chromatin contact maps
Despite the enrichment of lung cancer‒associated genetic variants in cREs, their biological function in lung cancer is largely unknown due to the lack of information about their functional target genes. We hypothesized that investigating the physical chromatin contacts between promoters and SNP-harboring cREs may substantially advance our current knowledge regarding the possible regulatory role of noncoding genetic variants. To this end, we downloaded in situ Hi-C data for the A549, IMR90, GM12878, and HMEC cell lines from the ENCODE database [2,3], representing lung cancer, fibroblasts, myeloid cells, and epithelial cells, respectively. We defined long-range chromatin interactions at 5 kb resolution, implementing the Fit-Hi-C algorithm (FDR < 0.01) [19], which resulted in a total of 3,785,594 unbiased, all-to-all chromatin interactions (1,905,639 for A549, 1,207,580 for IMR90, 1,729,755 for GM12878, and 220,529 for HMEC). Focusing on the chromatin interactions connected to the well-annotated protein-coding gene promoters, we identified chromatin interactions anchored within 2.5 kb of a TSS. Comparison of these promoter-centered interactions across the cell types presented a highly dynamic pattern, with lung cancer (A549) and myeloid (GM12878) cells having a large number of unique chromatin interactions, implicating differential spatial arrangements between cREs and promoters in those cell types (Fig. 2A). The average distance for long-range chromatin interactions was similar across cell types (234 kb for A549, 222 kb for GM12878, 269 kb for IMR90, and 221 kb for HMEC) (Fig. 2B). Finally, we predicted putative target genes of cell type-specific cREs harboring lung cancer‒associated variants by using a union set of chromatin interactions (Fig. 2C). To assess the cell type-specificity of the inferred target genes, we integrated a scRNA-seq dataset generated from lung tissues of 11 healthy individuals and 11 lung cancer patients [20]. We found that the inferred target genes showed a higher gene expression in the corresponding cell type compared to the randomly selected controls (empirical p = 0.0016 from 100,000 iterations) (Fig. 2D). Although the statistical testing for cell type specificity in target gene expression indicated insignificance in enrichment for epithelial and myeloid cell types (Fisher’s exact test; p = 0.1591 for epithelial and p = 0.607 for myeloid), we identified 20 and 5 putative target genes in epithelial and myeloid cells, respectively, with the highest gene expression in the corresponding cell type (Fig. 2E). In addition, we found that HYLS1, IL1R1, and PPP1R18 were inferred target genes in endothelial, fibroblast, and T cell, respectively, which also presented a cell type‒specific gene expression. The list of inferred target genes with cell type‒specific gene expression is provided in Table 2. The statistical insignificance in cell type specificity may be the result of the limited number of tested target genes and the insufficient transcripts detected in the corresponding scRNA-seq technique. The improvement of scRNA-seq techniques covering a higher number of transcripts will help make a clear conclusion in this matter. Altogether, the inference of target genes of cell type‒specific, SNP-harboring cREs by using 3D chromatin interaction profile provided considerable insights into functionality of lung cancer‒associated GWAS-SNPs.
Characterization of DDR1 and CD84 as potential risk factors associated with lung carcinogenesis
Since the number of putative target genes with cell type dependent expression was the most prevalent in epithelial and myeloid cells, we sought to pin-point specific risk candidate genes by taking a close examination of the epigenomic landscape surrounding the major cell types in the human lung. First, we found that an epithelial-specific cRE (chr6:30,889,573-30,895,783; FDR = 1.61E-05) contains a SNP at chr6:30,894,965, which is linked to a tag SNP at chr6:30,882,415 (rs114274879) based on LD structure. The SNP was significantly associated with squamous cell lung carcinoma (p = 3.0E-16) [23]. The SNP-containing cRE was linked to the promoter of DDR1 by significant long-range chromatin interaction at 50 kb distance, which, in turn, (Fig. 3A). Interestingly, the function of DDR1 in tumor growth and metastasis displayed an epithelial-specific expression has been previously recognized [28,29]. Next, we identified an additional, myeloid-specific cRE (chr1:160,315,750-160,319,868; FDR = 0.030) that contains two LD-expanded SNPs at chr1:160,317,021 and chr1:160,317,619, whose parental tag SNP is located at chr1:160,210,727 (rs2369473). The SNP was reported for its association with squamous cell lung carcinoma (p = 7.0E-06) [23]. The SNP-containing cRE is linked to the promoter of CD84, a mediator of leukocyte function, by a significant long-range chromatin interaction stretching as far as 230 kb in distance (Fig. 3B). The recently demonstrated function of CD84 in chronic lymphocytic leukemia cells and their microenvironment may support the potential functional implication of CD84 in lung carcinogenesis [30]. Furthermore, single-cell transcriptome data indicated that the expression of CD84 is exclusive in myeloid cells (Fig. 3B). Our results highlight the potential role of DDR1 and CD84 in lung carcinogenesis within epithelial and myeloid populations, respectively. The current work involving the functional annotation of lung cancer GWAS-SNPs and the inference of their putative target genes using 3D chromatin contact information effectively expands potential risk candidate genes and their relevant cell types, and offers a rationale for a further investigation of its function in the designated cell type.
Discussion
We used a comprehensive multi-omics approach that integrates cell type‒specific epigenome, 3D chromatin interactome, and transcriptome data to conduct a functional characterization of lung cancer‒associated GWAS risk variants. It is worth noting that the cellular identity in the global cRE landscape is well replicated regardless of tissue origin, genetic background, and culture condition (e.g., primary cells and immortalized cell line), evidenced by the use of publicly available ChIP-seq data representing various cell types in the current study. This led us to effectively find a considerable portion of risk genetic variants associated with cREs, taking into consideration the genetic LD. The categorization of lung cancer GWAS-SNPs into corresponding cell types provides additional insights into specific cellular populations responsible for lung carcinogenesis. Our results provide evidence that the identification of the cell type-specific promoter-cRE interactome substantially advances the interpretation of GWAS risk variants and broadens the scope for disease risk candidates for lung cancer.
The recent advent of single-nucleus accessible chromatin profiling allows effective identification and characterization of cell populations within human tissues. For example, the single-nucleus ATAC-seq (snATAC-seq) data from the human lung generated by Wang et al. [31] identified six sub-clusters in the epithelial population (AT1/AT2, PNEC, club, basal, and ciliated cells) and cell type‒specific gene regulation associated with viral entry. However, the read-depth and coverage obtained in snATAC-seq data are considerably low for each cell type when compared with bulk ChIP-seq data utilizing primary cells and cell lines. The collection of individual ChIP-seq samples representing major cell types in the human lung, as conducted in this study, may allow a more discrete cell type‒specific investigation of regulatory dynamics. Moreover, the development of scRNA-seq allowed a population-based analysis of transcriptome data. However, it is worth noting that the number of genes detected by scRNA-seq is limited to a few thousand, which largely restricted our scope of investigating the inferred target genes of cell type‒specific cREs. Finally, the current work involving the functional annotation of lung cancer GWAS-SNPs and inference of their putative target genes is a notable endeavor, which will provide qualitative insights into disease mechanisms that may be of value in identifying new risk factors developing new approaches for prevention and treatment.
Notes
Authors’ Contribution
Conceptualization: AJL, IJ. Data curation: AJL. Formal analysis: AJL. Funding acquisition: IJ. Methodology: AJL. Writing - original draft: AJL. Writing - review & editing: AJL, IJ.
Conflicts of Interest
No potential conflicts of interest relevant to this article were reported.
Acknowledgements
This work was supported by the Ministry of Science and ICT through the National Research Foundation in Republic of Korea under grant number NRF-2020R1A2C4001464 to IJ. A portion of the data used for this study was obtained from the Genome-InfraNet (IDs: 1711056526 and 1711065396) of the Korean Bioinformation Center.
Supplementary Materials
Supplementary data can be found with this article online at http://www.genominfo.org.