In silico approach to calculate the transcript capacity
Article information
Abstract
We sought the novel concept, transcript capacity (TC) and analyzed TC. Our approach to estimate TC was through an in silico method. TC refers to the capacity that a transcript exerts in a cell as enzyme or protein function after translation. We used the genome-wide association study (GWAS) beta effect and transcription level in RNA-sequencing to estimate TC. The trait was body fat percent and the transcript reads were obtained from the human protein atlas. The assumption was that the GWAS beta effect is the gene’s effect and TC was related to the corresponding gene effect and transcript reads. Further, we surveyed gene ontology (GO) in the highest TC and the lowest TC genes. The most frequent GOs with the highest TC were neuronal-related and cell projection organization related. The most frequent GOs with the lowest TC were wound-healing related and embryo development related. We expect that our analysis contributes to estimating TC in the diverse species and playing a benevolent role to the new bioinformatic analysis.
Introduction
There have been various experimental studies regarding enzyme activity [1,2]. Enzyme activity is defined by a measure of the quantity of active enzyme present. Most enzyme activity studies are based on in vitro experiments. This approach is a limited method because it does not contain in vivo situations. Our approach was to introduce transcript capacity (TC) concept and this concept has both resemblance and difference in comparison to the enzyme activity estimation. It can be considered that TC plays a role like enzyme activity as transcript activity or TC but it is mainly related to the analyzed traits. Thus, TC does not match to the concept of enzyme activity, perfectly. Although enzyme activity is not directly associated with TC, the investigation of TC can be one of the important route to examine enzyme activity because both deal with the capacity of specific cellular units, i.e. enzyme and transcript. We calculated TC using the genome-wide association study (GWAS) beta effect and transcript reads in RNA-sequencing (RNA-seq) data, and our study is based on an in silico analysis. Our novel approach was based on that gene effect could be a function of TC and transcript reads. TC refers to the capacity that one unit of transcripts exerts as a cellular function. TC cannot be easily measured in experiments. We calculated TC using bioinformatics studies.
GWAS can be used for finding significant variants and genes associated with given traits. It is a very efficient and powerful method for detecting genes of significance. GWAS has been used to discover disease-associated genes and quantitative trait loci genes [3-5]. Lu et al. [6] tried to discover new loci associated with body fat percent (BF%) and identified cardiometabolic disease risk genomic factors. For TC estimation, we used BF% as the phenotype and performed GWAS. The beta effect in GWAS denotes the coefficient of the regression model and it can be the additive effect of the single nucleotide polymorphism (SNP) in the GWAS [7-9]. The significant markers associated with phenotypes have been used as the significance of the encompassed genes in GWAS. This is plausible because SNPs is linked at those encompassing genes which are usually called “linkage disequilibrium.”
RNA-seq data provides transcripts’ reads [10,11]. Through transcript reads and gene effects, we calculated TC. TC analysis can be important because it can be further analyzed the enzyme activity if the protein expression level instead of transcript reads is given.
Methods
Data description
For GWAS, we used Ansan-Anseong cohort data. These were for a study of a chronic diseases within Ansan city and Anseong rural areas in Korea. The dataset comprised men (8,842 people) between 40–69 years of age who had been residents of the region for at least 6 months [12,13]. Our study was from the 3rd Ansan-Anseong cohort dataset version 2.1. The analyzed phenotypes were BF% unit and the covariates were set to be area, age and sex. The SNP dataset was implemented using Affymetrix Genome-wide Human SNP Array 5.0 (Affymetrix, Santa Clara, CA, USA). The mean call rate was 99.01%. The total number of SNPs was 352,228 and after quality control (minor allele frequency < 0.05, Hardy-Weinberg equilibrium p-value < 0.0001 and missing genotype rate > 0.05), 308,003 SNPs were left.
For transcript reads, we used transcript reads per million (TPM) data at the following website (http://www.proteinatlas.org). The Human Protein Atlas was released with protein profile data covering 48 different human tissues and organs, including adipocytes, the kidney and the liver [14,15]. Among these organs, we used adipocyte’s transcript reads data. The TPM data of the Human Protein atlas is based on the reads per gene. Thus the gene length was pre-considered for the accurate reads estimation per gene.
Genome-wide association study (GWAS)
GWAS was performed using GCTA (a tool for genome-wide complex trait analysis) to estimate the beta effects [9,16]. The following model was:
where y is the phenotypic value (BF%), a is the mean term, b is the additive beta effect of the candidate SNP for association, x is the SNP genotype indicator variable, g is the polygenic effect, and e is the residual. The covariates were sex (male and female), area (Ansan and Anseong) and age. The age was factored to be 10-age steps. The b was the beta effect of the SNPs.
TC calculation
TC was calculated using the following relationships:
Eq. (2) uses the gene effect and transcript reads (TPM) to determine the TC. We assumed that the effect of the gene was proportional to the transcript reads and TC. The gene effect is proportional to TC as given in Eq. (2). If A and B genes’ effect is 10, 10, respectively and transcript reads are 1,000 and 10,000, then TCs of A and B are 0.01 and 0.001. Thus the capacity of the A transcript is 10-fold stronger than B transcript. One unit of A transcripts influences 10-fold to the traits in comparison to the B transcripts.
Gene ontology analysis
The gene ontology (GO) analysis was performed using DAVID (Database for Annotation, Visualization and Integrated Discovery) [17]. The gene catalogue was retrieved from Ensembl DB (http://www.ensembl.org). We selected the genes with the highest and lowest TC (top and bottom 5% in TC values) for the GO analysis.
Results
Figure description
Fig. 1 shows the flow chart of our analysis. It explains the procedure of the TC calculation. The gene effect was calculated using GWAS and TC was calculated using gene effect and TPM. As shown in Eq. (2), the TC unit in our analysis is BF%. Fig. 2 shows the plot of TPM and GWAS beta effect. The genes with a higher TPM had a smaller beta effect across the board. According to Eq. (2), a higher TPM and a smaller beta effect would show a smaller TC. Fig. 3 shows the Manhattan plot of –log10(p-value) and the TC across chromosomes. The p-value was from GWAS results and TC was calculated as shown in Fig. 1.
TC calculation and GO analysis
Table 1 shows the summary statistics (minimum, maximum, average, standard deviation) of GWAS beta effect, TPM and TC. Table 2. shows the gene information with the lowest p-values (<0.0001) and the estimated TC. The TC was simply calculated using Eq. (2). Table 3 illustrates genes’ TC information of the highest p-values (top 5% genes in TC value). The neuronal genes including neuronal-activity regulated genes have important functions in dendrites and synapses and are likely to regulate circuit connectivity directly. Thus for the easy regulation of circuit connectivity, they might have the strategy through possessing low transcript reads and high TC [18]. Brain-derived neurotrophic factor (BDNF; TC, 1.88 BF%) encodes a neurotrophin that is secreted at the synapse. The induction of BDNF promotes both synapse maturation and dendritic growth. BDNF had high TC and its mutation can cause neurological and psychiatric disorders [18]. Table 4 shows the GO with the lowest TC (bottom 5% genes). It shows that the lowest TC’s major GO terms were endoderm formation, wound healing and embryo development. The von Willebrand factor (VWF) is not an enzyme and thus, has no catalytic activity (https://en.wikipedia.org/wiki/Von_Willebrand_factor). In our analysis, the VWF had high transcript reads and low TC (TPM, 155.1; TC, 0.002 BF%).
Discussion
GWAS and expression quantitative trait loci
We used the GWAS and RNA-seq data associated with body fat. The expression quantitative trait loci (eQTL) are genomic loci that explain a variation in expression levels of mRNAs (https://en.wikipedia.org/wiki/Expression_quantitative_trait_loci). Parks et al. [19] showed the genetic control of obesity and gut microbiota composition using eQTL analysis. In our analysis, the accurate beta effect estimation accompanied by p-value are crucial and thus eQTL analysis can help better estimating TC because eQTL information contains the mRNA expression level.
GWAS p-value and accuracy of TC calculation
In GWASs, the significance is guaranteed by the p-value. The p-value is the criteria to dissect the significant variants and those genes from insignificant ones. Although genes’ effects were dissected by the p-value, only the beta effect was used for TC calculation. The gene’s GWAS p-value and beta effect can be varied with respect to analyzed phenotypes. The highly accurate TC calculation should be certified by using diverse phenotypes in GWAS. Additionally, the transcript reads in RNA-seq data can be diverse in various tissues. By using diverse tissues and information from various traits, the accurate TC calculation can be plausible for generally acceptable estimation.
BF% can be used to approximate fat accumulation in adipose tissues. The trait reflects the fat accumulation and the gene’s play in adipose tissue. The GWAS directs the significant variants associated with body fat but the TC quantity only mirrors the transcript activity only related to the analyzed traits as indicated by TC unit (BF%).
Our GWAS calculation for the gene effect was based on the linkage disequilibrium between SNP markers and the gene. Despite the advantages of GWAS using SNP markers, diverse SNP markers per gene can cause a problem. Thus gene-based GWAS can be an another alternative method. And RNA-seq data had better be obtained from the similar sample to the GWAS dataset.
The neuronal and cell projection organization genes were enriched in the GO analysis (Table 2). These genes have low p-values in the GWAS analysis of BF% across the board. Inspecting the relationship between TC and GO terms why the lowest TC values are associated with certain GO terms, is a subject that needs to be addressed. Likewise, the reason why wound healing and embryo development GO terms have low TC values must be elucidated.
Features of the TC calculation in our study
Unlike previous studies that calculate the protein activity, TC calculation is a high-throughput analysis. Vermeirssen et al.[20] used a quantitative in silico analysis to calculate the inhibitory activity of angiotensin I converting enzyme (ACE). They used an in vivo analysis to calculate ACE activity, also. Our study to calculate TC is theoretically novel. Additionally, it is feasible not only to calculate TC, but also to calculate other annotated ones including transcription factor whose activity cannot be easily measured experimentally.
Notes
Authors’ Contribution
Conceptualization: YSL. Data curation: JDO, DS, KHW. Formal analysis: YSL. Funding acquisition: YSL. Methodology: YSL. Writing - original draft: YSL. Writing - review & editing: JDO, DS.
Conflicts of Interest
No potential conflict of interest relevant to this article was reported.
Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2017R1A6A3A11033784 & NRF-2017R1C1B3007144).