An information-theoretical analysis of gene nucleotide sequence structuredness for a selection of aging and cancer-related genes
Abstract
We provide an algorithm for the construction and analysis of autocorrelation (information) functions of gene nucleotide sequences. As a measure of correlation between discrete random variables, we use normalized mutual information. The information functions are indicative of the degree of structuredness of gene sequences. We construct the information functions for selected gene sequences. We find a significant difference between information functions of genes of different types. We hypothesize that the features of information functions of gene nucleotide sequences are related to phenotypes of these genes.
Introduction
The problem of analyzing symbolic sequences appears in many areas of research, such as “big data” [1] and “dynamic systems” [2]. The most significant example of a symbolic sequence is the nucleotide sequence, which is also an interesting and important mathematical object. Of special importance is the task of clustering nucleotide sequences [3-6]. A nucleotide sequence is here understood as a sequence whose elements assume the values A, C, G, T. The mathematical analysis of nucleotide sequences was first suggested by the physicist Gamow in 1954 [7]. The problem of relations between the symbols of nucleotide sequences was first discussed by the physicist Yockey in the 1950s [8]. About 50 years later, in 2003, the mathematician Gelfand noted that “the use of mathematics in studying gene sequences is an adequate language” [9]. This implied the search for formal (mathematical) properties of gene nucleotide sequences. Yet, insufficient attention has been paid to this subject.
The main method of investigating numeric sequences (or discrete numerical time series) is the construction and analysis of autocorrelation functions. However, the principal difference between nucleotide sequences and numeric sequences is that nucleotides take the symbolic values A, C, G, T rather than numbers. This means that the standard statistical apparatus cannot be used for the analysis of such sequences, insofar as statistics does not provide theoretically justified measures of correlation between symbolic (discrete) random variables. The impossibility of utilizing theoretically justified statistical methods in genetics has been noted earlier [10]. Therefore, information theory, which has a solid theoretical justification, has been increasingly used in the study of biological data. Earlier we applied information theory to analyze data on aging-related diseases [11,12], including cancer [13-15]. The approach described in [13] is presented in the monograph [16].
An overview of the use of information theory for the analysis of biological sequences, in particular DNA sequences, has been presented earlier [17-19]. In the work by Li (1990) [20], mutual information was first used as the measure of correlation in the autocorrelation function of a symbolic sequence. However, mutual information is a non-normalized quantity, and therefore it does not, in the general case, allow researchers to compare the mutual information functions of different symbolic sequences.
The present work, for the first time, uses normalized mutual information as the measure of correlation to construct an autocorrelation function for a symbolic (nucleotide) sequence. Hence, we will refer to this function as the information function. The use of normalized mutual information allows us to compare the information functions of any symbolic sequences. This article presents an algorithm for distributing sets of genes according to their information functions, that is, according to the interconnection between elements in the nucleotide sequences of these genes. Each value of the information function estimates the interconnection between elements of a nucleotide sequence at the corresponding lag. The set of all values of the information function estimates the interconnection of the elements of a nucleotide sequence over all lags; that is, it provides an estimate of the degree of structuredness of that sequence.
It may be hypothesized that genes with “close” information functions may produce similar phenotypes, and the proposed approach may help reveal unknown phenotypic properties of genes according to their nucleotide sequences.
Methods
Gene sequences
To illustrate the distribution algorithm, we consider the nucleotide sequences of 14 genes. Table 1 lists the genes and their sizes in nucleotides. The data on the gene sequences were obtained from the NCBI Gene database (https://www.ncbi.nlm.nih.gov/gene).
The genes used in this study (BCL2, mTOR, FOXO3, FOXO1, IGF1, BRCA2, BRCA1, Klotho, Sirtuin 1, p16, BECN1, CCND1, Sirtuin 6, APOE) were selected mainly because they are often recognized as being involved in aging processes and often constitute networks in aging-associated pathways [21,22]. Hence both their phenotypic properties and their possible mutual relations could be suggested.
Mathematical analysis
Let X be a symbolic random variable with a distribution function as follows:
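P(X = ai) = pi, i = 1, 2, …, s, where a1, a2, …, as are the symbolic values of X and p1 + p2 + … + ps = 1.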
Entropy of random variable X is as follows:
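H(X) = −(p1 log2 p1 + p2 log2 p2 + … + ps log2 ps).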
Let X and Y be symbolic random variables. The mutual information between the variables X and Y is as follows:
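I(X; Y) = H(X) + H(Y) − H(XY),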
where H(XY) is the entropy of the product of the random variables X and Y.
Let X and Y be symbolic random variables. The normalized mutual information (also termed “uncertainty coefficient”) is as follows:
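C(X; Y) = I(X; Y) / H(Y), where normalization of the mutual information by the entropy H(Y) (the conventional form of the uncertainty coefficient) is assumed here.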
The normalized mutual information has the following properties.
(1) 0 ≤ C(X;Y) ≤ 1.
(2) C(X;Y) = 0 if and only if the random variables X and Y are independent (no correlation between the variables).
(3) C(X;Y) = 1 if and only if there is a functional relation (correlation or influence) between X and Y.
Let x(n) = (x(1), x(2), …, x(n), …) represent a discrete time series with symbolic values.
Let x(n + j) = (x(1 + j), x(2 + j), …, x(n + j), …) be the time series x(n) with a lag j.
The auto-mutual information of the time series x(n) with a lag j equals:
I(x(n); x(n + j)) = H(x(n)) + H(x(n + j)) − H(x(n), x(n + j)).
The normalized auto-mutual information of the time series x(n) with a lag j equals:
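C(x(n); x(n + j)) = I(x(n); x(n + j)) / H(x(n + j)), where the same normalization by the entropy of the lagged series is assumed.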
The normalized mutual information C(x(n); x(n + j)) is then calculated as a function of the lag j.
We shall refer to the function F(j) = C(x(n); x(n + j)) as the information function of the discrete time series x(n).
Properties of the information function F(j) are as follows.
(1) 0 ≤ F(j) ≤ 1.
(2) F(j) = 0 if and only if x(n) and x(n + j) are mutually independent.
(3) F(j) = 1 if and only if there exists a functional relationship between x(n) and x(n + j).
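For illustration, a minimal Python sketch of the computation of the information function is given below. It assumes normalization of the auto-mutual information by the entropy of the lagged subsequence (as in the definition above), the helper names are illustrative, and a short artificial nucleotide string is used in place of a real gene sequence.

    from collections import Counter
    from math import log2

    def entropy(symbols):
        """Shannon entropy (in bits) of a sequence of symbols."""
        counts = Counter(symbols)
        n = len(symbols)
        return -sum(c / n * log2(c / n) for c in counts.values())

    def information_function(seq, max_lag):
        """Normalized auto-mutual information F(j) for lags j = 1, ..., max_lag.

        Normalization by the entropy of the lagged subsequence is an
        assumption of this sketch; other normalizations are possible.
        """
        values = []
        for j in range(1, max_lag + 1):
            x = seq[:-j]                      # x(n)
            y = seq[j:]                       # x(n + j)
            pairs = list(zip(x, y))           # joint observations (x(n), x(n + j))
            i_xy = entropy(x) + entropy(y) - entropy(pairs)   # auto-mutual information
            h_y = entropy(y)
            values.append(i_xy / h_y if h_y > 0 else 0.0)
        return values

    # Toy example (an artificial periodic sequence, not a real gene):
    # a strictly periodic sequence yields values close to 1 at every lag.
    print(information_function("ACGTACGTACGTACGTACGT", 4))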
Let {x1(n), x2(n), …, xk(n)} be a set of discrete time series whose elements are symbols, e.g., gene nucleotide sequences, n = 1, 2, 3, …, where the maximum value of n for a sequence xi(n), 1 ≤ i ≤ k, equals the number of elements in that nucleotide sequence.
The algorithm of distributing a set of time series {x1(n), x2(n), …, xk(n)} consists of three procedures: (1) construction of an information function matrix; (2) ranking of columns of the information function matrix; and (3) application of a multiple comparisons method.
Construction of an information function matrix
For each time series xi(n), 1 ≤ i ≤ k, we construct the information function as follows:
Fi(j), 1 ≤ i ≤ k, 1 ≤ j ≤ m,
where m is the number of lags in the information function.
We obtain the k × m matrix [Fi(j)] of the values of the information functions, i.e., a matrix in which each row is the information function of the corresponding time series.
Ranking of columns of the information function matrix
Each row of the [Fi(j)] matrix is the information function of a time series, and each column contains the values of the information functions corresponding to the same lag.
For each column of the [Fi(j)] matrix, we rank its entries, assigning rank 1 to the smallest entry of the column. We obtain the k × m matrix of ranks [ri(j)], with each column of the matrix containing ranks from 1 to k.
We estimate the element interconnection of the i-th time series, as compared to the element interconnection of the other time series, by the sum of all elements of the i-th row of the matrix [ri(j)]. Such an estimate allows us to use multiple comparisons of rank statistics to compare the interconnection of the time series.
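As an illustration of the matrix construction and column ranking, the following sketch ranks each column of a small hypothetical information-function matrix, assigning rank 1 to the smallest entry; the tie-handling convention (average ranks) is an assumption of the sketch.

    import numpy as np
    from scipy.stats import rankdata

    def rank_columns(F):
        """Rank the entries of each column of the k x m matrix F (rank 1 = smallest)."""
        F = np.asarray(F, dtype=float)
        # apply rankdata along axis 0, i.e., separately to each column
        return np.apply_along_axis(rankdata, 0, F)

    # Hypothetical 3-gene x 4-lag matrix of information-function values
    F = [[0.010, 0.008, 0.005, 0.004],
         [0.020, 0.015, 0.012, 0.011],
         [0.012, 0.009, 0.007, 0.006]]

    R = rank_columns(F)
    print(R)                    # matrix of ranks [ri(j)]
    print(R.sum(axis=1))        # row sums used to compare the time series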
Results
The values and clustering of gene information functions
Following the above algorithm for distributing a set of time series, we calculate and cluster the values of gene information functions, as follows.
(1) For each of the 14 genes under consideration, we calculate the information function with 12 lags. We obtain the information function matrix [Fi(j)], 1 ≤ i ≤ 14, 1 ≤ j ≤ 12 (Table 2).
(2) We rank the entries of each column of the information function matrix, assigning rank 1 to the smallest value. We obtain the rank matrix [ri(j)], 1 ≤ i ≤ 14, 1 ≤ j ≤ 12 (Table 3).
Let us consider Table 3 as the Friedman statistical model [24] and examine the row effect of this table.
Hypotheses:
H0: There is no row effect (“null hypothesis”).
H1: The null hypothesis is invalid.
Critical range:
The sample is “large”; therefore, the critical range is the upper 1% range of the χ² distribution with 13 degrees of freedom.
Let us calculate the χ² criterion. This gives us χ² = 91.65. The critical range is χ² > 27.69. Since 91.65 > 27.69, the null hypothesis with respect to Table 3 is rejected. Thus, according to the Friedman test, a row effect has been found. Hence, there is a difference between the rows under consideration.
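A sketch of this computation with standard tools is shown below. It assumes that the matrix of Table 2 is available as a nested list F (a small hypothetical matrix stands in for it here) and passes each gene's row of lag values as one related sample, so that ranking is performed within each lag, as in Table 3.

    from scipy.stats import friedmanchisquare

    # Hypothetical 4-gene x 3-lag matrix standing in for Table 2
    F = [[0.010, 0.008, 0.005],
         [0.020, 0.015, 0.012],
         [0.012, 0.009, 0.007],
         [0.018, 0.014, 0.010]]

    # Row (gene) effect: each row is one related sample, ranked within each lag (column)
    stat, p_value = friedmanchisquare(*F)
    print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.4g}")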
For multiple comparisons, we use the Newman-Keuls test. We obtain |Ri − Ri+1| > 8.93, where Ri and Ri+1 are elements of the column “Sum of ranks” in the i-th and (i + 1)-th rows of Table 3, respectively. By multiple comparisons, we construct the clustering shown in Table 4.
The obtained clustering possesses the following properties: (1) for two neighboring sets in Table 4, the smallest element of one set and the greatest element of the adjacent set differ significantly (αT = 0.01); (2) elements belonging to the same set do not differ from each other (αT = 0.01).
Note that the differences between cluster 1 (APOE gene) and all the other elements (genes) are statistically significant (αT = 0.01). The same holds true for cluster 3 (BECN1 gene), cluster 4 (mTOR gene), and cluster 7 (IGF1 gene).
The significance of gene information functions
The domain of the information functions under consideration is the set {Lag 1, Lag 2, Lag 3, …, Lag 12}, and their values lie in the set of real numbers from 0 to 1. We perform a comparative analysis of the values of the information functions on the domain of those functions.
In Table 2, each row represents the values of the information function of the corresponding gene. We rank the values of each row of Table 2, assigning rank 1 to the smallest value. We obtain Table 5.
We evaluate the values of the information functions at Lag j as the sum of the elements of the column Lag j of Table 5. Let us consider Table 5 as the Friedman statistical model and examine the column effect of this table.
Hypotheses:
H0: There is no column effect (“null hypothesis”).
H1: The null hypothesis is invalid.
Critical range:
The sample is “large”; therefore, the critical range is the upper 1% range of the χ² distribution with 11 degrees of freedom.
Let us calculate the χ² criterion. This gives us χ² = 121.5. The critical range is χ² > 24.73. Since 121.5 > 24.73, the null hypothesis with respect to Table 5 is rejected. Thus, according to the Friedman test, a column effect has been found. Hence, there is a difference between the columns under consideration.
For multiple comparisons, we use the Newman-Keuls test. We obtain |Ri − Ri+1| > 9.64, where Ri and Ri+1 are elements of the column “Sum of ranks” in the i-th and (i + 1)-th rows of Table 5, respectively. By multiple comparisons, we construct the clustering shown in Table 6.
The obtained clustering possesses the following properties: (1) for two neighboring sets in Table 6, the smallest element of one set and the greatest element of the adjacent set differ significantly (αT = 0.01); (2) elements belonging to the same set do not differ from each other (αT = 0.01).
Note that the differences between cluster 1 (Lag 1) and all the other elements are statistically significant (αT = 0.01). The same holds true for cluster 2 (Lag 2), cluster 3 (Lag 6), and cluster 4 (Lag 3).
The values of the information functions at Lag 6 are greater than the values at Lag 3, Lag 4, and Lag 5. This signifies that, for the group of genes under consideration, the interconnection between elements separated by a lag of 6 is greater than the interconnection between elements located closer together, namely at lags of 3, 4, and 5.
Discussion
In this work we established a novel method, based on information theory, for evaluating the degree of structuredness of gene sequences (the information function) via the normalized mutual information of the sequences. This new method may serve as an additional structural evaluation tool for genomic analysis, and for omics biomarker analysis generally. In the future, it may be possible to associate the gene structuredness evaluated by the present method with the expression and phenotype of the particular genes under consideration. Here we, for the first time, describe the methodology for calculating gene structuredness, while associating gene structuredness with gene expression and phenotypic function will be the task of future work.
Even though the present work only describes the methodology, some hypotheses may be advanced concerning the possible association of the value of gene structuredness, as shown here by the clustering (Table 4), with some known phenotypic properties of the genes considered in this study. Thus the genes APOE, BECN1, mTOR, and IGF1 each form a separate cluster according to their level of structuredness. This may indicate that each of these genes possesses properties not shared by the other genes. The genes FOXO1 and FOXO3 are in the same cluster, which may be expected for genes of the same group. Interestingly, the genes BRCA1 and BRCA2 are found in different clusters. As has been demonstrated, the BRCA1 and BRCA2 genes are associated with different types of tumors, and this distinction may be reflected in the information function (structuredness) of these genes [25,26].
Of special interest are clusters 2 and 5. Cluster 2 includes the genes Sirtuin 1 and Sirtuin 6, together with the genes CCND1, p16, and BRCA1. A special characteristic of all five genes in cluster 2 is that, under conditions of overexpression, they are associated with oncological diseases, though not necessarily under conditions of normal expression or under-expression [27-32]. On the other hand, a characteristic feature of cluster 5 is that all three genes in this cluster (Klotho, BRCA2, and BCL2) are associated with oncological diseases under conditions of under-expression [33-35], whereas under normal expression or overexpression such an association is not observed. Thus it may be hypothesized that the level of gene sequence structuredness, at least in the present gene selection, may be somehow associated with the effects of extreme gene expression, either overexpression or under-expression. Yet, clarification of this hypothesis, as well as positing and testing additional hypotheses concerning a potential association between gene structure and function, will require further extensive investigation.
Notes
Authors’ Contribution
Conceptualization: DB, IS. Data curation: IS, JG. Formal analysis: DB, JG. Methodology: DB. Writing – original draft: DB, IS. Writing – review & editing: DB, IS.
Conflicts of Interest
No potential conflict of interest relevant to this article was reported.