ML analysis of different nucleotide substitution models
BIC and AICc are the most important criteria in the statistical analysis of ML fits to biological data. Both are used to select the best model from a finite set of candidate models, and both penalize the number of parameters. BIC is based on the likelihood function, while AICc is an estimator of out-of-sample prediction error.
The GTR model has the lowest BIC and AICc scores (347,395 and 347,288, respectively), computed using MEGA with K = 11, as shown in Fig. 1. When rate variation across sites (+G) is added, the BIC and AIC scores of the GTR + G model increase slightly with respect to GTR. When a proportion of invariable sites (+I) is added as well, the GTR + G + I model (K = 13) shows a 0.0072% increase in BIC and a 0.00144% increase in AIC. The HKY model (K = 7) has a BIC of 347,473 and an AICc of 347,405, the lowest among the HKY variants but higher than those of the best-fitting GTR model. Similarly, the HKY + I + G model (K = 9) scores higher than its base model. Both the JC + G + I and K2 + I models (K = 5) have the highest BIC and AICc scores. The BIC and AICc scores of the K2 + I model deviate from those of GTR by 1.49% and 1.50%, respectively.
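The comparison above amounts to ranking models by score and expressing each score as a relative increase over the best one. A minimal sketch follows, using only the GTR and HKY values quoted in the text; the helper names `percent_increase` and `best_model` are illustrative and not part of MEGA.

```python
# Scores quoted in the text as (BIC, AICc); only GTR and HKY are given there.
scores = {
    "GTR": {"BIC": 347395, "AICc": 347288},
    "HKY": {"BIC": 347473, "AICc": 347405},
}


def percent_increase(reference, value):
    """Relative increase of `value` over `reference`, in percent."""
    return 100.0 * (value - reference) / reference


def best_model(model_scores, criterion):
    """Name of the model with the lowest score for the given criterion."""
    return min(model_scores, key=lambda name: model_scores[name][criterion])


best = best_model(scores, "BIC")
for name, s in scores.items():
    d_bic = percent_increase(scores[best]["BIC"], s["BIC"])
    d_aicc = percent_increase(scores[best]["AICc"], s["AICc"])
    print(f"{name}: BIC +{d_bic:.4f}%, AICc +{d_aicc:.4f}% relative to {best}")
```

The same two helpers can be applied to the full set of 24 models once their scores are entered in the dictionary.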
This indicates that the ML method accurately fits the 24 different nucleotide substitution models to the biological data of SARS-CoV-2, MERS-CoV, and SARS-CoV under neutral evolution. According to information theory, BIC is grounded in Bayesian probability and inference, whereas AICc is grounded in frequentist inference; under both criteria, the model with the lowest score is preferred. The simulation results reveal that the lowest and highest scores differ by only about 1.5%, and that the SARS-CoV-2, MERS-CoV, and SARS-CoV data are best fitted by the GTR model. The corrected AIC (AICc) gives better results than the uncorrected AIC, to which it is related by Eq. (3).
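For reference, the three criteria can be computed from a maximized log-likelihood as sketched below. These are the standard textbook definitions; whether they match Eq. (3) term by term, and how MEGA chooses the sample size n, are assumptions here. The input values in the example are hypothetical and are not taken from the analysis.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2K - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction term 2K(K + 1) / (n - K - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > K + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian information criterion: K ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical inputs, for illustration only.
ln_l, k, n = -1000.0, 11, 500
print(aic(ln_l, k), aicc(ln_l, k, n), bic(ln_l, k, n))
```

The correction term vanishes as n grows, so AICc and AIC agree for large samples, while BIC penalizes each extra parameter more heavily once ln n exceeds 2.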
Nucleotide frequencies (f) and rates of base substitution (r) are also key factors in identifying the best nucleotide substitution model with the ML technique. The nucleotide frequencies predicted by the GTR model for the biological data of SARS-CoV-2, SARS-CoV, and MERS-CoV are A = 0.28, U = 0.317, C = 0.195, and G = 0.207. These frequencies are the same for the first 12 models: GTR, GTR + G, GTR + G + I, HKY, TN93, HKY + G, TN93 + G, TN93 + G + I, HKY + G + I, GTR + I, HKY + I, and TN93 + I. The nucleotide frequencies for the T92, T92 + G, T92 + G + I, and T92 + I models (A = 0.299, U = 0.299, C = 0.201, G = 0.201) are likewise constant within the group but differ from those of the preceding models. The JC, JC + G, K2, K2 + G, K2 + G + I, JC + I, JC + G + I, and K2 + I models all assign a frequency of 0.25 to every base, as shown in Fig. 2. Base substitution rates also depend on the nucleotide substitution model; in the GTR model, the r(AU), r(UA), r(CA), and r(GA) substitutions dominate.
Fig. 3 shows the minimum and maximum base substitution rates across all models: r(AU 0.077, 0.122), r(AC 0.05, 0.084), r(AG 0.77, 0.107), r(UA 0.074, 0.115), r(UC 0.06, 0.101), r(UG 0, 0.086), r(CA 0.073, 0.126), r(CU 0.079, 0.119), r(CG 0.05, 0.124), r(GA 0.079, 0.132), r(GU 0, 0.099), and r(GC 0.05, 0.086).
Under the models of uniform substitution among sites, REV, TN93, and HKY leave the frequency parameters free, while the JC and K2 models fix all frequencies at the uniform value 1/4. As a result, the models GTR, GTR + G, GTR + G + I, HKY, TN93, HKY + G, TN93 + G, TN93 + G + I, HKY + G + I, GTR + I, HKY + I, and TN93 + I give the same nucleotide frequencies. The JC and K2 models rely on a different frequency parameterization, so the JC, JC + G, K2, K2 + G, K2 + G + I, JC + I, JC + G + I, and K2 + I models likewise give identical results. The transitional and transversional substitution rates are estimated from the 1st + 2nd + 3rd codon position data by simulation.
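The three frequency classes can be illustrated with a short sketch. It assumes the usual T92 parameterization, in which a single GC-content parameter sets G = C and A = U; whether MEGA derived the reported T92 values in exactly this way is an assumption. The GTR frequencies are the ones quoted in the text, and the function names are illustrative.

```python
BASES = "AUCG"


def uniform_frequencies():
    """JC and K2: every base is fixed at 1/4."""
    return {base: 0.25 for base in BASES}


def t92_frequencies(free_freqs):
    """T92-style frequencies derived from GC content alone (assumed form)."""
    gc = free_freqs["G"] + free_freqs["C"]
    at = 1.0 - gc
    return {"A": at / 2, "U": at / 2, "C": gc / 2, "G": gc / 2}


# Free frequencies reported in the text for the GTR, HKY and TN93 families.
gtr = {"A": 0.28, "U": 0.317, "C": 0.195, "G": 0.207}

t92 = t92_frequencies(gtr)
print({base: round(value, 3) for base, value in t92.items()})
print(uniform_frequencies())
```

With a GC content of 0.195 + 0.207 = 0.402, this gives C = G = 0.201 and A = U = 0.299, in agreement with the T92 values reported above.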
Fig. 3 confirms that transversions outnumber transitions. Overall, the transition/transversion ratio ranges from 0.57 (GTR model) to 0.89 (T92 + G + I); the higher values indicate that the proportion of invariable sites (+I) and/or rate variation across sites (+G) has a stronger effect in the T92 model for the SARS-CoV-2, SARS-CoV, and MERS-CoV sequences.
ML estimation of the substitution matrix and transition/transversion bias
The probability of substitution (R) estimated using ML depends on the base frequency parameters and on the nucleotide substitution model. The base frequency parameters are ΠA = ΠC = ΠG = ΠU = 1/4 for the JC and K2 models, whereas all Πi are free in the GTR, HKY, TN93, and T3 models. Six different nucleotide substitution models were simulated for the biological sequence data of SARS-CoV, MERS-CoV, and SARS-CoV-2.
The JC substitution model gives the same rate, 8.33, for transitional and transversional substitutions, while the K2-parameter model gives a transitional substitution rate of 9.32 for all bases and a transversional substitution rate of 7.84. In general, in the HKY and TN93 models, transitional substitutions are most dominant for C-U and transversional substitutions for G-U and A-U. The GTR and T3-parameter models yield the highest transitional substitution rate for A-G, 11.24 and 12.13, respectively. The lowest transition value in the GTR and T3 models occurs for the C-U pair. The highest transversional substitution probabilities (A-U) in these models are 9.93 and 12.05, as shown in Table 1. In all models except JC and K2, the lowest transitional substitution rate is observed for the C-U pair. Overall, transitional substitutions prevail over transversional substitutions in all models.
The estimated transition/transversion bias is 0.59 for the K2-parameter model, with the 1st + 2nd + 3rd codon positions and the noncoding positions (those not translated into protein) included. There are a total of 43,053 positions in the final dataset. The transition/transversion bias for TN93 and GTR is 0.56, while for the HKY and T3-parameter models it is 0.57, as shown in Fig. 4. Across all models the bias varies only from 0.56 to 0.59, so the transition/transversion bias is consistent overall.
The JC and K2 models belong to the same class of base frequency parameters; accordingly, the JC model shows equal transition and transversion rates. The K2 model shows constant transitional (9.32) and transversional (7.84) substitution rates. In the TN93, T3, HKY, and GTR models, on the other hand, the frequency parameters are free, so the transitional and transversional substitution rates differ. A transition/transversion bias of approximately 0.5 indicates no bias towards either transitional or transversional substitution: when the two kinds of substitution are equally probable, there are twice as many possible transversions as transitions.
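The value 0.5 follows from counting: of the 12 possible directed substitutions among four bases, 4 are transitions and 8 are transversions. The sketch below takes the bias as the ratio of summed transition rates to summed transversion rates, which is a simplifying assumption that applies directly only to the equal-frequency JC and K2 models; for models with free frequencies the rates would additionally be weighted by base frequencies. The rates used are those quoted in the text.

```python
BASES = "AUCG"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (x in PURINES) == (y in PURINES)


def ts_tv_bias(rate):
    """Sum of transition rates divided by sum of transversion rates."""
    transitions = transversions = 0.0
    for x in BASES:
        for y in BASES:
            if x == y:
                continue
            if is_transition(x, y):
                transitions += rate(x, y)
            else:
                transversions += rate(x, y)
    return transitions / transversions


def jc_rate(x, y):
    """JC: one rate for every substitution (value from the text)."""
    return 8.33


def k2_rate(x, y):
    """K2: one rate for transitions, another for transversions (from the text)."""
    return 9.32 if is_transition(x, y) else 7.84


print(round(ts_tv_bias(jc_rate), 2))  # 0.5
print(round(ts_tv_bias(k2_rate), 2))  # 0.59
```

For K2 this reduces to 9.32 / (2 × 7.84) ≈ 0.59, matching the bias reported for that model.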
The SARS-CoV-2 sequence has a higher frequency of U than SARS-CoV, approximately equal to that of MERS-CoV. The cytosine frequency of SARS-CoV-2 is lower than that of both SARS-CoV and MERS-CoV, as shown in Fig. 5; it differs from that of SARS-CoV by around 9.6%. The adenine frequency of SARS-CoV-2 is 29.896%, much higher than that of MERS-CoV and 5.75% different from that of SARS-CoV. The guanine frequency of SARS-CoV-2, on the other hand, is much lower than that of both SARS-CoV and MERS-CoV. The average frequencies of U, C, A, and G across SARS-CoV-2, SARS-CoV, and MERS-CoV are 31.74012%, 19.48521%, 28.04331%, and 20.73135%, respectively. The closeness of the nucleobase frequencies of SARS-CoV-2, MERS-CoV, and SARS-CoV suggests that SARS-CoV-2 is a modified form of the earlier respiratory syndrome viruses.
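Base composition of this kind can be obtained by simple counting. The following is a minimal sketch under stated assumptions: the function name is illustrative, T is treated as U, ambiguous symbols such as N are ignored, and the toy sequence is hypothetical rather than taken from any of the three genomes.

```python
from collections import Counter

BASES = "AUCG"


def base_frequencies(sequence):
    """Percentage of A, U, C and G in a sequence (T is counted as U)."""
    normalized = sequence.upper().replace("T", "U")
    counts = Counter(base for base in normalized if base in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, U/T, C or G")
    return {base: 100.0 * counts[base] / total for base in BASES}


# Hypothetical toy sequence, for illustration only.
freqs = base_frequencies("AUGGCUAAUU")
print(freqs)
```

Applied to each genome in turn, the per-base percentages can then be averaged or compared pairwise as in the paragraph above.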