Metal ions in metal-binding proteins (metalloproteins) are important for protein stability and also serve as cofactors in functions such as controlling metabolism, regulating signal transport, and maintaining metal homeostasis. In structural genomics, predicting metal-binding proteins helps in selecting a suitable growth medium for overexpression studies and in obtaining functional protein. Computational prediction with machine learning is widely used across bioinformatics, on the premise that all the necessary information is contained in the amino acid sequence. In this study, random forest prediction systems were combined with simplified amino acid alphabets to predict binding sites for individual major metal ions: copper, calcium, cobalt, iron, magnesium, manganese, nickel, and zinc.
Amino acids are the building blocks of proteins. The primary structure of a protein is determined by the arrangement of the 20 naturally occurring amino acids. The function of a protein is determined by its amino acids and also depends on interactions with cofactors, metal ions, and other proteins. Across organisms, proteomes rely on metal ions and metal-binding cofactors to carry out essential functions. It has been estimated that approximately 30% of all proteins contain at least one metal. By binding metal ions or metal-containing cofactors, these proteins play a vital role in biological processes, and the bound metals contribute to protein stability [
Owing to cheaper and more advanced sequencing instruments, the amount of protein sequence data has grown much faster than protein structure data. This is because experimentally determining the three-dimensional structure of a protein is difficult and expensive. Various theoretical and experimental studies have shown that a minimal set of amino acids is sufficient for protein folding [
All the protein sequences were downloaded from the UniProt database [
This resulted in eight data sets containing 186 calcium-containing proteins, 69 cobalt-containing proteins, 215 copper-containing proteins, 315 iron-containing proteins, 961 magnesium-containing proteins, 386 manganese-containing proteins, 74 nickel-containing proteins, and 1,716 zinc-containing proteins. All proteins containing calcium, cobalt, copper, magnesium, manganese, nickel, or zinc were then subtracted from the UniRef50 list, resulting in a collection of non-metalloproteins. The workflow of dataset construction is shown in
To investigate the effect of a particular class of amino acids on metal ion binding, the 20 amino acids were grouped into classes based on shared properties, and the composition of the reduced sets of amino acids was considered. Features were extracted using the simplified amino acid alphabet. It has been estimated that reduced alphabets containing 10–12 letters can be used to design foldable sequences for a large number of protein families. This estimate is based on the observation that little of the information needed to pick out structural homologs in a clustered protein sequence database is lost when the amino acid alphabet is suitably reduced from 20 to 10 letters.
A simplified amino acid alphabet of 18 characters was used (
Conformational similarity indices are proposed by Chakrabarti and Pal [
The BLOSUM-50 matrix is proposed by Cannata
The hydrophobicity scale by Rose
Random forest is a classification algorithm [
When the predictor is used to distinguish proteins containing a certain type of metal ion from proteins that contain no metal at all, both sets must contain the same number of proteins; otherwise, several figures of merit that are commonly used to monitor prediction reliability would be seriously biased. The reliability of the predictions was monitored with the following quantities. If a protein of type 1 must be distinguished from a protein of type 2, a prediction was counted as a true positive if type 1 was correctly predicted, as a true negative if type 2 was correctly predicted, as a false negative if a type 1 protein was predicted to be type 2, and as a false positive if a type 2 protein was predicted to be type 1. From these counts, the following figures of merit were computed: the sensitivity, the specificity, the accuracy, and the Matthews correlation coefficient [
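The four figures of merit can be sketched in Python from the confusion counts. The formulas below are the standard textbook definitions; the function name `figures_of_merit` and the example counts are illustrative assumptions, not values from this study.

```python
import math

def figures_of_merit(tp, tn, fp, fn):
    """Standard binary-classification scores from confusion counts.

    Assumes both classes are non-empty (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)          # fraction of type 1 proteins recovered
    specificity = tn / (tn + fp)          # fraction of type 2 proteins recovered
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical balanced example: 100 type 1 and 100 type 2 proteins.
scores = figures_of_merit(tp=80, tn=70, fp=30, fn=20)
print(scores)
```

With equally sized sets, as required above, the accuracy is simply the mean of sensitivity and specificity.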
Amino acid cluster variables were obtained from a simplified amino acid alphabet based on three independent amino acid classifications. Conformational similarity gives seven clusters: [CMQLEKRA], [P], [ND], [G], [HWFY], [S], and [TIV]. The BLOSUM-50 substitution matrix gives eight clusters: [P], [KR], [EDNQ], [ST], [AG], [H], [CILMV], and [YWF]. The hydrophobicity scale gives five clusters: [CFILMVW], [AG], [PH], [EDRK], and [NQSTY]. Of these 20 clusters, [P] and [AG] each appear in more than one simplified alphabet and were counted only once, which results in 18 variables (
The percentage of occurrence
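A minimal sketch of the feature extraction follows. It assumes that each of the 18 variables is the percentage of residues in a sequence that belong to the corresponding cluster. The name `cluster_composition` and the example sequence are hypothetical, and the handling of non-standard residue letters is a choice made for this sketch only.

```python
# The 18 clusters in the order V1..V18 of the variable table.
CLUSTERS = [
    "CMQLEKRA", "P", "ND", "G", "HWFY", "S", "TIV",   # conformational similarity
    "CFILMVW", "AG", "PH", "EDRK", "NQSTY",           # hydrophobicity scale
    "FWY", "CILMV", "H", "ST", "EDNQ", "KR",          # BLOSUM-50
]

def cluster_composition(sequence):
    """Return 18 percentages, one per cluster, for a protein sequence.

    Letters outside the 20 standard amino acids count towards the
    sequence length but belong to no cluster (an assumption of this sketch).
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    return [100.0 * sum(seq.count(aa) for aa in cluster) / n
            for cluster in CLUSTERS]

# Toy sequence, invented for illustration only.
features = cluster_composition("MHHPAGKC")
print(len(features), features[:3])
```

Because a residue can belong to clusters from more than one alphabet, the 18 percentages do not sum to 100.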
The specific steps of the wrapper approach followed in this study were as follows.
The data were partitioned with 10-fold cross-validation (
On each cross-validation training set, the learning machine was trained using all 18 variables to produce a ranking of the variables by importance. The cross-validation test set predictions were recorded.
The least important variable was then removed, another learning machine was trained on the remaining variables, and the cross-validation test set predictions were recorded again. This step was repeated, removing one variable at a time, until only a small number remained.
The predictions from all 10 cross-validation test sets were aggregated, and the aggregate accuracy was computed at each step of the variable reduction.
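The steps above can be sketched as a short Python loop. To keep the block dependent on the standard library only, a simple nearest-centroid classifier stands in for the random forest, and its "importance" (the absolute difference between class centroids) stands in for the random forest variable importance. Both are assumptions of this sketch, as are the names `nearest_centroid` and `wrapper_elimination` and the toy data. The sketch also assumes that the ranking is computed once per fold from the model trained on all variables.

```python
import random

def nearest_centroid(X, y):
    """Stand-in learner (NOT a random forest): returns (predict, importances).

    Assumes labels 0 and 1 are both present in the training data.
    """
    n_feat = len(X[0])
    cent = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cent[label] = [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]

    def predict(x):
        dist = {c: sum((a - b) ** 2 for a, b in zip(x, m)) for c, m in cent.items()}
        return min(dist, key=dist.get)

    importances = [abs(a - b) for a, b in zip(cent[0], cent[1])]
    return predict, importances

def wrapper_elimination(X, y, learner=nearest_centroid, n_folds=10,
                        min_features=1, seed=0):
    """Aggregate cross-validated accuracy for each number of kept variables."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = {}  # number of kept variables -> correct test predictions
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        y_train = [y[i] for i in train]
        # Rank all variables on this training set, most important first.
        _, imp = learner([X[i] for i in train], y_train)
        ranked = sorted(range(len(imp)), key=lambda j: imp[j], reverse=True)
        # Drop the least important variable one at a time and retrain.
        for k in range(len(ranked), min_features - 1, -1):
            keep = ranked[:k]
            predict, _ = learner([[X[i][j] for j in keep] for i in train], y_train)
            hits = sum(predict([X[i][j] for j in keep]) == y[i] for i in test)
            correct[k] = correct.get(k, 0) + hits
    # Every sample is in exactly one test fold, so divide by the sample count.
    return {k: c / len(X) for k, c in correct.items()}

# Toy data: variable 0 separates the classes, the other two are mostly noise.
y_toy = [i % 2 for i in range(40)]
X_toy = [[10 * (i % 2) + i % 5, (7 * i) % 3, (3 * i) % 4] for i in range(40)]
acc = wrapper_elimination(X_toy, y_toy)
print(acc)
```

Passing the learner as a function keeps the elimination loop independent of the classifier, so a random forest implementation that returns a prediction function and per-variable importances could be substituted without changing the loop.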
Following the steps above, feature selection was carried out with a wrapper approach built on the random forest machine learning algorithm. Based on aggregate accuracy, the most important variable for copper ion prediction is PH, and the least important variables are AG and CMQLEKRA (
For example, cobalt-binding proteins can be discriminated from non-metalloproteins using all 18 variables with an accuracy of 85% (
In this work, a new random forest based approach was developed that combines hybrid features from simplified amino acid alphabets to predict binding sites for iron, copper, manganese, magnesium, nickel, calcium, cobalt, and zinc from amino acid sequence data. The results indicate that the random forest model predicts metal ion binding with accuracies ranging from 0.688 (manganese) to 0.907 (nickel) when all 18 variables are used. These metal binding prediction methods help to avoid the selection of ‘impossible’ targets in structural biology and proteomics.
The author acknowledges the Department of Diagnostic and Allied Health Sciences, Faculty of Health and Life Sciences, Management & Science University, Shah Alam, Selangor Darul Ehsan, Malaysia, for providing the infrastructure needed to carry out this research.
Construction of dataset used for prediction.
Performance of the random forest classifier under feature selection (10-fold cross-validation, cobalt ion prediction).
The 18 variables, obtained by merging three simplified alphabets of amino acid residues used to represent protein sequences
Variable | Residues |
---|---|
V1 | CMQLEKRA |
V2 | P |
V3 | ND |
V4 | G |
V5 | HWFY |
V6 | S |
V7 | TIV |
V8 | CFILMVW |
V9 | AG |
V10 | PH |
V11 | EDRK |
V12 | NQSTY |
V13 | FWY |
V14 | CILMV |
V15 | H |
V16 | ST |
V17 | EDNQ |
V18 | KR |
Overall prediction performance of the classifier in predicting individual metal ion binding sites
Metal | Sensitivity | Specificity | Matthews correlation | Accuracy |
---|---|---|---|---|
Ca | 0.769 | 0.739 | 0.507 | 0.754 |
Co | 0.884 | 0.823 | 0.708 | 0.853 |
Cu | 0.746 | 0.815 | 0.563 | 0.781 |
Fe | 0.772 | 0.740 | 0.512 | 0.756 |
Mg | 0.766 | 0.714 | 0.481 | 0.740 |
Mn | 0.729 | 0.647 | 0.378 | 0.688 |
Ni | 0.945 | 0.869 | 0.817 | 0.907 |
Zn | 0.740 | 0.640 | 0.382 | 0.690 |
Feature selection of variables in improving the performance of copper ion prediction against proteins that lack metal ions
Variable removed | Average sensitivity | Average specificity | Average accuracy | Average Matthews correlation |
---|---|---|---|---|
None | 0.746 | 0.815 | 0.781 | 0.563 |
AG | 0.762 | 0.809 | 0.786 | 0.571 |
CMQLEKRA | 0.794 | 0.804 | 0.799 | 0.599 |
NQSTY | 0.779 | 0.814 | 0.796 | 0.593 |
EDNQ | 0.796 | 0.797 | 0.796 | 0.592 |
CFILMVW | 0.785 | 0.803 | 0.794 | 0.588 |
TIV | 0.785 | 0.798 | 0.792 | 0.583 |
PH | 0.774 | 0.801 | 0.788 | 0.576 |
Feature selection of variables in improving the performance of calcium ion prediction against proteins that lack metal ions
Variable removed | Average sensitivity | Average specificity | Average accuracy | Average Matthews correlation |
---|---|---|---|---|
None | 0.769 | 0.738 | 0.754 | 0.507 |
P | 0.783 | 0.758 | 0.770 | 0.541 |
EDNQ | 0.788 | 0.751 | 0.770 | 0.541 |
EDRK | 0.796 | 0.758 | 0.777 | 0.554 |
PH | 0.785 | 0.756 | 0.770 | 0.541 |
CILMV | 0.801 | 0.754 | 0.777 | 0.556 |
AG | 0.790 | 0.749 | 0.770 | 0.539 |
CFILMVW | 0.789 | 0.765 | 0.777 | 0.554 |
NQSTY | 0.785 | 0.767 | 0.776 | 0.552 |
CMQLEKRA | 0.780 | 0.765 | 0.772 | 0.545 |
Feature selection of variables in improving the performance of cobalt ion prediction against proteins that lack metal ions
Variable removed | Average sensitivity | Average specificity | Average accuracy | Average Matthews correlation |
---|---|---|---|---|
None | 0.884 | 0.823 | 0.853 | 0.708 |
CILMV | 0.903 | 0.842 | 0.872 | 0.747 |
CFILMVW | 0.899 | 0.837 | 0.868 | 0.737 |
ND | 0.894 | 0.828 | 0.861 | 0.724 |
EDNQ | 0.884 | 0.833 | 0.858 | 0.717 |
PH | 0.894 | 0.847 | 0.870 | 0.741 |
ST | 0.903 | 0.837 | 0.870 | 0.742 |
NQSTY | 0.860 | 0.833 | 0.846 | 0.693 |
Feature selection of variables in improving the performance of iron ion prediction against proteins that lack metal ions
Variable removed | Average sensitivity | Average specificity | Average accuracy | Average Matthews correlation |
---|---|---|---|---|
None | 0.772 | 0.740 | 0.756 | 0.512 |
NQSTY | 0.778 | 0.731 | 0.754 | 0.509 |
S | 0.786 | 0.727 | 0.757 | 0.514 |
PH | 0.786 | 0.724 | 0.755 | 0.511 |
CMQLEKRA | 0.785 | 0.720 | 0.753 | 0.507 |
CFILMVW | 0.787 | 0.734 | 0.761 | 0.523 |
AG | 0.790 | 0.720 | 0.755 | 0.511 |
TIV | 0.780 | 0.725 | 0.753 | 0.507 |
HWFY | 0.790 | 0.735 | 0.762 | 0.525 |
Feature selection of variables in improving the performance of magnesium ion prediction against proteins that lack metal ions
Variable removed | Average sensitivity | Average specificity | Average accuracy | Average Matthews correlation |
---|---|---|---|---|
None | 0.766 | 0.714 | 0.740 | 0.481 |
ST | 0.779 | 0.714 | 0.746 | 0.494 |
ND | 0.774 | 0.720 | 0.747 | 0.494 |
NQSTY | 0.767 | 0.717 | 0.742 | 0.485 |
S | 0.772 | 0.711 | 0.742 | 0.484 |
HWFY | 0.770 | 0.716 | 0.743 | 0.487 |
PH | 0.777 | 0.709 | 0.743 | 0.487 |
CMQLEKRA | 0.775 | 0.708 | 0.741 | 0.484 |
Feature selection of variables in improving the performance of manganese ion prediction against proteins that lack metal ions
Variable removed | Average sensitivity | Average specificity | Average accuracy | Average Matthews correlation |
---|---|---|---|---|
None | 0.729 | 0.647 | 0.688 | 0.378 |
FWY | 0.731 | 0.717 | 0.734 | 0.474 |
EDNQ | 0.741 | 0.656 | 0.698 | 0.398 |
CMQLEKRA | 0.750 | 0.647 | 0.698 | 0.399 |
AG | 0.750 | 0.643 | 0.697 | 0.396 |
S | 0.739 | 0.660 | 0.700 | 0.400 |
Feature selection of variables in improving the performance of nickel ion prediction against proteins that lack metal ions
Variable removed | Average sensitivity | Average specificity | Average accuracy | Average Matthews correlation |
---|---|---|---|---|
None | 0.945 | 0.869 | 0.907 | 0.817 |
EDRK | 0.950 | 0.887 | 0.918 | 0.838 |
G | 0.931 | 0.892 | 0.917 | 0.824 |
NQSTY | 0.923 | 0.887 | 0.905 | 0.810 |
ST | 0.941 | 0.878 | 0.909 | 0.821 |
EDNQ | 0.936 | 0.865 | 0.900 | 0.803 |
FWY | 0.918 | 0.860 | 0.889 | 0.780 |
HWFY | 0.931 | 0.865 | 0.898 | 0.800 |
TIV | 0.927 | 0.869 | 0.898 | 0.797 |
Feature selection of variables in improving the performance of zinc ion prediction against proteins that lack metal ions
Variable removed | Average sensitivity | Average specificity | Average accuracy | Average Matthews correlation |
---|---|---|---|---|
None | 0.740 | 0.640 | 0.690 | 0.382 |
HWFY | 0.751 | 0.638 | 0.695 | 0.391 |
CMQLEKRA | 0.750 | 0.636 | 0.692 | 0.386 |
AG | 0.747 | 0.638 | 0.693 | 0.388 |
ST | 0.743 | 0.644 | 0.693 | 0.389 |
EDNQ | 0.743 | 0.636 | 0.689 | 0.381 |