Hypothetical Protein Predicted to be Tumor Suppressor: A Protein Functional Analysis

Background Litorilituus sediminis, a Gram-negative, aerobic, novel bacterium in the family Colwelliaceae, has a hypothetical protein containing a von Hippel–Lindau (pVHL) domain, which has significant tumor suppressor activity. Therefore, this study was designed to elucidate the structure and function of the biologically important hypothetical protein EMK97_00595 (QBG34344.1) using several bioinformatics tools. Results The functional annotation revealed that the hypothetical protein is an extracellular, secretory, soluble protein carrying a signal peptide and containing the VHL (VHL beta) domain, which has a significant role in tumor suppression. This domain is conserved throughout evolution, as its homologs are found in various organisms such as mammals, insects, and nematodes. The gene product of VHL has a critical regulatory activity in the ubiquitous oxygen-sensing pathway. This domain has a significant role in inhibiting cell proliferation, angiogenesis progression, kidney cancer, breast cancer, and colon cancer. Overall, the current study indicates that the annotated hypothetical protein is linked with tumor suppressor activity, which might be of great interest to future research in higher organisms.


Background
Bacteria possess tremendous versatility that can be harnessed for human welfare, and Litorilituus sediminis may be one such organism. Litorilituus sediminis is a Gram-negative, aerobic, curved-rod-shaped, non-spore-forming, catalase- and oxidase-positive bacterium with a polar or sub-polar flagellum. It was isolated from a sediment sample collected from the coastal region of Qingdao, China [1]. This organism grew optimally at 37°C and pH 8-9. This bacterium was novel among the genera of the family Colwelliaceae. Its phenotypic and chemotaxonomic characteristics, together with well-confirmed phylogenetic evidence that Litorilituus belongs to the family Colwelliaceae, were distinctive enough to imply a novel genus. Compared with other genera, this novel bacterium has prominent concentrations of certain cellular constituents: C16:0 and C16:1 ω7c fatty acids, phosphatidylethanolamine (PE), phosphatidylglycerol (PG), aminophospholipid (PN), and two aminolipids (AL1, AL2), as well as isoprenoid quinone 8 [1]. Along with these cellular components, a large number of proteins exist; approximately 2% of the genes code for proteins, while the remainder are non-coding or still functionally unknown [2].
Genes of unknown function, whose products are referred to as hypothetical proteins, are present in every organism's genome [3]. These are proteins whose existence is not confirmed by any experimental evidence but that can be predicted to be expressed from an open reading frame (ORF) [4]. Hypothetical proteins can be classified as uncharacterized protein families (UPF), which are experimentally verified to exist but have not been identified or linked to a known gene, and domains of unknown function (DUF) [5], which are experimentally characterized proteins lacking known functional or structural domains [6] [7]. Despite the lack of functional characterization, they play a significant role in understanding biochemical and physiological pathways, for example in exploring new structures and functions [8], in identifying pharmacological targets and markers [9], and in early detection and other benefits for proteomic and genomic research [10]. With the advancement of computational biology, it has become easier to analyze hypothetical proteins using bioinformatics tools that offer various advantages, such as the determination of 3D structural conformation, identification of new domains and motifs, assessment of new cascades and pathways, phylogenetic profiling, and functional annotation [11].
Because Litorilituus is a novel genus in the family Colwelliaceae, this study set out to characterize the protein EMK97_00595 [Litorilituus sediminis], a member of the von Hippel-Lindau (VHL) family, which has a major function as a tumor suppressor in higher organisms. The main feature of VHL is that it is a critical regulator of the ubiquitous oxygen-sensing pathway and can act as the substrate recognition component of an E3 ubiquitin ligase complex [12]. It also promotes the degradation of epidermal growth factor receptor and pro-angiogenesis factors, the remodeling of the extracellular matrix, and apoptosis, resulting in tumor suppression [13].
In higher organisms, during cellular normoxia when oxygen is available, cellular HIFα is hydroxylated by prolyl hydroxylase and becomes a suitable substrate for pVHL, which is part of a constitutively active E3 ubiquitin ligase. The hydroxyproline of hydroxylated HIFα provides a binding signal for pVHL, which leads to efficient ubiquitylation and proteasomal degradation of the HIFα protein. In hypoxia, on the other hand, HIFα is not prolyl hydroxylated and may escape pVHL recognition. HIFα then accumulates, forms a complex with HIF1β, enters the nucleus, and activates a transcriptional program to cope with the short-term and long-term effects of oxygen deprivation, including several signaling pathways and angiogenesis factors that can lead to cell proliferation or tumors [13][14]. The function of the hypothetical protein found in Litorilituus sediminis is therefore of considerable interest.
Therefore, this study presents an interpretation of the hypothetical protein EMK97_00595 (QBG34344.1) obtained with an integrated workflow, which may be of research interest in the field of tumor suppression.

Sequence retrieval and similarity identification
The hypothetical protein EMK97_00595 [Litorilituus sediminis] was chosen by exploring the NCBI database, as it may be of significant interest in numerous cancer research fields in the near future. The sequence of the hypothetical protein (GenBank Accession: QBG34344.1 and NCBI Reference Sequence: WP_130598461.1), which may contain a tumor suppressor domain, was retrieved in FASTA format and submitted to several prediction servers for in silico characterization. Initially, a similarity search was performed using the NCBI BLASTp program [15] against the non-redundant and Swissprot databases [16] to predict the function of the hypothetical protein.

Multiple sequence alignment and phylogeny analysis
Multiple sequence alignment is used to explore closely related genes or proteins, to find the evolutionary relationships between them, and to identify shared patterns among functionally or structurally related sequences. Sequence alignment was performed with the MUSCLE server of EBI [17], and the evolutionary relationship between the hypothetical protein EMK97_00595 and the proteins that had structural similarity with it was examined with Jalview 2.11 software [18].
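As a minimal sketch of what "similarity" between two rows of an alignment means numerically, the snippet below computes pairwise percent identity. It is not the procedure used by MUSCLE or Jalview. The function name `percent_identity`, the toy aligned rows, and the choice to leave gapped columns out of the denominator are assumptions made for illustration; other tools use other denominators.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither row has a gap.

    Assumption: gapped columns are excluded from the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # skip columns that contain a gap in either row
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned rows, NOT taken from the study.
ROW_QUERY = "MKT-LLVAG"
ROW_HIT = "MRTALL-AG"

if __name__ == "__main__":
    print(f"identity: {percent_identity(ROW_QUERY, ROW_HIT):.1f}%")
```

Applying the function to every pair of rows gives an identity matrix, which is the kind of input a distance-based tree is built from.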

Analysis of physicochemical properties
ProtParam [5] is a tool that computes various physical and chemical parameters of protein sequences.
The physicochemical properties of the hypothetical protein were predicted using the ProtParam tool in the ExPASy server [19], which predicts all the relative properties including molecular weight, theoretical pI, amino acid composition, the total number of positive and negative residues, instability index, aliphatic index and grand average of hydropathicity (GRAVY) [20][21] [22].
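Some of these parameters can be sketched from their standard definitions: GRAVY is the mean Kyte-Doolittle hydropathy per residue, and the aliphatic index is computed from the mole percentages of Ala, Val, Ile, and Leu with weights 2.9 and 3.9. The sketch below is not ProtParam itself. The function name `physicochemical_summary` and the toy sequence are assumptions, and the toy sequence is not EMK97_00595.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def physicochemical_summary(sequence: str) -> dict:
    """Return charged-residue counts, GRAVY, and aliphatic index."""
    seq = sequence.upper()
    if not seq or any(res not in KYTE_DOOLITTLE for res in seq):
        raise ValueError("sequence must contain only the 20 standard residues")
    counts = Counter(seq)
    length = len(seq)

    def mole_percent(residue: str) -> float:
        return 100.0 * counts[residue] / length

    # GRAVY: sum of hydropathy values divided by sequence length.
    gravy = sum(KYTE_DOOLITTLE[res] for res in seq) / length
    # Aliphatic index: X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)).
    aliphatic_index = (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )
    return {
        "negative": counts["D"] + counts["E"],  # Asp + Glu
        "positive": counts["R"] + counts["K"],  # Arg + Lys
        "gravy": gravy,
        "aliphatic_index": aliphatic_index,
    }


# Hypothetical toy sequence for illustration only.
TOY_SEQUENCE = "MKVLAADEEIRG"

if __name__ == "__main__":
    for key, value in physicochemical_summary(TOY_SEQUENCE).items():
        print(key, value)
```

A negative GRAVY points to a hydrophilic protein, and an excess of Asp + Glu over Arg + Lys is consistent with an acidic pI.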

Analysis of the secondary structure
The servers used to predict the protein secondary structure were SOPMA [23] and PSIPRED [24]. SOPMA is a general secondary structure prediction tool, whereas PSIPRED is a server for comprehensive protein analysis. SOPMA was first used to predict the secondary structure, and its result was then validated with PSIPRED.

3D Structure Modeling and Quality Assessment
The HHpred server [25], which is based on pairwise comparison of profile hidden Markov models, was used to build the 3-dimensional structure from the best-scoring template. The confidence of the predicted structure was also visualized with SWISS-MODEL [26]. Several quality assessment tools of the SAVES and ProFunc [27] servers were applied to estimate the reliability of the predicted 3D structure model of the hypothetical protein. The Ramachandran plot for the model was built using the PROCHECK program [28] to visualize the backbone dihedral angles of the amino acid residues. The quality of the protein 3D structure was assessed with the ERRAT server [29], and the Verify 3D server was used to determine the compatibility of the atomic model (3D) with its amino acid sequence and to compare the results with standard structures [30][31].

Active site determination
Computed Atlas of Surface Topography (CASTp) is an online active site determination server [32] that calculates the location, delineation, and concave surface regions on 3D structures of proteins.CASTp predicted the active site of the selected hypothetical protein that showed the binding sites, amino acid binding regions with area and volume.

Identification of protein subcellular localization and topology
The subcellular location of the hypothetical protein was predicted using the BUSCA web server [33].

Generation of Protein-protein interaction network
As this investigation seeks a tumor suppressor protein from a microorganism, STRING [57] was used to summarize the network information of the VHL tumor suppressor protein. Because the microorganism is novel, no specific network is available for it. The VHL protein from humans was therefore used as a surrogate model that might give insight into how the VHL protein behaves in humans.

Identification of sequence homology
The BLASTp result for the FASTA sequence shows sequence homology with other similar proteins (Tables 1 and 2). The phylogenetic tree constructed from the multiple sequence alignment of the BLASTp hits shows the evolutionary relationship of the selected hypothetical protein (WP_130598461.1) in Figure 2.

Analysis of physicochemical properties
The physicochemical properties of a protein can be characterized by analyzing the corresponding properties of its amino acids. The theoretical pI of 4.22 indicates that the hypothetical protein is negatively charged, and the total numbers of positively (Arg + Lys) and negatively (Asp + Glu) charged residues were found to be 10 and 27, respectively. The computed instability index (II) was 32.71, classifying the protein as stable. The aliphatic index was 77.37, which indicates stability over a wide temperature range. All the other properties are summarized in Table 3.

Total number of negatively charged residues (Asp + Glu): 27
Total number of positively charged residues (Arg + Lys): 10
Instability index (II): 32.71
Total number of atoms: 3189
Aliphatic index: 77.37
Grand average of hydropathicity (GRAVY): -0.261

Secondary structure analysis
The secondary structure of a protein can provide useful information about its function.
According to SOPMA, the query hypothetical protein consists of 21.13% alpha-helix, 9.91% beta-turn, 33.33% extended strand, and 36.15% random coil. The secondary structure results were also cross-checked with the PSIPRED server, which gave similar results. The representative secondary structure of the hypothetical protein (WP_130598461.1) is shown in Figure 3.

Number of proline residues 2
Total number of residues 64

Active site calculation
The active site of the selected hypothetical protein is formed by 11 amino acids, with an area of 52.957 and a volume of 22.609. Chain X of the hypothetical protein shows the amino acids involved in the active site (F, V, Y, Y, T, L, E, V, T, Q, W); see supplementary Figure 6 (A & B).

Assessment of protein subcellular localization and topology
The hypothetical protein appears to be an extracellular, secretory protein with a signal peptide. Protein-sol and SOSUI both predict the hypothetical protein to be soluble. HMMTOP and TMHMM predicted the protein to be non-transmembrane (Table 5). The predicted topology of the protein is shown from the N terminal to the C terminal.
The STRING interaction network of the VHL protein from Homo sapiens is shown in Figure 8 as a model. VHL interacts with various proteins, ranked by their combined score (Table 7). The network has 11 nodes, 40 edges, an average node degree of 7.27, a local clustering coefficient of 0.819, and an expected number of edges of 18. The p-value of protein-protein interaction enrichment, 7.07e-06, indicates that the network has significantly more interactions than expected.
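The reported average node degree follows directly from the node and edge counts, since each edge contributes to the degree of two nodes. The short sketch below reproduces that arithmetic. The function names are illustrative assumptions, and the observed-to-expected edge ratio is only a descriptive figure, not the enrichment statistic that STRING computes.

```python
def average_node_degree(n_nodes: int, n_edges: int) -> float:
    """Average degree of an undirected graph: each edge touches two nodes."""
    if n_nodes <= 0:
        raise ValueError("the network must contain at least one node")
    return 2 * n_edges / n_nodes


def observed_to_expected(n_edges: int, expected_edges: float) -> float:
    """Descriptive ratio of observed to expected edges (not a p-value)."""
    return n_edges / expected_edges


# Values reported for the human VHL network in the text.
N_NODES, N_EDGES, EXPECTED_EDGES = 11, 40, 18

AVG_DEGREE = average_node_degree(N_NODES, N_EDGES)
EDGE_RATIO = observed_to_expected(N_EDGES, EXPECTED_EDGES)

if __name__ == "__main__":
    print(f"average node degree: {AVG_DEGREE:.2f}")
    print(f"observed/expected edges: {EDGE_RATIO:.2f}")
```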

Discussion
The sequence information as well as the structural information contributes to understanding the function of a hypothetical protein. This study aims to characterize a hypothetical protein that showed strong homology with the VHL superfamily, which is involved in tumor suppression. The amino acid sequence of the hypothetical protein EMK97_00595 [Litorilituus sediminis] was therefore retrieved, and the physicochemical properties were first obtained with ExPASy's ProtParam tool; these predictions are deciding factors for the hydrophilicity, stability, and function of the protein [58]. The protein was considered stable, even over a wide temperature range, as the instability index (II) and the aliphatic index were 32.71 and 77.37, respectively. The query protein also appears to be hydrophilic, as the GRAVY was -0.261 (supplementary Table 3).
Protein structure is closely associated with function. The secondary structure elements of a protein, viz. helix, sheet, turn, and coil, are closely associated with its structure, function, and interactions. The query hypothetical protein contains 21.13% alpha-helix, 9.91% beta-turn, 33.33% extended strand, and 36.15% random coil. Findings from SOPMA revealed that the protein has an abundance of coiled regions, which contributes to higher stability and conservation of the protein structure [58]. Moreover, the protein has a substantial percentage of helices in its structure, which may facilitate folding by providing more flexibility to the structure and could thereby increase protein interactions [59].
For the prediction of the protein 3D model, HHpred was employed, and the template with the highest identity was selected to obtain an acceptable model. The query protein WP_012259469.1 showed the highest template identity, 25%, with von Hippel-Lindau disease tumor suppressor; E3 ubiquitin ligase, transcription factor, hypoxic signaling, transcription; [Homo sapiens], with the lowest E-value of 1.1e-11.
Ramachandran plot analysis revealed that 91.1% of residues were located in the most favored regions.
Moreover, residues in additional allowed regions and generously allowed regions were 7.1% and 0.0%, respectively, which indicates that the quality of the model is good and reliable, as it is generally accepted that a model with at least 90% of residues in the most favored regions is likely to be reliable [60], shown in Fig. 4(B). The model is compatible with its sequence, as Verify 3D analysis shows that 93.75% of the residues had an average 3D-1D score of ≥ 0.2. Litorilituus sediminis is a novel species and the investigated protein EMK97_00595 is also novel, so no specific STRING-derived protein-protein network is available for this organism. The protein-protein interaction network shown here, from Homo sapiens, serves only as a surrogate model to evaluate how the protein interacts in humans. The protein-protein interaction of VHL with HIF1A (hypoxia-inducible factor 1-alpha), with a combined score of 0.999, indicated a strong relationship between these two proteins. The interaction between VHL and HIF1A indicates their involvement in the same pathway to suppress tumor activity [12].
Overall, the combinational strategy of computing physicochemical properties, evaluating the secondary structure and tertiary structure information, and domain information analysis denoted the protein as Von Hippel-Lindau tumor suppressor protein that is associated with Von Hippel-Lindau disease.
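The model-quality criteria quoted in this discussion can be collected into a small check, sketched here under stated assumptions. The 90% most-favored-region criterion and the ERRAT threshold of 50 come from the text. The 80% pass level for the share of residues with a 3D-1D score ≥ 0.2 is an assumption, as the text does not state a pass level for Verify 3D. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelQuality:
    most_favored_pct: float      # Ramachandran most favored regions (%)
    errat_quality_factor: float  # ERRAT overall quality factor
    verify3d_pct: float          # residues with 3D-1D score >= 0.2 (%)


def assess(quality: ModelQuality,
           favored_min: float = 90.0,    # from the text
           errat_min: float = 50.0,      # from the text
           verify3d_min: float = 80.0):  # ASSUMPTION, not from the text
    """Return a pass/fail flag for each quality criterion."""
    return {
        "ramachandran": quality.most_favored_pct >= favored_min,
        "errat": quality.errat_quality_factor > errat_min,
        "verify3d": quality.verify3d_pct >= verify3d_min,
    }


# Values reported for the EMK97_00595 model.
REPORTED = ModelQuality(most_favored_pct=91.1,
                        errat_quality_factor=60.7143,
                        verify3d_pct=93.75)
CHECKS = assess(REPORTED)

if __name__ == "__main__":
    for criterion, passed in CHECKS.items():
        print(criterion, "pass" if passed else "fail")
```

Keeping the thresholds as keyword arguments makes it easy to rerun the check with stricter or looser criteria.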

Conclusion
Proteins are the building blocks of life and serve both biological processes and molecular functions in living organisms. Hence, this study investigated the functional role of a hypothetical protein from a novel bacterium (Litorilituus sediminis) that is predicted to possess significant tumor suppression activity. The use of well-established bioinformatics tools to analyze the combined sequence and structural information revealed the likely molecular function of the examined hypothetical protein.
The current investigation suggested that the hypothetical protein may contain a VHL beta domain that is similar to the human VHL beta domain, which is part of the Von Hippel-Lindau tumor suppressor protein (pVHL). This finding, obtained with bioinformatics tools, can therefore open the way for further investigation and experimental validation of this hypothetical protein containing the VHL beta domain. With the aid of modern biotechnology, the protein might in the near future be used to suppress tumor progression in higher organisms such as humans, as an alternative to defective or mutated human VHL protein.

Assessment and validation of protein 3-dimensional structure
The PROCHECK program was used to validate the predicted tertiary structure, where the distribution of φ and ψ angles of the model within the limits is shown (Table 4, Figure 4). The model was presumed to be a good one according to the Ramachandran plot statistics, with 91.1% of residues in the most favored regions. Finally, the structure validation servers Verify3D and ERRAT were used to verify the 3D model built for the target sequence. In the Verify3D graph, 93.75% of the residues have an averaged 3D-1D score ≥ 0.2, which indicates that the environmental profile of the model is good, and the overall quality factor predicted by the ERRAT server was 60.7143, which indicates a quality model. From ProFunc, the average G-factor of the hypothetical protein was calculated to be -0.20, which indicates a usual protein model.

Figure 3 Model

Figure 7 Topology of hypothetical protein

Figure 8

Table 1 :
Similar proteins obtained from the non-redundant database.

Table 2 :
Similar proteins obtained from Swissprot database

Table 4 :
Ramachandran plot statistics of the predicted 3D model for the target protein EMK97_00595 (WP_130598461.1)

Table 5 :
Assessment of subcellular localization
The e-values of 9.11e-05 for the VHL beta domain from ProFunc, 2.71e-09 for the VHL superfamily from SCOP, and 8.1e-03 for the VHL family from Supfam indicate good protein alignments. The alignment range of the VHL beta domain was 133-212, and that of the VHL superfamily and family was 144-200. Protein coil nature was determined using PCoils from the Bioinformatics Toolkit server. According to Phyre 2, the folding pattern of the hypothetical protein is pre-albumin-like, whereas PEF-FunSeqE classifies the protein as immunoglobulin-like. Both fold classes correspond to secreted, soluble proteins and hence provide a well-defined indication of similarity to VHL protein (Table 6).
The initial protein domain was obtained from the Conserved Domain Database (CDD) of NCBI. The regions of the domain, superfamily, and family classifications were determined with the servers CDD, Pfam, SMART, InterPro, SCOP, Supfam, MotifFinder, ProFunc, Phyre 2, and CATH-Gene3D. The domain, superfamily, and family were selected based on the lowest e-value. Hits with higher e-values were filtered out of the selection procedure.
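The selection rule described here, keeping the lowest e-value at each classification level, can be sketched as follows. The record layout and the function name `best_hit_per_level` are assumptions. Only the three e-values reported in the text are used, and the sketch does not reproduce how any of the listed servers report their hits.

```python
# Hits reported in the text; the dictionary layout is an assumption.
REPORTED_HITS = [
    {"level": "domain", "name": "VHL beta", "server": "ProFunc", "evalue": 9.11e-05},
    {"level": "superfamily", "name": "VHL", "server": "SCOP", "evalue": 2.71e-09},
    {"level": "family", "name": "VHL", "server": "Supfam", "evalue": 8.1e-03},
]


def best_hit_per_level(hits):
    """Keep, for each classification level, the hit with the lowest e-value."""
    best = {}
    for hit in hits:
        current = best.get(hit["level"])
        if current is None or hit["evalue"] < current["evalue"]:
            best[hit["level"]] = hit  # a lower e-value replaces the earlier hit
    return best


BEST = best_hit_per_level(REPORTED_HITS)

if __name__ == "__main__":
    for level, hit in BEST.items():
        print(f"{level}: {hit['name']} ({hit['server']}, e-value {hit['evalue']:.2e})")
```

With several servers reporting the same level, the same function would discard every hit except the one with the smallest e-value.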

Table 7 :
Interacting proteins and their combined score from STRING 11.0 server
"Overall quality factor" was estimated by ERRAT, which is used to evaluate the amino acid environment for non-bonded atomic interactions. Higher scores indicate higher quality, and the query protein's quality factor was 60.7143, which is above the generally accepted threshold (>50) for a high-quality model [61]. The average G-factor of the query protein, obtained from ProFunc analysis, is -0.20, which indicates a usual protein model. The protein's active site was determined by CASTp; it contains 11 amino acids (F, V, Y, Y, T, L, E, V, T, Q, W), with an area of 52.957 and a volume of 22.609, shown in Figure 5 (A & B). According to CELLO, BUSCA, and other similar servers, the protein appears to be an extracellular, secretory, non-transmembrane protein with a signal peptide (Table 5). As the functions of secreted proteins are diverse, the query hypothetical protein may act in a paracrine, autocrine, endocrine, or neuroendocrine manner depending on the target [62]. Solubility is an important factor and an excellent index of protein functionality. Protein-sol and SOSUI both predict the hypothetical protein to be soluble, so it may possess good dispersibility and lead to the formation of finely dispersed colloidal systems. The superfamily, family, and domain information was determined by a combined sequence-based and structure-based approach, using the e-values from different sequence and structure analysis servers. These servers suggested that the hypothetical protein EMK97_00595 from Litorilituus sediminis contains a VHL beta domain from the VHL superfamily. The VHL tumor suppressor protein can contribute to tumor suppression in multiple ways, the most common of which is targeting the hypoxia-inducible transcription factor (HIF) for polyubiquitylation and proteasomal degradation [63]. The major contribution of the Von Hippel-Lindau tumor suppressor protein (pVHL) is to suppress clear-cell renal cell carcinoma in kidney cancer [63][64].