In Silico Signature Prediction Modeling in Cytolethal Distending Toxin-Producing Escherichia coli Strains

Maryam Javadi; Mana Oloomi; Saeid Bouzari

doi:10.5808/GI.2017.15.2.69

Original Article

Genomics & Informatics 2017; 15(2): 69-80.

Published online: June 15, 2017

Department of Molecular Biology, Pasteur Institute of Iran, Tehran 13164, Iran.

Corresponding author:
Tel: +98-21-66953311-20, Fax: +98-21-66492619, manaoloomi@yahoo.com

Received April 28, 2017 Revised May 09, 2017 Accepted May 09, 2017

This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract

In this study, the genomes of cytolethal distending toxin (CDT)-producing isolates were compared with the genomes of pathogenic and commensal Escherichia coli strains. Conserved genomic signatures among different types of CDT-producing E. coli strains were assessed. These signatures could be used as biomarkers for research purposes, for clinical diagnosis by polymerase chain reaction, or in vaccine development. The cdt genes and several other genetic biomarkers were identified as signature sequences in CDT-producing strains. The identified signatures include several individual phage proteins (holins, nucleases, terminases, and transferases) and multiple members of different protein families (the lambda family, phage-integrase family, phage-tail tape measure protein family, putative membrane proteins, regulatory proteins, restriction-modification system proteins, tail fiber-assembly proteins, base plate-assembly proteins, and other prophage tail-related proteins). The CDT-producing strains showed a sporadic phylogenic pattern. In conclusion, signature proteins conserved across a wide range of pathogenic bacterial strains can potentially be used in modern vaccine-design strategies.

Keywords: biomarkers, cytolethal distending toxin, genomic signature, multiple alignments, pathogenic Escherichia coli

Introduction

The co-evolution of pathogenic bacteria and their hosts leads to the generation of functional pathogen-host interfaces. Well-adapted pathogens have evolved a variety of strategies for manipulating host cell functions to guarantee their successful colonization and survival. For instance, a group of gram-negative bacterial pathogens produces a toxin known as cytolethal distending toxin (CDT) [¹]. Among the CDT producers is Escherichia coli, which is commonly found in the intestines of humans and other mammals. Most E. coli strains are harmless commensals; however, some isolates can cause severe diseases and are designated as pathogenic E. coli. Among the various pathogenic E. coli strains, some have acquired virulence determinants through the horizontal transfer of genes, such as the cdt genes encoding CDTs. CDTs were the first bacterial toxins identified that block the eukaryotic cell cycle and suppress cell proliferation, eventually resulting in cell death. The active subunits of CDTs exhibit type I deoxyribonuclease-like activity [²,³].

In this study, comparative genome analysis of CDT-producing E. coli isolates with other pathogenic and commensal strains was performed. Alignments between multiple genomes led to the identification of a set of distinct (“signature”) sequence motifs. These signature sequences could be used to delineate single genomes or a specified group of associated genomes within a desired group, such as the CDT-producing E. coli (the target group in this study). Genomic signatures were conserved in the target group, whereas they were not conserved or were absent in other related or unrelated genomes (i.e., the background group). From a clinical point of view, conserved signature sequences could offer advantages in predicting and designing novel CDT inhibitors or vaccine candidates [⁴].

Phylogenic trees can also be constructed based on multiple sequence alignments. Importantly, phylogenies based on a large number of genes or on whole-genome sequences are more reliable than those based on a single gene or a few selected loci [⁵]. Phylogenic analysis can provide an overall classification of the target group among the background group. Alignment of whole-genome sequences yields detailed information on specific differences between genomes and, consequently, has provided new insights into phylogenetic relationships in recent years [⁶,⁷,⁸,⁹].

In this study, phylogenic relationships of CDT⁺ strains with other pathogenic and commensal E. coli strains were assessed, and conserved signature genomic regions in the target group (CDT producers) were annotated. This information could be used for developing molecular diagnostic assays, for designing polymerase chain reaction primers and probes, and in modern vaccine design.

Methods

CDT⁺ strains

Several databases were used to identify bacterial strains harboring cdt genes. Data were extracted from the following resources: GenBank (National Center for Biotechnology Information, NCBI); EMBL (European Molecular Biology Laboratory); DDBJ (DNA Data Bank of Japan); PDB (Protein Data Bank); RefSeq (NCBI Reference Sequence Database); and UniProtKB/Swiss-Prot.

Whole-genome sequences

All genomes analyzed in this study were downloaded from the NCBI file transfer protocol (FTP) site at ftp://ftp.ncbi.nih.gov/genomes.

Reordering of draft genomes

Ordering and orienting contigs in draft genomes facilitates comparative genome analysis. Contig order can be predicted by comparison with a reference genome that is expected to have a conserved genome organization [¹⁰]. ProgressiveMauve (version 2.3.1) was used for ordering contigs in draft genomes. Mauve contig mover (MCM) offers advantages over methods that rely on matches in limited regions near the ends of contigs [¹¹,¹²]. The E. coli K-12 MG1655 strain (accession No. NC_000913.3) was used as the reference genome.

The following optional MCM parameters were used in this study: default seed weight; use seed families: 15; determine Locally Collinear Blocks (LCBs); full alignment; iterative refinement; sum-of-pairs LCB scoring; and min LCB weight: 200.

Multiple genome alignments

In this study, Gegenees software (version 2.2.1) was used for multiple-genome alignments. The software is written in Java, making it compatible with several platforms. No limitations were observed with respect to calculation speed, memory, or the number of genomes that could be aligned. Gegenees software is also capable of performing fragmented alignments [⁴]. Multiple alignments of E. coli genomes were created using a fragment size of 200 nucleotides, a step size of 100 nucleotides, and BLASTN, which was optimized for highly similar sequences.
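The fragmentation step can be illustrated with a minimal Python sketch. It only shows how a sliding window of 200 nucleotides with a step of 100 nucleotides partitions a sequence; it is not the Gegenees implementation. The function name `fragment_genome`, the toy sequence, and the choice to drop a trailing partial window are assumptions made for illustration.

```python
from typing import Iterator, Tuple


def fragment_genome(sequence: str, size: int = 200, step: int = 100) -> Iterator[Tuple[int, str]]:
    """Yield (start, fragment) pairs from a sliding window over the sequence.

    Assumption: a trailing window shorter than `size` is dropped.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


# Toy 500-nt sequence (hypothetical data, not a real genome).
toy_genome = "ACGT" * 125
fragments = list(fragment_genome(toy_genome))

print(len(fragments))                   # number of full 200-nt windows
print([start for start, _ in fragments])  # window start positions
```

With a step equal to half the fragment size, consecutive fragments overlap by 100 nucleotides, so every interior position is covered by two fragments.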

Phylogenic tree construction

A phylogram was produced in SplitsTree 4, using the neighbor-joining method and a distance matrix Nexus file exported from Gegenees software [¹³]. E. albertii TW07627 and E. fergusonii ATCC 35469 strains were set as the out-groups.
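As a rough illustration of what a distance-matrix Nexus file contains, the sketch below converts a small percent-similarity matrix into a symmetric distance matrix and writes it as a Nexus distances block. The conversion rule (one minus the mean of the two directional similarities), the strain labels, and the similarity values are assumptions for illustration; the exact transformation and file layout used by the Gegenees export may differ.

```python
from typing import List, Sequence


def similarity_to_distance(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convert a percent-similarity matrix into a symmetric distance matrix.

    Assumption: distance(i, j) = 1 - mean(sim[i][j], sim[j][i]) / 100.
    """
    n = len(sim)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mean_sim = (sim[i][j] + sim[j][i]) / 2.0
            dist[i][j] = dist[j][i] = round(1.0 - mean_sim / 100.0, 6)
    return dist


def to_nexus(labels: Sequence[str], dist: Sequence[Sequence[float]]) -> str:
    """Write taxa and distances blocks in Nexus format (labels must not contain spaces)."""
    n = len(labels)
    lines = [
        "#NEXUS",
        "BEGIN taxa;",
        f"  DIMENSIONS ntax={n};",
        "  TAXLABELS " + " ".join(labels) + ";",
        "END;",
        "BEGIN distances;",
        f"  DIMENSIONS ntax={n};",
        "  FORMAT labels diagonal triangle=both;",
        "  MATRIX",
    ]
    for label, row in zip(labels, dist):
        lines.append("    " + label + " " + " ".join(f"{d:.4f}" for d in row))
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Hypothetical labels and similarity values (percent), for illustration only.
labels = ["strain_A", "strain_B", "outgroup_X"]
similarity = [
    [100.0, 90.0, 60.0],
    [88.0, 100.0, 62.0],
    [58.0, 60.0, 100.0],
]
distance = similarity_to_distance(similarity)
nexus_text = to_nexus(labels, distance)
print(nexus_text)
```

Averaging the two directions is one simple way to obtain the symmetric matrix that neighbor-joining requires, since fragment-based similarity scores generally depend on which genome is used as the query.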

Identifying conserved signatures

CDT-producing isolates were set as the target group, and all other strains were used as the background group, using the in-group setting tab in Gegenees software. Because of the genomic diversity in CDT-producing E. coli, this procedure was repeated with five different strains (E. coli 53638, E. coli IHE3034, E. coli RN587/1, E. coli STEC B2F1, and E. coli STEC C165-02), each of which was defined as a separate reference strain.

The biomarker score (max/average) setting was also used. Biomarker scores were plotted graphically and loaded into the tabular view for further data analysis. In the tabular view, 1.0 is the maximum biomarker score, and fragments with this score were considered signatures.
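The idea behind a per-fragment biomarker score can be sketched as follows. This is a hypothetical scoring rule, not the formula documented for Gegenees: it assumes each fragment has a normalized alignment score between 0 and 1 against every genome, takes the average over the target group and the maximum over the background group, and multiplies target conservation by background absence. The function name and the toy scores are illustrative assumptions.

```python
from statistics import mean
from typing import Sequence


def biomarker_score(target_scores: Sequence[float], background_scores: Sequence[float]) -> float:
    """Hypothetical biomarker score for one fragment.

    Assumption: score = mean(target) * (1 - max(background)), with all inputs in [0, 1].
    The score reaches 1.0 only when the fragment is perfectly conserved in every
    target genome and absent from every background genome.
    """
    if not target_scores:
        raise ValueError("at least one target score is required")
    target = mean(target_scores)
    background = max(background_scores) if background_scores else 0.0
    return target * (1.0 - background)


# Toy normalized scores (hypothetical values).
perfect = biomarker_score([1.0, 1.0, 1.0], [0.0, 0.0])
shared = biomarker_score([1.0, 1.0], [0.5])

print(perfect, perfect >= 1.0)  # candidate signature fragment
print(shared, shared >= 1.0)    # partly present in the background, not a signature
```

Under this rule, any similarity to even one background genome lowers the score below 1.0, which matches the strict cutoff used to call signatures.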

Assembling signature fragments

Several overlapping fragments were obtained, based on the sequences of each reference strain. To facilitate subsequent analysis steps, the overlapping fragments were assembled using DNA Dragon software, version 1.6.0 (http://www.dna-dragon.com/).

The settings were designed with minimum overlaps (100 bases) along the diagonal length, a minimum %-identity of complete overlapping fragments, and 100% full-search parameters.
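Because signature fragments of 200 nucleotides are generated every 100 nucleotides, adjacent signature fragments overlap and can be joined into longer regions. The sketch below merges fragments by genome coordinate only, assuming a minimum overlap of 100 bases; it does not compare sequences, so it is a simplification rather than a description of how DNA Dragon assembles contigs. The function name and the start positions are hypothetical.

```python
from typing import Iterable, List, Tuple


def merge_fragments(starts: Iterable[int], size: int = 200, min_overlap: int = 100) -> List[Tuple[int, int]]:
    """Merge fixed-size fragments into regions when they overlap by at least min_overlap bases.

    Coordinates are 0-based, half-open (start, end).
    """
    regions: List[List[int]] = []
    for start in sorted(starts):
        end = start + size
        if regions and regions[-1][1] - start >= min_overlap:
            # Enough overlap with the current region: extend it.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            # Gap or insufficient overlap: open a new region.
            regions.append([start, end])
    return [(s, e) for s, e in regions]


# Hypothetical start positions of signature fragments on a reference genome.
signature_starts = [0, 100, 200, 1000, 1100, 5000]
signature_regions = merge_fragments(signature_starts)
print(signature_regions)
```

Each resulting region can then be submitted as a single query in downstream searches instead of many redundant overlapping fragments.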

BLAST

The assembled sequences for each of the five reference strains were searched with NCBI BLASTX (http://blast.ncbi.nlm.nih.gov/Blast.cgi) to identify putative proteins, and putative conserved domains were also detected. The results were confirmed using the UniProtKB BLASTX program (http://www.uniprot.org/blast/).

Results

Strains

The sequences of 76 strains were downloaded from the NCBI site. Details regarding genome sizes, %GC content, the number of encoded proteins, encoded genes, genome type, pathotype, serotype, other characteristics, and accession numbers are summarized in Table 1. Most of the data presented were extracted from NCBI GenBank and UniProt, and some information was extracted from original articles [¹⁴,¹⁵]. The genomes of 24 strains were draft assemblies, and these draft genomes were reordered. Twenty-five CDT⁺ E. coli strains were analyzed, including E. albertii TW07627.

Phylogenic analysis

A heat-plot based on a 200/100 BLASTN fragmented alignment drawn with Gegenees software is shown in Fig. 1. A phylogenic overview is also shown in the heat-plot. A more detailed phylogram was constructed with SplitsTree 4 software, as shown in Fig. 2.

CDT-producing E. coli strains displayed a sporadic phylogenomic pattern in the heat-plot, without a consensus pattern. Six distinct genomic groups of CDT⁺ strains (T1 to T6 in Fig. 2) were identified in the phylogram, all of which were scattered among the strains in Fig. 1. Although CDT-producing strains were distributed sporadically across the phylogram, strains within specific clades were related and showed some degree of similarity.

Signature sequences in the target group

In total, 1,527 fragments representing 3.0% of the E. coli 53638 genome were identified as signature sequences. These biomarkers were restricted to 21 highly significant regions, designated A to U. When E. coli IHE3034 was set as the reference strain, 220 signature fragments (0.4%) were detected, located in six regions designated A to F. When E. coli RN587/1 was the reference strain, 1,512 signature fragments (2.9%) were obtained, restricted to 18 regions (A to R). When E. coli STEC B2F1 was set as the reference strain, 620 biomarker fragments (1.2%) were detected, and 16 biomarker regions (A to P) were recognized. Finally, when E. coli STEC C165-02 was used as the reference strain, 593 signature fragments (1.1%) were identified, restricted to eight regions (A to H). The signature regions for each reference strain are shown separately in Fig. 3. In addition, the biomarker designation, domain description, BLASTX results, and related putative conserved domains for each reference strain are provided in Supplementary Tables 1–6.

Conserved signature proteins

The most common biomarker proteins were distinguished by comparing the BLASTX results for the fragments of all reference strains (Table 2). The signature proteins identified included CDT, holin, lambda-family proteins, nuclease, phage integrase family proteins, phage tail tape measure family proteins, putative membrane proteins, regulatory proteins, restriction-modification system proteins, tail fiber assembly proteins, baseplate assembly proteins, tail fiber protein and other prophage tail-related proteins, terminases, and transferases. The nucleotide sequences of some proteins, including anti-termination proteins, prophage DNA packaging and binding proteins, transposase and DNA transposition proteins, scaffold proteins, recombination-related domains, putative phage-replication proteins, hemolysin, helicase, and the glycosyltransferase and glycohydrolase superfamilies, were detected as biomarkers in the target group, although these BLASTX results were not observed in all reference strains. Presumably, CDT-producing E. coli strains possess several hypothetical proteins whose functions are not yet defined and which might be conserved. The existence of these DNA biomarker sequences in the reference strains is clear; however, the related proteins in some strains have not been determined.
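The comparison across reference strains amounts to asking which protein families are hit in every reference and which appear in only some. A minimal sketch of that bookkeeping is given below; the dictionary keys and family labels are hypothetical placeholders and do not reproduce the contents of Table 2.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def shared_families(hits: Dict[str, Set[str]]) -> Tuple[List[str], List[str]]:
    """Split protein families into those found in all references and those found in only some."""
    counts = Counter(family for families in hits.values() for family in families)
    n_refs = len(hits)
    in_all = sorted(f for f, c in counts.items() if c == n_refs)
    in_some = sorted(f for f, c in counts.items() if c < n_refs)
    return in_all, in_some


# Hypothetical annotation sets per reference strain (placeholders, not study data).
blastx_hits = {
    "ref_1": {"CDT", "holin", "integrase"},
    "ref_2": {"CDT", "integrase"},
    "ref_3": {"CDT", "holin", "integrase", "hemolysin"},
}
in_all, in_some = shared_families(blastx_hits)
print(in_all)   # families annotated in every reference
print(in_some)  # families annotated in only a subset
```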

Significant putative conserved domains and superfamilies

In the era of modern vaccines, finding conserved domains or epitopes has great therapeutic value. Putative conserved domains were classified as non-specific hits (NH), specific hits (SH), and multi-domains (MD), as shown in Supplementary Tables 1–6.

The putative conserved domains and superfamilies that were associated with some signature proteins are shown below.

- NH: PRK15251, DUF4102, CdtB, CDtoxinA, INT_P4, HP1_INT_C, Phage_integrase, INT_Lambda_C, Phage_integ_N, Methylase_S, Caudo_TAP, phage_tail_N, Tail_P2_I, gpI, phage_term_2, Terminase_3, Terminase_5, M, Phage_term_smal, COG5525, Terminase_GpA, Phage_Nu1, dexA, Phage_holin_2, DUF3751, Phage_attach, dcm, DNA_methylase, Cyt_C5_DNA_methylase, Dcm, Glycos_transf_2, and CESA_like
- SH: INT_REC_C, PhageMin_Tail, COG4220, Phage_fiber_2, HSDR_N, Glycos_transf_2, GT_2_like_d, PRK10018, and PLN02726
- MD: PRK09692, int, recomb_XerC, XerD, xerC, HsdS, N6_Mtase, HsdM, hsdM, rumA, P, Terminase_6, COG5484, PLN03114, COG5301, COG0610, hsdR, PRK10458, PRK10073, Glyco_tranf_2_3, WcaA, and PTZ00260
- Superfamilies: RICIN superfamily, EEP superfamily, DNA_BRE_C superfamily, DUF4102 superfamily, Phage_integ_N superfamily, MCP_signal superfamily, Methylase_S superfamily, Caudo_TAP superfamily, phage_tail_N superfamily, Tail_P2_I superfamily, Terminase_3 superfamily, Terminase_5 superfamily, Phage_term_smal superfamily, Terminase_GpA superfamily, Phage_Nu1 superfamily, DnaQ-like-exo superfamily, Phage_holin_2 superfamily, DUF3751 superfamily, Phage_fiber_2 superfamily, Gifsy-2 superfamily, HSDR_N superfamily, Cyt_C5_DNA_methylase superfamily, MethyltransfD12 superfamily, and Glyco_transf_GTA type superfamily

Discussion

The synchronic evolution of bacterial pathogens and virulence-associated determinants encoded by horizontally transferred genetic elements has been observed in several species. E. coli is a normal member of the intestinal microflora of humans and animals, but some E. coli strains have acquired virulence factors by gaining particular genetic loci through horizontal gene transfer, transposons, or phages. These elements frequently encode multiple factors that enable bacteria to colonize the host and initiate disease development [¹⁶]. CDTs belong to one such class of virulence-associated factors. CDT was first identified in E. coli by Johnson and Lior in 1988 [¹⁷]; since then, several studies have reported that CDTs can be produced by intestinal and extra-intestinal pathogenic bacteria [¹⁸].

In this study, the genomes of 25 CDT⁺ E. coli strains were acquired from several gene banks. Multiple genome comparisons were performed with 49 CDT⁻ E. coli strains, including EPEC (enteropathogenic E. coli), ETEC (enterotoxigenic E. coli), STEC (Shiga toxin-producing E. coli), EAEC (enteroaggregative E. coli), EIEC (enteroinvasive E. coli), AIEC (adherent invasive E. coli), UPEC (uropathogenic E. coli), ExPEC (extraintestinal pathogenic E. coli), EHEC (enterohemorrhagic E. coli), environmental strains, and commensal strains.

Phylogenic analysis based on whole-genome information is more accurate than analysis based on one gene or a limited set of genes. In this study, CDT-producing strains did not show a phylogenomic relationship or pattern. Indeed, while they might carry the same or similar virulence gene sets, they also possess their own divergent genomic structures. This is probably because of their complex and distinct evolutionary pathways, indicating an independent acquisition of mobile genetic elements during their evolution.

The sporadic pattern in the phylogenomic dendrogram confirmed previous findings that CDT⁺ strains are heterogeneous. The heterogeneous nature of CDT-producing strains might arise from horizontal gene transfer through mobile genetic elements. These genetic exchanges that occur in bacteria provide genetic diversity and versatility [¹⁹].

A significant challenge in comparative genomics is the utilization of large datasets to identify specific sequence signatures that are biologically important or useful in diagnosis [⁴,²⁰]. In this study, CDT-producing E. coli strains were defined as the target group, and conserved regions that could serve as genomic signatures for the target group were found. Because of the heterogeneous genomic nature of CDT⁺ E. coli, five reference strains were selected instead of one: E. coli 53638 (EIEC), IHE3034 (ExPEC), RN587/1 (EPEC), STEC B2F1, and STEC C165-02. Moreover, in the phylogenomic overview, these five reference strains were selected from different clades of the phylogenic tree, representing the T1–T6 groups.

The findings presented in this study indicate that the major conserved biomarkers beyond CDT were exonuclease, phage integrase, putative membrane, and tail-fiber proteins. Furthermore, the signature proteins of the target group showed that phage-related proteins and virulence-associated factors could be commonly transferred by phages. Moreover, phage-related superfamilies were frequently observed among the putative conserved domains of the biomarker proteins. As a result, the cdt genes were identified as signature sequences in CDT-producing E. coli strains, and it was shown that they can be used as a powerful biomarker.

In this study, the most significant signature proteins in the five reference E. coli strains were identified by in silico analysis of whole-genome sequences. Conserved signature proteins were found to be encoded in a wide range of pathogenic bacterial strains and could be used in future studies, in a broad range of research applications, and in modern vaccine-design strategies.

Acknowledgments

This work was supported financially by the Pasteur Institute of Iran. We would like to thank Editage (http://www.editage.com) for English language editing.

Supplementary materials

Supplementary data including six tables can be found with this article online at http://www.genominfo.org/src/sm/gni-15-69-s001.pdf.

Supplementary Table 1

Signature details based on Escherichia coli 53638 reference

gni-15-69-s001.pdf

Supplementary Table 2

Signature details based on Escherichia coli IHE3034 reference

gni-15-69-s002.pdf

Supplementary Table 3

Signature details based on Escherichia coli RN587/1 reference

gni-15-69-s003.pdf

Supplementary Table 4

Signature details based on Escherichia coli STEC_B2F1 reference

gni-15-69-s004.pdf

Supplementary Table 5

Signature details based on Escherichia coli STEC_C165_02 reference

gni-15-69-s005.pdf

Supplementary Table 6

Alphabetic abbreviations and descriptions of putative conserved domains

gni-15-69-s006.pdf

References

1. Lara-Tejero M, Galán JE. Cytolethal distending toxin: limited damage as a strategy to modulate cellular functions. Trends Microbiol 2002;10:147–152. PMID: 11864825.

2. Tóth I, Nougayrède JP, Dobrindt U, Ledger TN, Boury M, Morabito S, et al. Cytolethal distending toxin type I and type IV genes are framed with lambdoid prophage genes in extra-intestinal pathogenic Escherichia coli. Infect Immun 2009;77:492–500. PMID: 18981247.

3. Lara-Tejero M, Galán JE. A bacterial toxin that controls cell cycle progression as a deoxyribonuclease I-like protein. Science 2000;290:354–357. PMID: 11030657.

4. Agren J, Sundström A, Håfström T, Segerman B. Gegenees: fragmented alignment of multiple genomes for determining phylogenomic distances and genetic signatures unique for specified target groups. PLoS One 2012;7:e39107. PMID: 22723939.

5. Rokas A, Williams BL, King N, Carroll SB. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 2003;425:798–804. PMID: 14574403.

6. Dubchak I, Poliakov A, Kislyuk A, Brudno M. Multiple whole-genome alignments without a reference organism. Genome Res 2009;19:682–689. PMID: 19176791.

7. Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D. Cactus: algorithms for genome multiple sequence alignment. Genome Res 2011;21:1512–1528. PMID: 21665927.

8. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004;14:708–715. PMID: 15060014.

9. Rausch T, Emde AK, Weese D, Döring A, Notredame C, Reinert K. Segment-based multiple sequence alignment. Bioinformatics 2008;24:i187–i192. PMID: 18689823.

10. Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT. Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 2009;25:2071–2073. PMID: 19515959.

11. Darling AE, Mau B, Perna NT. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 2010;5:e11147. PMID: 20593022.

12. Darling AC, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004;14:1394–1403. PMID: 15231754.

13. Kloepper TH, Huson DH. Drawing explicit phylogenetic networks and their integration into SplitsTree. BMC Evol Biol 2008;8:22. PMID: 18218099.

14. Lukjancenko O, Wassenaar TM, Ussery DW. Comparison of 61 sequenced Escherichia coli genomes. Microb Ecol 2010;60:708–720. PMID: 20623278.

15. Gardner SN, Hall BG. When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes. PLoS One 2013;8:e81760. PMID: 24349125.

16. Asakura M, Hinenoya A, Alam MS, Shima K, Zahid SH, Shi L, et al. An inducible lambdoid prophage encoding cytolethal distending toxin (Cdt-I) and a type III effector protein in enteropathogenic Escherichia coli. Proc Natl Acad Sci U S A 2007;104:14483–14488. PMID: 17726095.

17. Johnson WM, Lior H. A new heat-labile cytolethal distending toxin (CLDT) produced by Escherichia coli isolates from clinical material. Microb Pathog 1988;4:103–113. PMID: 2849027.

18. Kim JH, Kim JC, Choo YA, Jang HC, Choi YH, Chung JK, et al. Detection of cytolethal distending toxin and other virulence characteristics of enteropathogenic Escherichia coli isolates from diarrheal patients in Republic of Korea. J Microbiol Biotechnol 2009;19:525–529. PMID: 19494702.

19. Oloomi M, Bouzari S. Molecular profile and genetic diversity of cytolethal distending toxin (CDT)-producing Escherichia coli isolates from diarrheal patients. APMIS 2008;116:125–132. PMID: 18321363.

20. Edwards DJ, Holt KE. Beginner's guide to comparative bacterial genome analysis using next-generation sequence data. Microb Inform Exp 2013;3:2. PMID: 23575213.

Fig. 1

Phylogenetic heat-plot overview of multiple-genome alignments. The heat-plot was generated with Gegenees software from a 200/100 BLASTN fragmented alignment. Six distinct genomic groups (T1–T6) recognized in cytolethal distending toxin (CDT)⁺ strains were observed sporadically among the strains that were studied, revealing the heterogeneous genomic nature of CDT-producing Escherichia coli.

Fig. 2

Phylogram overview. A phylogram was generated using SplitsTree 4 software, using the neighbor-joining method and a distance-matrix Nexus file exported from Gegenees software. The Escherichia albertii TW07627 and Escherichia fergusonii ATCC 35469 strains were set as out-groups. In addition, six unique groups (T1–T6) were analyzed. In the phylogenetic overview, a sporadic pattern of cytolethal distending toxin (CDT)–producing strains was observed, as were specific clades. These strains were related and their similarities were shown. CDT⁺ strains are shown in boxes. The Escherichia coli strains that were set as reference strains for biomarker-detection studies are indicated with red arrows.

Fig. 3

Biomarker regions. Biomarker regions were illustrated in the whole-genome sequences of five different reference strains including Escherichia coli 53638, E. coli IHE3034, E. coli RN587/1, E. coli STEC B2F1, and E. coli STEC C165-02. The biomarker score (max/average) setting was used. A score of 1.0 is the maximum biomarker score, which was considered to represent a signature sequence, as indicated in green. STEC, Shiga toxin-producing E. coli.

Table 1.

Strain characteristics

Strain	DNA length (Mb)	cdt gene	GC%	Protein count	Gene count	Genome type, No. of subsequences/contigs	Pathotype, serotype, other characteristic	Accession No.
Escherichia coli 96.0497	5.01426	＋	50.80	4,862	5,026	Draft, 13	Host: Homo sapiens, O91:H21	NZ_AEZQ00000000.2
Escherichia coli 3003	4.91733	＋	50.7	4,825	4,982	Draft, 8	I.S: water, O157:H45	NZ_AFAF00000000.2
Escherichia coli 5412	5.38651	＋	50.20	5,670	5,761	Draft, 373	Host: Homo sapiens, SFO157	NZ_AMUJ00000000.1
Escherichia coli 53638	5.37179	＋	50.99	4,803	5,218	Draft, 2	EIEC, O144	NZ_AAKB00000000.2
Escherichia coli ARS4.2123	4.98276	＋	50.50	5,105	5,194	Draft, 209	I.S: water, O157:H16	NZ_AMUL00000000.1
Escherichia coli DEC3F	5.4079	＋	50.30	5,541	5,692	Draft, 93	Host: Homo sapiens, SF EHEC O157:H	NZ_AIFJ00000000.1
Escherichia coli KTE11	4.52715	＋	50.50	4,109	4,214	Draft, 7	No published information	NZ_ANSR00000000.1
Escherichia coli KTE28	5.0544	＋	50.40	4,673	4,760	Draft, 12	No published information	NZ_ANSY00000000.1
Escherichia coli KTE47	4.98747	＋	50.60	4,694	4,798	Draft, 11	No published information	NZ_ANUB00000000.1
Escherichia coli KTE60	5.07079	＋	50.50	4,664	4,756	Draft, 20	No published information	NZ_ANUJ00000000.1
Escherichia coli KTE137	5.00154	＋	50.50	4,702	4,789	Draft, 99	No published information	NZ_ANYA00000000.1
Escherichia coli KTE178	5.30789	＋	50.60	4,973	5,050	Draft, 11	No published information	NZ_ANTB00000000.1
Escherichia coli KTE180	5.12548	＋	50.60	4,883	4,966	Draft, 112	No published information	NZ_ANYR00000000.1
Escherichia coli KTE209	5.11008	＋	50.50	4,702	4,791	Draft, 3	No published information	NZ_ANXD00000000.1
Escherichia coli MS 21-1	5.30899	＋	50.40	5,744	5,860	Draft, 206	No published information	NZ_ADTR00000000.1
Escherichia coli O157:H- 493-89	5.05482	＋	50.50	4,838	4,946	Draft, 204	Host: Homo sapiens, O157:H-	NZ_AETY00000000.1
Escherichia coli O157:H43 T22	4.95898	＋	50.80	4,859	4,935	Draft, 64	I.S: milk from healthy cattle, O157:H43	NZ_AHZD00000000.2
Escherichia coli RN587/1	5.06158	＋	50.60	4,999	5,108	Draft, 73	EPEC, O157:H8	NZ_ADUS00000000.1
Escherichia coli STEC B2F1	4.98941	＋	50.90	4,875	5,006	Draft, 37	STEC, O91:H21	NZ_AFDQ00000000.1
Escherichia coli STEC C165-02	5.00927	＋	50.60	4,891	5,019	Draft, 30	STEC, O73:H16	NZ_AFDR00000000.1
Escherichia coli TA271	5.07582	＋	50.70	5,081	5,197	Draft, 83	Host: some mammal	NZ_ADAZ00000000.1
Escherichia coli TW06591	5.47546	＋	50.30	5,521	5,650	Draft, 45	Host: Homo sapiens, O157:H-	NZ_AKLT00000000.1
Escherichia coli W26	5.11853	＋	50.60	4,852	4,920	Draft, 165	Host: cow, I.S: feces	NZ_AGIA00000000.1
Escherichia albertii TW07627	4.74659	＋	49.90	4,386	4,889	Draft, 43	Diarrheagenic	NZ_ABKX00000000.1
Escherichia coli APEC O1	5.49765	＋	50.29	4,853	4,968	Complete, 3	ExPEC, O1:K1:H7, avian pathogenic	NC_008563.1
Escherichia coli IHE3034	5.10838	＋	50.70	4,966	4,753	Complete, 1	ExPEC, O18:K1:H7, meningitis	NC_017628.1
Escherichia coli 042	5.35532	-	50.58	4,920	5,036	Complete, 2	EAEC, O44:H18	NC_017626.1
Escherichia coli 536	4.93892	-	50.50	4,619	4,779	Complete, 1	UPEC, O6:K15:H31	NC_008253.1
Escherichia coli 55989	5.15486	-	50.70	4,755	5,136	Complete, 1	EAEC	NC_011748.1
Escherichia coli ABU 83972	5.13296	-	50.60	4,795	4,905	Complete, 2	ExPEC UTI, OR:K5:H-	NC_017631.1
Escherichia coli APEC O78	4.79843	-	50.70	4,588	4,695	Complete, 1	ExPEC	NC_020163.1
Escherichia coli ATCC 8739	4.74622	-	50.90	4,199	4,408	Complete, 1	K12 derivative	NC_010468.1
Escherichia coli B REL606	4.62981	-	50.80	4,200	4,361	Complete, 1	Commensal, strain B	NC_012967.1
Escherichia coli BL21 DE3	4.55895	-	50.80	4,153	4,330	Complete, 1	Commensal, strain B	NC_012971.2
Escherichia coli BW2952	4.57816	-	50.80	4,079	4,262	Complete, 1	K12 derivative	NC_012759.1
Escherichia coli CFT073	5.23143	-	50.50	5,364	5,574	Complete, 1	ExPEC, UPEC, O6:K2:H1	NC_004431.1
Escherichia coli DH1	4.63071	-	50.80	4,160	4,375	Complete, 1	K12 derivative	NC_017625.1
Escherichia coli E24377A	5.24929	-	50.54	4,991	5,258	Complete, 7	ETEC, O139:H28	NC_009801.1
Escherichia coli ED1a	5.20955	-	50.70	4,911	5,321	Complete, 1	Commensal, O81	NC_011745.1
Escherichia coli ETEC H10407	5.32589	-	50.73	4,872	5,084	Complete, 5	ETEC, O78:H11	NC_017633.1
Escherichia coli HS	4.64354	-	50.80	4,374	4,626	Complete, 1	Commensal, O9	NC_009800.1
Escherichia coli IAI1	4.70056	-	50.80	4,345	4,629	Complete, 1	Commensal	NC_011741.1
Escherichia coli IAI39	5.13207	-	50.60	4,725	5,092	Complete, 1	ExPEC, UPEC, O7:K1	NC_011750.1
Escherichia coli JJ1886	5.30828	-	50.77	5,049	5,213	Complete, 6	ExPEC, UPEC	NC_022648.1
Escherichia coli K-12 DH10B	4.68614	-	50.80	4,124	4,352	Complete, 1	K12 derivative	NC_010473.1
Escherichia coli K-12 MG1655	4.64165	-	50.80	4,140	4,497	Complete, 1	Commensal, K12	NC_000913.3
Escherichia coli K-12 W3110	4.64633	-	50.80	4,213	4,436	Complete, 1	Commensal, K12	NC_007779.1
Escherichia coli KO11FL	5.02717	-	50.79	4,705	4,821	Complete, 2	Commensal	NC_017660.1
Escherichia coli LF82	4.77311	-	50.70	4,376	4,545	Complete, 1	AIEC	NC_011993.1
Escherichia coli LY180	4.8356	-	50.90	4,463	4,624	Complete, 1	Ethanologenic E. coli	NC_022364.1
Escherichia coli NA114	4.97146	-	51.20	4,873	4,975	Complete, 1	ExPEC, UPEC	NC_017644.1
Escherichia coli O7:K1 CE10	5.37873	-	50.58	5,080	5,269	Complete, 5	ExPEC, Neonatal meningitis, O7:K1	NC_017646.1
Escherichia coli O26:H11 11368	5.85553	-	50.66	5,515	5,985	Complete, 5	EHEC, O26:H11	NC_013361.1
Escherichia coli O55:H7 CB9615	5.45235	-	50.48	5,117	5,367	Complete, 2	EPEC, O55:H7	NC_013941.1
Escherichia coli O83:H1 NRG 857C	4.89488	-	50.71	4,582	4,690	Complete, 2	AIEC, O83:H1	NC_017634.1
Escherichia coli O103:H2 12009	5.52486	-	50.68	5,117	5,541	Complete, 2	EHEC, O103:H2	NC_013353.1
Escherichia coli O104:H4 2011C-3493	5.43741	-	50.63	5,149	5,269	Complete, 4	EAEC/STEC, O104:H4	NC_018658.1
Escherichia coli O111:H- 11128	5.76608	-	50.42	5,403	5,931	Complete, 6	EHEC, O111:H	NC_013364.1
Escherichia coli O127:H6 E2348 69	5.06968	-	50.55	4,647	5,011	Complete, 3	EPEC, O127:H6	NC_011601.1
Escherichia coli O157:H7 EC4115	5.70417	-	50.39	5,477	6,066	Complete, 3	EHEC, O157:H7	NC_011353.1
Escherichia coli O157:H7 EDL933	5.6394	-	50.45	5,772	5,920	Complete, 2	EHEC, O157:H7	NC_002655.2
Escherichia coli O157:H7 Sakai	5.59448	-	50.45	5,292	5,448	Complete, 3	EHEC, O157:H7	NC_002695.1
Escherichia coli O157:H7 TW14359	5.62274	-	50.46	5,363	5,586	Complete, 2	EHEC, O157:H7	NC_013008.1
Escherichia coli P12b	4.93529	-	50.90	4,379	4,567	Complete, 1	O15:H17	NC_017663.1
Escherichia coli PMV 1	5.21093	-	50.67	4,979	5,257	Complete, 2	ExPEC, O18:K1	NC_022370.1
Escherichia coli S88	5.16612	-	50.66	4,823	5,187	Complete, 2	ExPEC, Neonatal Meningitis, O45:K1:H7	NC_011742.1
Escherichia coli SE11	5.15563	-	50.75	4,996	5,103	Complete, 7	Commensal, O152:H28	NC_011415.1
Escherichia coli SE15	4.83968	-	50.71	4,486	4,592	Complete, 2	Commensal, O150:H5	NC_013654.1
Escherichia coli SMS-3-5	5.21538	-	50.50	4,912	5,127	Complete, 5	Environmental isolate	NC_010498.1
Escherichia coli UM146	5.10756	-	50.61	4,783	4,891	Complete, 2	AIEC (adherent invasive)	NC_017632.1
Escherichia coli UMN026	5.3582	-	50.64	5,010	5,294	Complete, 3	ExPEC, UPEC, O7:K1	NC_011751.1
Escherichia coli UMNK88	5.66676	-	50.74	5,607	5,754	Complete, 6	Porcine ETEC, O149	NC_017641.1
Escherichia coli UTI89	5.17997	-	50.61	5,162	5,272	Complete, 2	ExPEC, UPEC, O18:K1:H7	NC_007946.1
Escherichia coli W	5.00886	-	50.78	4,602	4,876	Complete, 3	Commensal, ATCC 9637	NC_017635.1
Escherichia coli Xuzhou21	5.51674	-	50.38	5,179	5,294	Complete, 3	EHEC, O157:H7	NC_017906.1
Escherichia fergusonii ATCC 35469	4.64386	-	49.88	4,314	4,543	Complete, 2	I.S: Feces, human	NC_011740.1

I.S, isolation source; EIEC, enteroinvasive E. coli; EHEC, enterohemorrhagic E. coli; EPEC, enteropathogenic E. coli; STEC, Shiga toxin-producing E. coli; ExPEC, extraintestinal pathogenic E. coli; EAEC, enteroaggregative E. coli; UPEC, uropathogenic E. coli; ETEC, enterotoxigenic E. coli; AIEC, adherent invasive E. coli.

Table 2.

Significant signature proteins in five reference Escherichia coli strains

Signature protein	Reference strain
Signature protein	Escherichia coli 53638	Escherichia coli IHE3034	Escherichia coli RN587/1	Escherichia coli STEC_B2F1	Escherichia coli STEC_C165_02
Cytolethal distending toxin	Cytolethal distending toxin A	Cytolethal distending toxin, subunit C	Cytolethal distending toxin A/C family protein	Cytolethal distending toxin C	Cytolethal distending toxin A/C family protein
	Cytolethal distending toxin B	Cytolethal distending toxin, subunit B		Cytolethal distending toxin A/C family protein
	Cytolethal distending toxin subunit C	Cytolethal distending toxin, subunit A		Cytolethal distending toxin A/C family protein
Holin	Phage holin, lambda family	Holin, lambda family	Holin	-^a	Phage holin, lambda family
Nuclease	Exodeoxyribonuclease 8	Exonuclease family protein	Exonuclease family protein	Endonuclease/Exonuclease/phosphatase family protein	Restriction endonuclease family protein
			Hypothetical protein ECRN5871_4153, [HNH endonuclease family protein]	Type I site-specific deoxyribonuclease, HsdR family	Type I site-specific deoxyribonuclease, HsdR family protein
					Hypothetical protein ECSTECC16502_0280, [HNH endonuclease]
					Endonuclease/Exonuclease/phosphatase family protein
Phage integrase	Phage integrase	Integrase/recombinase, phage integrase family	Integrase	Phage integrase family protein	Integrase
	Prophage integrase	Site-specific recombinase, phage integrase family	Phage integrase family protein	Prophage lambda integrase	Prophage CP4-57 integrase
	Integrase for prophage CP-933T		Prophage lambda integrase
	Integrase for prophage CP-933T		Integrase domain protein
Putative membrane protein	Putative membrane protein	Hypothetical protein ECOK1_2122, [membrane protein]	Outer membrane autotransporter barrel domain protein	Putative membrane protein	Putative membrane protein
	Hypothetical protein Ec53638_1156, [membrane protein]	Hypothetical protein ECOK1_2557,		Hypothetical protein ECSTECB2F1_3192, [membrane protein]
				OmpA-like transmembrane domain protein
				Outer membrane porin protein LC
				Outer membrane protein lom
Regulatory proteins	Phage regulatory protein Cro	Putative transcriptional regulator DicA157	Regulatory protein CII	Transcriptional regulator, AraC family	4-Hydroxyphenylacetate catabolism regulatory protein HpaA
	Transcriptional regulator, AlpA family	Putative regulatory protein Cox	Prophage CP4-57 regulatory protein family protein		Prophage CP4-57 regulatory protein family protein
	Putative phage regulatory protein, Rha family	Putative regulatory protein Cox	Transcriptional regulator, LacI family		Prophage CP4-57 regulatory protein family protein
Restriction-modification system	Putative type I restriction-modification system, S subunit	-^a	Type II restriction enzyme EcoRII	Type I restriction modification DNA specificity domain protein	Type I restriction-modification system specificity determinant
	Type I restriction-modification system specificity subunit		Modification methylase EcoRII		Type III restriction enzyme, res subunit
	Type I restriction-modification enzyme, R subunit		Type I restriction enzyme specificity protein
	Type I restriction-modification system, M subunit		Type I restriction-modification system, M subunit
Tail fiber assembly family, baseplate assembly proteins, tail fiber protein, and tail tape measure protein	Tail fiber assembly protein	Tail fiber protein	Caudovirales tail fiber assembly family protein	Tail fiber assembly	Caudovirales tail fiber assembly family protein
	Phage P2 baseplate assembly protein gpV	Phage tail tape measure protein		Hypothetical protein ECSTECB2F1_0901, [tail fiber assembly protein, Caudovirales tail fiber assembly protein]	Tail fiber
	Phage P2 baseplate assembly protein gpV		Hypothetical protein ECRN5871_3504, [tail fiber assembly protein]		Tail fiber domain protein
	Putative tail fiber protein		Baseplate assembly protein V, W	Caudovirales tail fiber assembly family protein	Phage tail fiber repeat family protein
	Tail fiber		Long tail fiber protein p37 domain protein	Prophage tail fiber family protein
	Phage tail tape measure protein family		Tail fiber domain protein	Phage tail fiber repeat family protein
	Phage tail tape measure protein family		Phage tail tape measure protein, TP901 family, core region	Phage tail tape measure protein, lambda family
Terminase	Phage terminase large subunit	-^a	Phage small terminase subunit	Phage terminase large subunit family protein	Terminase small subunit
	Terminase		Terminase, ATPase subunit		Terminase B protein domain protein
			Terminase, endonuclease subunit		Terminase B protein
			Terminase large subunit
			Terminase small subunit
Transferase	Pyruvyl transferase	-^a	Hypothetical protein ECRN5871_3051, [nucleotidyl transferase, PF08843 family]	Putative teichuronic acid biosynthesis glycosyltransferase tuaG	Acetyltransferase family protein
	Glycosyl transferase domain protein, group 2 family			Glucose-1-phosphate thymidylyltransferase	Hypothetical protein ECSTECC16502_1295, [acetyltransferase]
	Glycosyltransferase, sugar-binding region containing DXD motif		D12 class N6 adenine-specific DNA methyltransferase family protein	RTX toxin acyltransferase family protein
			Hypothetical protein ECRN5871_0025, [N-acetyltransferase CN5]	Acetyl-CoA acetyltransferase

^a The genome in question contains many annotated hypothetical proteins of unknown function whose roles have not yet been defined.