Genome-Based Virus Taxonomy with the ICTV Database Extension

Shinduck Kang; Young-Chang Kim

doi:10.5808/GI.2018.16.4.e22

Abstract

In 1966, the International Committee on Nomenclature of Viruses (ICNV) was established to standardize the naming of viruses. In 1975, the organization was renamed the “International Committee on Taxonomy of Viruses (ICTV),” by which it is still known today. The first virus classification provided by ICTV, in 1971, covered viruses infecting vertebrates and included 19 genera, 2 families, and 24 unclassified groups. Presently, the 10th virus taxonomy has been published. However, the early classification of viruses was based on clinical results “in vivo” and “in vitro,” as well as on the phenotype (shape) of the virus. With the development of next-generation sequencing and the accompanying bioinformatics analysis pipelines, a reconstruction of the classification system has been proposed. At a meeting held in Boston, USA, from June 9 to 11, 2016, there was an in-depth discussion of the classification of viruses using metagenomic data. One suggestion that arose from the meeting was that viral taxonomy should be reconstructed on the basis of genotype and “in silico” bioinformatics analysis. This article describes our efforts to achieve this goal by constructing a web-based system and extending an associated database, both based on ICTV taxonomy. The virus taxonomy web system was designed specifically to extend the virus taxonomy down to the strain and isolate levels. It is connected with the NCBI database to facilitate searches for specific viral genes, and it provides links to journals through the EMBL RESTful API, which improve accessibility for academic groups.

Keywords: ICTV DB extension, virus history, virus taxonomy searching web

Introduction

Presently, there are 3,279 virus reference genomes registered in NCBI. More than 1.8 million sequences are included in GenBank (https://www.ncbi.nlm.nih.gov/genbank/) [1]. The number of whole-genome sequences in GenBank is rapidly increasing, as shown in Fig. 1. Currently, only about 1,800 genome sequences have been assigned to species by the International Committee on Taxonomy of Viruses (ICTV); the remaining 1,400 sequences have not been classified as species. Although ICTV is responsible for viral classification, it does not have the capacity to immediately formulate naming conventions and taxonomy for the large number of viral sequences that are submitted to the organization.

However, with the advent of next-generation sequencing and the growth of NCBI GenBank data, the classical ICTV method of viral classification based on phenotypic parameters has been shifting toward genotype-based classification, which improves the speed and accuracy of virus taxonomy. Recently, a genotype-based metagenomic method was proposed as an approach to aid virus taxonomy [2]. This, in turn, creates a requirement for an appropriate data handling and analysis pipeline.

Since the development of web servers in 1993, bioinformatics data have been provided through browser-based systems. Many analysis tools, such as MSA, BLAST, and Genome Browser, have been developed for end users. In the case of ICTV taxonomy and naming, the initial ICTVdB was developed with flat data (DELTA: DEscription Language for TAxonomy), which was not connected to other databases. ICTVdB did not contain any sequence information but was used for phylogenetic analysis [3]. Presently, the 10th ICTV virus taxonomy has been published and is available on the ICTV website (http://ictv.global/report/). However, there is no easy way to reach NCBI GenBank data from ICTV taxonomy, or to obtain strain or isolate information for a selected viral species, because ICTV taxonomy extends only to the species level. In addition, the web-based ICTV taxonomy does not provide direct PubMed access, which would facilitate academic searches. Our virus taxonomy website therefore addresses these problems and extends the related tables in the ICTV database.

Methods

ICTV taxonomy and virus history

The gene sequences submitted to NCBI are recorded in GenBank format with a unique key formed by combining the accession number and the version number. The accession number consists of 1 letter and 5 numerals or 2 letters and 6 numerals for nucleotides, and 3 letters and 5 numerals for proteins. The GenBank format is structurally divided into meta information, feature information, and sequence information. Because this system targets viruses, only the “gbvrl” files among the NCBI GenBank data are collected and used (ftp://ftp.ncbi.nlm.nih.gov/genbank). As of 2017/12/20, the files “gbvrl1.seq.gz” through “gbvrl51.seq.gz” are available. However, GenBank data are highly redundant because of frequently overlapping submissions, so parsing and computing on the full collection is extremely inefficient. Therefore, NCBI provides RefSeq data to minimize redundancy, and there are presently 9,557 complete viral genomic RefSeq sequences. Meanwhile, under the International Committee on Nomenclature of Viruses (ICNV; the name before it was revised to ICTV), the first virus taxonomy of 1971 included 19 genera and 2 families (Papovaviridae and Picornaviridae), while 24 groups were left unassigned until the appropriate classification levels were determined [4]. In the current 10th virus taxonomy on the ICTV website, based on the final version (“ICTVMasterSpeciesList2016v1.3”), there are 4,404 species, whereas there are 9,556 complete genome sequences of viral species in GenBank RefSeq (Table 1).
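The accession formats described above can be illustrated with a short Python sketch. It encodes only the three patterns named in the text (other accession formats, such as RefSeq identifiers, are deliberately not covered), and the function names and the sample accessions are hypothetical, chosen for illustration only.

```python
import re

# Patterns follow the formats stated in the text only:
# nucleotide = 1 letter + 5 digits, or 2 letters + 6 digits;
# protein    = 3 letters + 5 digits.
NUCLEOTIDE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")
PROTEIN = re.compile(r"[A-Z]{3}\d{5}")


def classify_accession(accession):
    """Return 'nucleotide', 'protein', or None for an unversioned accession."""
    if NUCLEOTIDE.fullmatch(accession):
        return "nucleotide"
    if PROTEIN.fullmatch(accession):
        return "protein"
    return None


def split_key(key):
    """Split an 'accession.version' key into (accession, version)."""
    accession, _, version = key.partition(".")
    if not version.isdigit():
        raise ValueError(f"missing or non-numeric version in {key!r}")
    return accession, int(version)


if __name__ == "__main__":
    # Made-up identifiers, not real GenBank records.
    for name in ("X12345", "AB123456", "AAA12345", "12345"):
        print(name, classify_accession(name))
    print(split_key("AB123456.2"))
```

Keeping the accession and the version as separate values makes it straightforward to use the accession alone as a join key while still distinguishing revisions of the same record.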

The goal of our web-based system is to extend the basic information in the ICTV taxonomy database to include the strain and isolate levels and to provide raw genomic sequence data, as well as history and PubMed information, for user-chosen viruses. As a prerequisite, the 10th ICTV taxonomy, which is the most recent, must be parsed. However, ICTV does not provide taxonomy history through an open API. Thus, we collected the data on the taxonomy history and the linked node information via web scraping.

Virus taxonomy database

We collected the ICTV taxonomy from the “ICTV Master Species List,” which was officially released by ICTV in 2016 (Table 2); the taxonomy history was obtained by web scraping. Furthermore, in order to extend the resource with strain and isolate information and to connect it to the viral GenBank information, we downloaded “gbvrl1.seq.gz” through “gbvrl51.seq.gz” (the GenBank virus files) via FTP and classified the data according to ICTV taxonomy criteria. The classification table in the current ICTV database includes classification name, classification level, release number and year, classification ID (composed of 8 digits), the most recent classification change ID (composed of 8 digits), parent classification name, change status, and proposal documents [5].
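As a rough illustration of this collection step, the following Python sketch lists the “gbvrl” file names and pulls the accession, organism, strain, and isolate out of GenBank flat-file text. It is a minimal standard-library stand-in, not the parsing pipeline used in this work; the helper names and the sample record are invented for illustration, and qualifier values that wrap across several lines are not handled.

```python
import re

# Matches single-line qualifiers such as /strain="..." in the feature table.
QUALIFIER = re.compile(r'^\s+/(organism|strain|isolate)="([^"]*)"')


def gbvrl_file_names(last=51):
    """File names gbvrl1.seq.gz .. gbvrl<last>.seq.gz (51 as of 2017/12/20)."""
    return [f"gbvrl{i}.seq.gz" for i in range(1, last + 1)]


def parse_records(lines):
    """Yield one dict per GenBank record, keeping only the fields of interest."""
    record = {}
    for line in lines:
        if line.startswith("VERSION"):
            parts = line.split()
            if len(parts) > 1:
                record["accession_version"] = parts[1]
        elif line.startswith("//"):  # record terminator
            if record:
                yield record
            record = {}
        else:
            match = QUALIFIER.match(line)
            if match:
                # Keep the first occurrence of each qualifier in the record.
                record.setdefault(match.group(1), match.group(2))


# Invented record for demonstration; not a real GenBank entry.
SAMPLE = """\
LOCUS       AB123456     10 bp    RNA     linear   VRL 01-JAN-2017
ACCESSION   AB123456
VERSION     AB123456.1
FEATURES             Location/Qualifiers
     source          1..10
                     /organism="Measles morbillivirus"
                     /strain="example-strain"
ORIGIN
        1 nnnnnnnnnn
//
"""

if __name__ == "__main__":
    print(gbvrl_file_names()[:2])
    for rec in parse_records(SAMPLE.splitlines()):
        print(rec)
```

For the downloaded archives, the same generator could be fed from `gzip.open(path, "rt")`, which yields lines without decompressing the whole file to disk first.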

In our web-based system, however, the current ICTV database was redesigned and split into the tables listed in Table 3. Specifically, to enrich the database and to provide useful links to NCBI accessions, the NCBI Taxonomy items described in Table 3 and the items parsed by web scraping were built into the “ICTV history” and “ICTV Taxonomy” tables of Table 3, respectively. The “2016 ICTV Species” table consists of the data parsed from the ICTV Master Species List (2016, v1.3).
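To show how such tables might be linked through the NCBI accession, the sketch below builds two reduced tables in an in-memory SQLite database and joins them. SQLite is used here only as a convenient stand-in for the actual DBMS, the column subsets are simplified from the supplementary table specifications, and the accession and strain values are invented; the join itself is an assumption about how the tables are meant to be used.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with two simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tn_ictv_2016_v13 (
            taxon_node   VARCHAR(8) PRIMARY KEY,
            ictv_genus   VARCHAR(255),
            ictv_species VARCHAR(4096),
            ncbi_acces   VARCHAR(255)
        );
        CREATE TABLE ncbi_gb_refseq_taxon (
            ncbi_acces   VARCHAR(255) PRIMARY KEY,
            ncbi_strain  VARCHAR(255),
            ncbi_isolate VARCHAR(255)
        );
        """
    )
    # Node ID and names follow the Table 2 example; the accession is made up.
    conn.execute(
        "INSERT INTO tn_ictv_2016_v13 VALUES (?, ?, ?, ?)",
        ("20161044", "Morbillivirus", "Measles morbillivirus", "AB123456"),
    )
    conn.execute(
        "INSERT INTO ncbi_gb_refseq_taxon VALUES (?, ?, ?)",
        ("AB123456", "example-strain", None),
    )
    conn.commit()
    return conn


def strains_for_species(conn, species):
    """Return (species, accession, strain, isolate) rows for one ICTV species."""
    cursor = conn.execute(
        """
        SELECT s.ictv_species, n.ncbi_acces, n.ncbi_strain, n.ncbi_isolate
        FROM tn_ictv_2016_v13 AS s
        JOIN ncbi_gb_refseq_taxon AS n ON n.ncbi_acces = s.ncbi_acces
        WHERE s.ictv_species = ?
        ORDER BY n.ncbi_acces
        """,
        (species,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(strains_for_species(db, "Measles morbillivirus"))
```

The species name is passed as a bound parameter rather than interpolated into the SQL string, which matters when the value comes from a web form.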

Web construction

Our web-based system includes the traditional taxonomy (order, family, subfamily, genus, and species), as well as information on strain and isolate. Furthermore, users can easily access journal articles related to a published viral GenBank record via the EMBL RESTful protocol [6] and can directly download and reuse NCBI FASTA and “gbk” files via the Entrez open API [7]. The PubMed, strain, and isolate information for a chosen virus is keyed on the NCBI accession (Table 3). The journal search is invoked through HTTP GET parameters that carry the PubMed ID. Web scraping methods were used to build the “ICTV history” table in our database after extracting meaningful information from the NCBI raw data. The tables in our database form the foundation of the web system. The web system is implemented in Java with the Spring framework, which makes it independent of the operating system. In response to user commands, the internal parsing process is executed by pipelines implemented with the Biopython module. This process extracts the virus taxonomy, history, and reference article information from the XML data produced by the EMBL RESTful API and from the text files of the NCBI virus GenBank. The overall process map for the web system is described in Fig. 2.
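The two external requests mentioned above can be sketched as plain URL construction in Python. The Entrez part uses the public E-utilities `efetch` endpoint; the journal part assumes that the EMBL service in question is the Europe PMC REST search endpoint with an `EXT_ID`/`SRC:MED` query, which the text does not state explicitly. The function names, the accession, and the PubMed ID are hypothetical.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
# Assumption: the "EMBL RESTful API" refers to the Europe PMC search service.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def efetch_url(accession, rettype="fasta"):
    """Build an Entrez efetch URL for a nucleotide record (FASTA or GenBank)."""
    if rettype not in ("fasta", "gb"):
        raise ValueError("rettype must be 'fasta' or 'gb'")
    query = urlencode(
        {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def europe_pmc_url(pubmed_id):
    """Build a GET URL that looks up one article by PubMed ID, as XML."""
    query = urlencode(
        {"query": f"EXT_ID:{pubmed_id} AND SRC:MED", "format": "xml"}
    )
    return f"{EUROPE_PMC_SEARCH}?{query}"


if __name__ == "__main__":
    # Made-up identifiers, used only to show the resulting URLs.
    print(efetch_url("AB123456"))
    print(efetch_url("AB123456", rettype="gb"))
    print(europe_pmc_url(12345))
```

Building the URLs separately from fetching them keeps the lookup logic testable without network access; the actual download could then be done with any HTTP client.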

Results and Discussion

The aim of this study was to develop and evaluate a computerized system that integrates virus taxonomy with bioinformatics. Specifically, we focused on implementing an environment that extends the capabilities of the ICTV web system and connects to PubMed in order to enhance searches performed by academic groups. We extended and rebuilt the database and extracted meaningful data using a pipeline that parses XML, text, and web contents. This computerized system will continue to be extended and used as a web tool that can detect new viral types and classify them rapidly and accurately. Recently, a new virus classification system based on metagenomics has been proposed. Web-based virus taxonomy could therefore be improved by adding virus classifications derived from viral metagenomics analysis [8]. We suggest that the web system, analytical pipelines, and extended database described herein could be used to add these metagenomics data to ICTV taxonomy data.

Supplement. Detailed description of ICTV Extension DB

Supplementary data can be found with this article online at https://doi.org/10.5808/GI.2018.16.4.e22.

1. Table Specification

1-1. Table Specification of “ICTV History” (Table 3)

Table Specification
System Name	SVI	Date	2018.03.20		Registrant		S.D.Kang
Table Name	TN_ICTV_HIS
Table Description	ICTV History Description Table
Column	Type	Length	NULL	PK	FK	Default	Column Description
taxon_node	VARCHAR	8	Y				Taxonomy Node
ictv_nm	VARCHAR	22	Y				Virus Name
ictv_taxon_new	VARCHAR	150	Y				Taxonomy New Name
mod_year	VARCHAR	4	Y				Modification Year
mod_status	VARCHAR	22	Y				Modification Status
ictv_taxon_old	VARCHAR	150	Y				Taxonomy Old Name
ictv_taxon_ref	VARCHAR	150	Y				Taxonomy Reference

Index Definition
NO	Index		Column ID		Order

DDL
CREATE TABLE tn_ictv_his (
    taxon_node character varying(8) COLLATE utf8_bin,
    ictv_nm character varying(22) COLLATE utf8_bin,
    ictv_taxon_new character varying(150) COLLATE utf8_bin,
    mod_year character varying(4) COLLATE utf8_bin,
    mod_status character varying(22) COLLATE utf8_bin,
    ictv_taxon_old character varying(150) COLLATE utf8_bin,
    ictv_taxon_ref character varying(150) COLLATE utf8_bin
) COLLATE utf8_bin REUSE_OID;

1-2. Table Specification of “2016 ICTV Species” (Table 3)

Table Specification
System Name	SVI	Date	2018.03.20		Registrant		S.D.Kang
Table Name	TN_ICTV_2016_v13
Table Description	2016 ICTV Species List (Version 13)
Column	Type	Length	NULL	PK	FK	Default	Column Description
ictv_order	VARCHAR	255	Y				Order Name
ictv_family	VARCHAR	255	Y				Family Name
ictv_subfamily	VARCHAR	255	Y				Subfamily Name
ictv_genus	VARCHAR	255	Y				Genus Name
ictv_species	VARCHAR	4096	Y				Species Name
ictv_main_species	CHAR	1	Y				Main Species
ncbi_acces	VARCHAR	255	Y				NCBI Accession
isolate	VARCHAR	255	Y				NCBI Isolation
gene_com	VARCHAR	255	Y				Genome Composition
ictv_status	VARCHAR	255	Y				Last Change Status
proposal	VARCHAR	500	Y				Proposal Link
taxon_node	VARCHAR	8		Y			Taxonomy Node ID

Index Definition
NO	Index		Column ID		Order
1	pk		taxon_node		1

DDL
CREATE TABLE tn_ictv_2016_v13 (
    ictv_order character varying(255) COLLATE utf8_bin,
    ictv_family character varying(255) COLLATE utf8_bin,
    ictv_subfamily character varying(255) COLLATE utf8_bin,
    ictv_genus character varying(255) COLLATE utf8_bin,
    ictv_species character varying(4096) COLLATE utf8_bin,
    ictv_main_species character(1) COLLATE utf8_bin,
    ncbi_acces character varying(255) COLLATE utf8_bin,
    isolate character varying(255) COLLATE utf8_bin,
    gene_com character varying(255) COLLATE utf8_bin,
    ictv_status character varying(255) COLLATE utf8_bin,
    proposal character varying(500) COLLATE utf8_bin,
    taxon_node character varying(8) COLLATE utf8_bin NOT NULL
) COLLATE utf8_bin REUSE_OID;
ALTER TABLE tn_ictv_2016_v13 ADD CONSTRAINT pk PRIMARY KEY (taxon_node);

1-3. Table Specification of “ICTV Taxonomy Nodes” (Table 3)

Table Specification
System Name	SVI	Date	2018.03.20		Registrant		S.D.Kang
Table Name	ICTV_TAXON_NODE_ID
Table Description	Taxonomy ID for All Virus Taxonomy Name
Column	Type	Length	NULL	PK	FK	Default	Column Description
taxon_node	VARCHAR	8		Y			ICTV Taxonomy Node ID
class_nm	VARCHAR	12	Y				Class Name
ictv_nm	VARCHAR	22	Y				Virus Name

Index Definition
NO	Index		Column ID		Order
1	pk		taxon_node		1

DDL
CREATE TABLE ictv_taxon_node_id (
    taxon_node character varying(8) COLLATE utf8_bin NOT NULL,
    class_nm character varying(12) COLLATE utf8_bin,
    ictv_nm character varying(22) COLLATE utf8_bin
) COLLATE utf8_bin REUSE_OID;
ALTER TABLE ictv_taxon_node_id ADD CONSTRAINT pk PRIMARY KEY (taxon_node);

1-4. Table Specification of “NCBI Taxonomy” (Table 3)

Table Specification
System Name	SVI	Date	2018.03.20		Registrant		S.D.Kang
Table Name	NCBI_GB_STRAIN_ISOLATION
Table Description	Taxonomy, Species, Strain and Isolation Info for NCBI GenBank
Column	Type	Length	NULL	PK	FK	Default	Column Description
ncbi_acces	VARCHAR	255		Y
ncbi_genus	VARCHAR	255	Y
ncbi_species	VARCHAR	4096	Y
ncbi_taxon	VARCHAR	4096	Y
ncbi_strain	VARCHAR	255	Y
ncbi_isolate	VARCHAR	255	Y

Index Definition
NO	Index		Column ID		Order
1	pk		ncbi_acces		1

DDL
CREATE TABLE ncbi_gb_refseq_taxon (
    ncbi_acces character varying(255) NOT NULL,
    ncbi_genus character varying(255),
    ncbi_species character varying(4096),
    ncbi_taxon character varying(4096),
    ncbi_strain character varying(255),
    ncbi_isolate character varying(255)
) COLLATE iso88591_bin;
ALTER TABLE ncbi_gb_refseq_taxon ADD CONSTRAINT pk PRIMARY KEY (ncbi_acces);

2. System Diagram

System and Data ppub Diagram (Used Cubrid 9.x DBMS)

ICTV items	Example 1	Example 2
Taxon name	Measles virus	Measles morbillivirus
Taxon level	Species	Species
Release number	30	31
Release year	2015	2016
Taxon ID (stable)	19750163	19750163
Node ID (new with each release)	20151044	20161044
Parent taxon	Morbillivirus	Morbillivirus
Last change	Move	Rename
Proposal	2015.Pneumoviridae.pdf	2016.Paramyxoviridaespren.pdf

ICTV history	2016 ICTV species	ICTV taxonomy	NCBI taxonomy
ICTV taxon node	ICTV order	Taxon node (PK)	NCBI accession (PK)
ICTV name	ICTV family	Taxon type	NCBI genus
ICTV new taxon	ICTV subfamily	Taxon name	NCBI species
Modification year	ICTV genus		NCBI taxon
Modification status	ICTV species		NCBI strain
ICTV old taxon	ICTV main species		NCBI isolate
ICTV proposal	NCBI accession		
	Isolation name		
	Gene type		
	ICTV status		
	ICTV proposal		