Gene sequences submitted to NCBI are recorded in GenBank format under a unique key formed by combining the accession number and the version number. The accession number consists of 1 letter and 5 numerals or 2 letters and 6 numerals for nucleotides, and 3 letters and 5 numerals for proteins. The GenBank format is structurally divided into meta information, feature information, and sequence information. Because this system targets viruses, only the “gbvrl” files among the NCBI GenBank data are collected and used (ftp://ftp.ncbi.nlm.nih.gov/genbank). As of 2017/12/20, these files range from “gbvrl1.seq.gz” to “gbvrl51.seq.gz”. However, GenBank data are highly redundant because submissions frequently overlap, which makes parsing and downstream computation on the collected data extremely inefficient. To minimize this redundancy, NCBI provides RefSeq data, which presently include 9,557 complete viral genomic RefSeq sequences. Meanwhile, the first virus taxonomy (1971) of the International Committee on Nomenclature of Viruses (ICNV; the name used before it was revised to ICTV) included 19 genera and 2 families (Papovaviridae and Picornaviridae), while 24 groups were left unassigned until the appropriate classification levels were determined [4]. In the current 10th virus taxonomy on the ICTV website, based on the final version (“ICTVMasterSpeciesList2016v1.3”), there are 4,404 species, whereas there are 9,556 complete genome sequences of viral species in GenBank RefSeq (Table 1).
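The record key and the three-part record layout described above can be illustrated with a short Python sketch. It is not the system's actual parser. The names `parse_key`, `split_record`, and `RecordKey` are hypothetical, the accession patterns cover only the formats named in the text, and the section boundaries assume the usual flat-file keywords (`FEATURES`, `ORIGIN`, and the `//` terminator).

```python
import re
from typing import Dict, List, NamedTuple, Optional

# Only the accession formats named in the text are covered here.
# RefSeq-style accessions with an underscore prefix are out of scope.
NUCLEOTIDE = re.compile(r"^(?:[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6})$")
PROTEIN = re.compile(r"^[A-Z]{3}[0-9]{5}$")


class RecordKey(NamedTuple):
    accession: str
    version: int
    molecule: str  # "nucleotide" or "protein"


def parse_key(key: str) -> Optional[RecordKey]:
    """Split an 'ACCESSION.VERSION' key; return None if it does not fit."""
    accession, sep, version = key.strip().partition(".")
    if not sep or not re.fullmatch(r"[0-9]+", version):
        return None
    if NUCLEOTIDE.match(accession):
        return RecordKey(accession, int(version), "nucleotide")
    if PROTEIN.match(accession):
        return RecordKey(accession, int(version), "protein")
    return None


def split_record(text: str) -> Dict[str, List[str]]:
    """Split one flat-file record into meta, feature, and sequence lines."""
    parts: Dict[str, List[str]] = {"meta": [], "features": [], "sequence": []}
    current = "meta"
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        parts[current].append(line)
    return parts
```

Under these assumptions, `parse_key("AB123456.2")` yields the accession `AB123456` with version `2`, classified as a nucleotide record, while a key without a version suffix is rejected.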
The goal of our web-based system is to extend the basic information in the ICTV taxonomy database to include strain and isolate groups and to provide raw genomic sequence data, as well as taxonomy history and PubMed information, for user-selected viruses. As a prerequisite, the most recent (10th) ICTV taxonomy must be parsed. However, ICTV does not provide the taxonomy history through an open API. We therefore collected the taxonomy history and the linked node information via web scraping.
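The text does not describe the layout of the scraped pages or the scraper itself, so the following Python sketch only illustrates the general idea of turning an HTML table into records with the standard library. The markup in `SAMPLE`, its column names, and its placeholder values are all hypothetical and are not taken from the ICTV website. Fetching the page is omitted.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:  # keep text only while inside a cell
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def history_rows(html: str) -> List[Dict[str, str]]:
    """Map each body row to a dict keyed by the header row's cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical markup with placeholder values, for illustration only.
SAMPLE = """
<table>
  <tr><th>year</th><th>change</th></tr>
  <tr><td>YEAR_1</td><td>CHANGE_1</td></tr>
  <tr><td>YEAR_2</td><td>CHANGE_2</td></tr>
</table>
"""
```

Calling `history_rows(SAMPLE)` returns one dictionary per body row. A real page would need its own tag and attribute checks, and a dedicated HTML library may be a better fit for irregular markup.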