Introduction
Biomedical text mining (also known as BioNLP) refers to text mining applied to texts and literature of the biomedical and molecular biology domain [1, 2]. Biomedical text mining requires a corpus, that is, a large and structured set of texts that is stored and processed electronically [3].
The full text of Genomics & Informatics in Portable Document Format (PDF) has been archived on the Genomics & Informatics home page since 2003 [4], where the content of the journal is available immediately upon publication without an embargo period. As of July 18, 2018, 499 full-text articles are available as a corpus resource under the terms of the Creative Commons Attribution Non-Commercial license [5, 6]. To make a corpus more useful for biomedical text mining, it is subjected to a process known as annotation, the practice of adding interpretative linguistic information to the corpus.
Therefore, in this study, we report on the development of a new corpus, the GNI Corpus, together with statistics on the annotated objects of the journal. The initial objectives in developing the GNI Corpus were to analyze word frequency counts and to analyze current trends in the journal.
The Text Preprocessing and Annotation Framework
Initially, we wrote a simple Python-based Web crawler to browse and download PDF files from the Genomics & Informatics archives. We then converted them into plain text files using PDFMiner or optical character recognition (OCR) tools [7]. The goal was to transform the text of each PDF, which may be stored as an image, into machine-readable text.
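The crawl-and-convert step described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: the archive HTML, the base URL, and the function names are hypothetical, and the conversion helper assumes the `pdfminer.six` distribution of PDFMiner is installed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of anchors that point to PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links whose target ends in ".pdf" (case-insensitive).
        if href and href.lower().endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return the PDF URLs found in one archive page."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


def pdf_to_text(pdf_path):
    """Extract the text layer of a PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so that link collection works without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


# Hypothetical archive page and base URL, for illustration only.
SAMPLE_HTML = '<a href="/files/gni-01.pdf">PDF</a> <a href="/about.html">About</a>'
print(find_pdf_links(SAMPLE_HTML, "https://example.org/archive/"))
```

Note that `extract_text` reads only an embedded text layer; a PDF that contains scanned page images would still need a separate OCR tool.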
The next step was annotation. According to Bernardi et al. (2002) [8], the biological literature is characterized by heavy use of domain-specific terminology, wherein more than 12% of the words found in biochemistry publications are technical terms. Therefore, we used the NLTK module for general text processing [9, 10] and the GENIA tagger for recognizing biological terms [11–13]. The annotation result for an example sentence from our dataset, “Most RFLP markers (80%) were pepper-derived clones and these markers were evenly distributed all over the genome.” is as follows:
(‘Most’, ‘Most’, ‘JJS’, ‘B-NP’, ‘O’)
(‘RFLP’, ‘RFLP’, ‘NN’, ‘I-NP’, ‘B-DNA’)
(‘markers’, ‘marker’, ‘NNS’, ‘I-NP’, ‘I-DNA’)
(‘(’, ‘(’, ‘(’, ‘O’, ‘O’)
(‘80’, ‘80’, ‘CD’, ‘B-NP’, ‘O’)
(‘%’, ‘%’, ‘NN’, ‘I-NP’, ‘O’)
(‘)’, ‘)’, ‘)’, ‘O’, ‘O’)
(‘were’, ‘be’, ‘VBD’, ‘B-VP’, ‘O’)
(‘pepper-derived’, ‘pepper-derived’, ‘JJ’, ‘B-NP’, ‘B-cell_line’)
(‘clones’, ‘clone’, ‘NNS’, ‘I-NP’, ‘I-cell_line’)
(‘and’, ‘and’, ‘CC’, ‘O’, ‘O’)
(‘these’, ‘these’, ‘DT’, ‘B-NP’, ‘O’)
(‘markers’, ‘marker’, ‘NNS’, ‘I-NP’, ‘O’)
(‘were’, ‘be’, ‘VBD’, ‘B-VP’, ‘O’)
(‘evenly’, ‘evenly’, ‘RB’, ‘I-VP’, ‘O’)
(‘distributed’, ‘distribute’, ‘VBN’, ‘I-VP’, ‘O’)
(‘all’, ‘all’, ‘DT’, ‘B-ADVP’, ‘O’)
(‘over’, ‘over’, ‘IN’, ‘B-PP’, ‘O’)
(‘the’, ‘the’, ‘DT’, ‘B-NP’, ‘O’)
(‘genome’, ‘genome’, ‘NN’, ‘I-NP’, ‘O’)
(‘.’, ‘.’, ‘.’, ‘O’, ‘O’)
Four different levels of tags are attached to each word in the example sentence: base form, POS tag, chunk tag, and named-entity tag. For example, (‘RFLP’, ‘RFLP’, ‘NN’, ‘I-NP’, ‘B-DNA’) indicates that the base form of the word RFLP (restriction fragment length polymorphism) is ‘RFLP’, that its part of speech is a noun (‘NN’), that the word is internal to a noun phrase (‘I-NP’), and that it begins a DNA term (‘B-DNA’).
Specifically, the first tag is a morphological tag that gives the base form of a word. The second tag (based on the Penn Treebank tag set [14]) is a grammatical part-of-speech (POS) tag, needed to analyze a sentence by identifying its constituent parts, such as nouns, verbs, and adjectives. The third tag is a syntactic-level tag that links POS tags to higher-order units termed chunks, which have discrete grammatical meanings, such as noun phrases, verb phrases, or other grammatical phrases [15]. For chunk tags, IOB notation was used, wherein B, I, and O mark the beginning of a phrase (B), a position internal to a phrase (I), and a position outside any phrase (O) [16]. The last tag is a semantic-level tag that classifies named entities in the text into predefined categories such as proteins, DNAs, RNAs, cell lines, and cell types [17, 18].
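To illustrate how the B/I/O prefixes delimit a term, the sketch below groups consecutive named-entity tags into spans. It is an illustrative reading of the five-field tuples shown in the example sentence, not part of the GENIA tagger or NLTK; the `Token` type and the `extract_entities` helper are hypothetical names introduced here.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """One annotated word: surface form plus the four tag levels."""
    word: str
    base: str
    pos: str
    chunk: str
    entity: str


def extract_entities(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Group B-/I- named-entity tags into (text, class) spans."""
    entities = []
    words, label = [], None
    for tok in tokens:
        prefix, _, cls = tok.entity.partition("-")
        if prefix == "I" and cls == label:
            # Continuation of the entity that is currently open.
            words.append(tok.word)
            continue
        if words:
            # Any other tag closes the open entity.
            entities.append((" ".join(words), label))
            words, label = [], None
        if prefix == "B":
            # A B- tag opens a new entity; a stray I- tag is ignored.
            words, label = [tok.word], cls
    if words:
        entities.append((" ".join(words), label))
    return entities


# First part of the example sentence, copied from the annotation above.
SAMPLE = [
    Token("Most", "Most", "JJS", "B-NP", "O"),
    Token("RFLP", "RFLP", "NN", "I-NP", "B-DNA"),
    Token("markers", "marker", "NNS", "I-NP", "I-DNA"),
    Token("were", "be", "VBD", "B-VP", "O"),
    Token("pepper-derived", "pepper-derived", "JJ", "B-NP", "B-cell_line"),
    Token("clones", "clone", "NNS", "I-NP", "I-cell_line"),
]
print(extract_entities(SAMPLE))
```

On this sample the function yields two spans, "RFLP markers" as a DNA term and "pepper-derived clones" as a cell line term, matching the B- and I- tags in the annotation.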
The Current Status of the GNI Corpus
Presently, we have annotated 499 full texts of Genomics & Informatics. Among 2,867,430 words, we have marked up 88,629 names with different semantic classes, comprising 77,626 protein, 7,293 DNA, 1,436 RNA, 226 cell line, and 2,048 cell type tags.
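Statistics of this kind can be tallied from the named-entity column of the annotation. The sketch below assumes that each name is counted once at its B- tag, which the text does not state explicitly. The class labels `protein`, `RNA`, and `cell_type` are assumed by analogy with the `DNA` and `cell_line` labels in the example sentence, and the helper name is hypothetical. The sketch also checks that the reported per-class figures sum to the reported total.

```python
from collections import Counter


def count_entity_classes(entity_tags):
    """Count one entity per B- tag, keyed by semantic class."""
    return Counter(tag[2:] for tag in entity_tags if tag.startswith("B-"))


# Per-class counts and totals as reported in the text.
REPORTED = {
    "protein": 77626,
    "DNA": 7293,
    "RNA": 1436,
    "cell_line": 226,
    "cell_type": 2048,
}
TOTAL_NAMES = 88629
TOTAL_WORDS = 2867430

# The per-class figures add up to the reported number of names.
assert sum(REPORTED.values()) == TOTAL_NAMES

# Share of names relative to all words in the corpus.
print(f"{TOTAL_NAMES / TOTAL_WORDS:.2%}")

# Tiny illustrative tag sequence (not taken from the corpus).
print(count_entity_classes(["O", "B-DNA", "I-DNA", "B-protein", "B-DNA"]))
```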
The GNI Corpus will be continually updated in quantity and quality, both manually and automatically. Development of our own POS tagger is underway. Future work also includes enhancing the existing GENIA ontology and co-reference structures.