The literature on coronavirus disease 2019 (COVID-19) has been increasing dramatically, and the growing volume of text makes large-scale text mining and knowledge discovery possible. Curation of these texts has therefore become a crucial issue for the Bio-medical Natural Language Processing (BioNLP) community in retrieving important information about the mechanism of COVID-19. PubAnnotation is an aligned annotation system that provides an efficient platform for biological curators to upload their annotations or merge other external annotations. Inspired by the integration of multiple useful COVID-19 annotations, we merged three annotation resources onto the LitCovid data set and constructed a cross-annotated corpus, LitCovid-AGAC. This corpus consists of 12 labels (Mutation, Species, Gene, and Disease from PubTator; GO and CHEBI from OGER; and Var, MPA, CPA, NegReg, PosReg, and Reg from AGAC) over 50,018 COVID-19 abstracts in LitCovid. It contains sufficiently abundant information to help unveil the hidden knowledge in the pathological mechanism of COVID-19.
COVID-19 is the abbreviation for coronavirus disease 2019, which emerged in 2019 and caused a pandemic. People infected with COVID-19 suffer from severe high fever, dyspnea, and lung disease, with a 0.3%‒1.5% chance of death. Because of the severe conditions that COVID-19 causes, research on the disease has been increasing dramatically. As of January 2021, over 90,000 related articles had been published, forming a huge repository for knowledge discovery. Such rapid growth makes it difficult for researchers to keep up with the massive amount of information.
Understanding the mechanism of COVID-19 is important for containing the virus. Like the severe acute respiratory syndrome virus, the virus that causes COVID-19 enters cells by binding its S protein to the angiotensin-converting enzyme 2 (ACE2) protein on the surface of human cells. The S protein is located in the outermost layer of the virus and exists as a trimer. Each monomer contains a receptor binding domain, composed of amino acids, where the S protein binds to ACE2 and infects human cells.
Compared with the full picture of the COVID-19 mechanism, the above common knowledge is far from sufficient. To unveil the mechanism hidden in the huge volume of text data, the application of text mining has recently drawn a good deal of attention. So far, nearly 200 studies on COVID-19 literature mining have been published in PubMed. To propel COVID-19‒oriented text mining research, NCBI developed a huge publicly available COVID-19 corpus, LitCovid [
Fortunately, the Bio-medical Natural Language Processing (BioNLP) community has long focused on developing fundamental tools, including bio-medical entity recognition, entity concept normalization, relation extraction, and so forth. For PubMed abstracts and PMC full texts, PubTator [
For example, PubTator is a search database, built on PubMed results, that highlights certain keywords in the search results. PubTator supports six tag types: gene, disease, chemical, mutation, species, and cell line. These six kinds of tags are already very useful for unveiling the hidden mechanism of COVID-19. LitCovid is a reliable corpus that collects texts related to COVID-19. Therefore, when PubTator annotates the LitCovid corpus, the six types of biological entities in the text are each assigned a corresponding tag. Moreover, the OntoGene’s Bio-medical Entity Recognizer (OGER) [
AGAC addresses the need for logic mining because it is good at discovering Regulation relations, which makes it easier to reveal Pathway-like logic. In this research, we release the LitCovid-AGAC database, which provides multiple annotations from PubTator, OGER, and AGAC.
The purpose of designing AGAC [
An AGAC tagger based on a deep neural network was introduced as a baseline method in the AGAC track of BioNLP OST 2019. The baseline made full use of the BERT architecture and reached sufficiently high quality for sequence labeling [
PubAnnotation [
As shown in
Different corpora have different annotation focuses. Other corpora mainly label biological concepts and match them to standard data sets. AGAC, however, focuses not only on biological concepts but also on the logical lines in sentences. The same biological concept may be given different labels in different contexts, or may not be labeled at all. In this way, we can find that some chemicals up-regulate or down-regulate gene expression in COVID-19.
By integrating the methods mentioned above, we ran an automatic annotation pipeline to obtain the LitCovid-AGAC dataset.
Step 1. Data collection: Collect literature data set from LitCovid [
Step 2. AGAC annotation: Obtain the AGAC annotations by applying the AGAC tagger to the literature set.
Step 3. Regulation annotation: Create a regulation dictionary on PubDictionary [
Step 4. PubTator and OGER annotation: Import the annotations from PubTator and OGER by using PubAnnotation.
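The merging idea behind these steps can be sketched locally. The following is a minimal illustration, not the pipeline actually used (which relies on PubAnnotation itself): it assumes that each source delivers annotations in a PubAnnotation-style JSON layout, that is, a text plus a list of denotations with a character `span` and an `obj` label. The function name `merge_annotations`, the ID scheme, and the placeholder sentence are hypothetical.

```python
def merge_annotations(text, sources):
    """Merge denotations from several sources onto one text.

    `sources` maps a source name (e.g. "PubTator") to a list of
    denotations shaped like {"span": {"begin": int, "end": int}, "obj": str}.
    This layout is an assumption modelled on PubAnnotation-style JSON.
    """
    merged = {"text": text, "denotations": []}
    for source_name, denotations in sources.items():
        for i, d in enumerate(denotations, start=1):
            begin, end = d["span"]["begin"], d["span"]["end"]
            # Reject spans that do not fit the text, so that all
            # sources stay aligned to the same character offsets.
            if not (0 <= begin < end <= len(text)):
                raise ValueError(f"{source_name}: span {begin}-{end} is out of range")
            merged["denotations"].append({
                "id": f"{source_name}_T{i}",  # hypothetical ID scheme
                "span": {"begin": begin, "end": end},
                "obj": d["obj"],
            })
    # Order by position so overlapping annotations sit next to each other.
    merged["denotations"].sort(key=lambda d: (d["span"]["begin"], d["span"]["end"]))
    return merged


# Placeholder sentence and annotations, for illustration only.
text = "gene A inhibits process B"
sources = {
    "PubTator": [{"span": {"begin": 0, "end": 6}, "obj": "Gene"}],
    "OGER": [{"span": {"begin": 16, "end": 25}, "obj": "GO"}],
    "AGAC": [
        {"span": {"begin": 7, "end": 15}, "obj": "NegReg"},
        {"span": {"begin": 16, "end": 25}, "obj": "MPA"},
    ],
}
merged = merge_annotations(text, sources)
for d in merged["denotations"]:
    span = d["span"]
    print(d["id"], d["obj"], repr(text[span["begin"]:span["end"]]))
```

Prefixing each ID with its source keeps the origin of every annotation visible after merging, which matters when two sources label the same span differently, as OGER and AGAC do for the last span here.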
LitCovid-AGAC contains 50,018 abstracts from PubMed, and the annotations come from three sources: AGAC, PubTator, and OGER. LitCovid-AGAC focuses on the regulation of biological processes described in the COVID-19 literature. Therefore, we applied all the AGAC labels, which comprise 5 biological concept labels and 3 regulation labels. To enrich the related annotations, Mutation, Species, Gene, Disease from PubTator and GO, Chemical Entities of Biological Interest (CHEBI) [
The annotations from OGER and PubTator are clearly more abundant; by contrast, the number of AGAC annotations is not of the same order of magnitude. This is because, under the AGAC annotation rules, a sentence without a description of regulation is not annotated, so AGAC yields fewer annotations than the other sources. More detailed statistics are shown in
Enriched by PubTator and OGER, the data set contained more complete annotations. For instance, in
Besides, the annotations also unveil the molecule-level biological processes. In
With the annotations in the LitCovid-AGAC data set, the genes, diseases, variations, and biological processes at the cellular and molecular levels are connected by the regulation labels in the same sentence. Combined with the semantic information, the sequential order of the regulation events helps to convert them into a directional path that treats the regulation labels as edges and the other labels as nodes. For example, the path in
Combining the annotations from different articles can yield a complete logical line. The D614G mutation of the spike gene (S gene) in
Combined with the contents of four pictures, we drew the
This example reflects information not only at the molecular level but also at the cellular level, which demonstrates the feasibility of finding and forming a logical line from different texts. We therefore propose that, when enough texts are available, the key knowledge can be extracted from the massive information and assembled into a large logical network. As a result, more hidden information can be discovered and new knowledge can be inferred.
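The conversion from annotated sentences to a directional path can be illustrated with a short sketch. It rests on a simplifying assumption that is not taken from the corpus guidelines: within a sentence, each regulation label links the nearest preceding non-regulation annotation to the next non-regulation annotation. The function names and the placeholder entities ("variant X", "process Y", "process Z") are hypothetical.

```python
from collections import defaultdict

REGULATION_LABELS = {"Reg", "PosReg", "NegReg"}


def sentence_to_edges(annotations):
    """Turn (text, label) pairs, given in sentence order, into edges.

    Assumed heuristic: a regulation label connects the closest entity
    before it (source) to the first entity after it (target).
    """
    edges = []
    source = None
    pending = None  # regulation label still waiting for its target
    for text, label in annotations:
        if label in REGULATION_LABELS:
            if source is not None:
                pending = label
        else:
            if pending is not None:
                edges.append((source, pending, text))
                pending = None
            source = text
    return edges


def build_graph(sentences):
    """Merge the edges of many sentences into one adjacency mapping."""
    graph = defaultdict(list)
    for annotations in sentences:
        for src, reg, dst in sentence_to_edges(annotations):
            if (reg, dst) not in graph[src]:
                graph[src].append((reg, dst))
    return graph


def paths_from(graph, start, visited=()):
    """List every cycle-free path that begins at `start`."""
    visited = visited + (start,)
    outgoing = [(reg, dst) for reg, dst in graph.get(start, []) if dst not in visited]
    if not outgoing:
        return [[start]]
    paths = []
    for reg, dst in outgoing:
        for tail in paths_from(graph, dst, visited):
            paths.append([start, reg] + tail)
    return paths


# Two placeholder sentences from different (hypothetical) articles.
sentence_1 = [("variant X", "Var"), ("enhances", "PosReg"), ("process Y", "MPA")]
sentence_2 = [("process Y", "MPA"), ("inhibits", "NegReg"), ("process Z", "CPA")]

graph = build_graph([sentence_1, sentence_2])
paths = paths_from(graph, "variant X")
for path in paths:
    print(" -> ".join(path))
```

Because the two sentences share the node "process Y", their edges join into a single path from the variant to the cellular-level process. Joining nodes by exact string match is a shortcut; in practice concept normalization would be needed to recognize the same entity across articles.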
As indicated in this research, although a single annotation source is limited for comprehensive bio-medical knowledge discovery over the huge COVID-19 literature repository, combining relevant annotations from different resources makes it possible to build a rich annotation data set that leads to knowledge with complete semantics.
Furthermore, the knowledge pattern suggested by LitCovid-AGAC is capable of offering a large amount of structured logical knowledge and of unveiling the pathological mechanism of COVID-19 at the cellular or molecular level.
In addition, it also makes sense to further curate the results obtained in LitCovid-AGAC, e.g., through concept normalization, co-reference resolution, and relation extraction. Meanwhile, it is instructive to visualize each knowledge entry in a syntactic way. The VSM box [
Conceptualization: JX. Data curation: YW, SO, KZ. Formal analysis: SO, YW, JX. Funding acquisition: JX. Methodology: SO, JX, YW. Writing - original draft: SO, JX. Writing - review & editing: JX.
No potential conflict of interest relevant to this article was reported.
AGAC corpus:
This work is partially funded by the HZAU intramural innovative science funding, grant no. 2662021JC008. We would like to express our gratitude for the many instructive discussions at the BLAH7 Hackathon (
The knowledge representation based on the LitCovid-AGAC corpus.
(A, B) A cellular level annotation example of LitCovid-AGAC data set.
(A, B) A molecular level annotation example of LitCovid-AGAC data set.
(A‒E) A light logical network inferred from LitCovid-AGAC data set.
Visualization semantic structure template.
The statistics of LitCovid-AGAC
Name | LitCovid-AGAC |
---|---|
Text type | Title, abstract |
Annotation count | AGAC – Var (444), MPA (1,162), CPA (298), NegReg (1,128), PosReg (402), Reg (1,169) |
 | LitCovid – Mutation (435), Species (152,939), Gene (23,795), Disease (285,135) |
 | OGER – GO (57,467), CHEBI (111,981) |
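
The per-source totals implied by the table can be checked with a few lines of arithmetic. The sketch below only sums the counts listed above and reports the order of magnitude of each total; the variable names are illustrative, and the "LitCovid" key follows the row label in the table, whose four labels the text attributes to PubTator.

```python
import math

# Annotation counts copied from the table above.
COUNTS = {
    "AGAC": {"Var": 444, "MPA": 1162, "CPA": 298,
             "NegReg": 1128, "PosReg": 402, "Reg": 1169},
    "LitCovid": {"Mutation": 435, "Species": 152939,
                 "Gene": 23795, "Disease": 285135},
    "OGER": {"GO": 57467, "CHEBI": 111981},
}

# Total number of annotations per source.
totals = {source: sum(labels.values()) for source, labels in COUNTS.items()}

# Order of magnitude (power of ten) of each total.
magnitudes = {source: math.floor(math.log10(total)) for source, total in totals.items()}

# Number of distinct labels across all sources.
label_count = sum(len(labels) for labels in COUNTS.values())

for source in COUNTS:
    print(f"{source}: {totals[source]:,} annotations (10^{magnitudes[source]})")
print(f"labels: {label_count}")
```

The totals are 4,603 for AGAC, 462,304 for the LitCovid row, and 169,448 for OGER, so the AGAC annotations are two orders of magnitude fewer than those from either of the other sources, and the 12 labels match the count given in the abstract.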