Predicting the relations between drugs and other molecular or social entities is the main pattern of drug-related knowledge discovery. Computational approaches combine information from different resources and levels, which provides a sophisticated understanding of the relationships among drugs, targets, diseases, and targeted genes at the molecular level, and among drugs, usage, side effects, safety, and user preference at the social level. In this review, previous work from the BioNLP community and on matrix or tensor decomposition is reviewed, compared, and summarized, and the BioNLP Open Shared Task is introduced as a promising case study representing this area.
Drug-related knowledge discovery is the process of discovering novel drug targets, drug side effects, drug-drug interactions (DDIs), and drug-disease or drug-indication associations. Such discoveries have mainly led to a better understanding of the molecular bases of drug efficacy, with a focus on the application scenarios of new drug discovery, drug development, and drug repurposing [
Generally,
In this review, we mainly focus on two typical
BioNLP is the application of natural language processing (NLP) methods to biomedical text, covering the recognition of biomedical entities such as macromolecules and the extraction of relations such as protein-protein or drug-drug interactions. As a hyponym of NLP, the term BioNLP appeared in the early 1990s [
In this section, we review the development of BioNLP in drug-related knowledge discovery by categorizing the resources on which each type of research was performed. Three kinds of text resources, i.e., large-scale curated data, small-scale corpora, and heterogeneous data, are introduced, together with the drug-related discovery approaches based on them. PubMed and OMIM are introduced as two representatives of large-scale curated data, which have traditionally served drug-related knowledge discovery for years; corpora are small-scale data designed to support high-quality text mining on large text collections; and finally, multi-omics data is introduced as heterogeneous data.
Released for the first time in 1996, PubMed has long been the main text resource for the BioNLP community to collect references and abstracts on life sciences and biomedical topics [
The 2014 version of PubMed Medline was explored by Yang et al. [
Named entity recognition (NER) tools were developed as well within the BioNLP community. Dozens of NER tools became popular, including tmChem [
In the meantime, the emergence of deep learning strategies in NLP propelled bio-NER dramatically by introducing novel and sophisticated deep neural network models, both as classifiers and as word embeddings. First, deep learning brought a new generation of neural networks that serve as effective classifiers, such as long short-term memory (LSTM) networks. Second, deep learning introduced semantic information, such as word embeddings, as input, which enhanced NER algorithms. For example, Habibi et al.’s work [
As a user-friendly platform run by NCBI, PubTator [
Besides PubMed, several other text resources serve drug-related knowledge discovery. Online Mendelian Inheritance in Man (OMIM,
OMIM, a popular knowledge base of human genes and genetic disorders, offers rich text describing the phenotypes of mutated genes. Wang and Zhang [
Besides PubMed as a text resource for published papers and OMIM for curated heredity-centric knowledge, ClinicalTrials.gov is a representative electronic health record (EHR) text resource, which was established in 1999 [
EHR data is a popular source of information in clinical and translational research for drug repurposing. Banda et al. [
In all, the development of knowledge discovery on large-scale data resources reveals the following trends:
(1) PubMed is still the main open-access resource at large scale; meanwhile, the lack of other text resources of comparable size and restrictions on full-text access hinder the development of large-scale knowledge discovery for bio-text miners.
(2) After years of development, NER of biomedical entities is no longer a technical bottleneck, which makes it possible to run comprehensive knowledge extraction tasks.
(3) As a result, combining open-access, PubMed-wide knowledge discovery with restricted-access EHR data for drug knowledge is expected to be a main research pattern in the next decade.
Early attempts to apply BioNLP to knowledge discovery were propelled by benchmark NLP corpora. A well-structured corpus undergoes a rigorous evaluation procedure that ensures its usability. The steps include annotation guideline design, annotation testing, and inter-annotator agreement computation.
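The inter-annotator agreement step can be illustrated with a small sketch. The text does not say which agreement statistic the reviewed corpora used; Cohen's kappa for two annotators is assumed here as one common choice, and the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one single label throughout
    return (observed - expected) / (1.0 - expected)


# Toy annotations of six text spans (hypothetical data).
annotator_a = ["Gene", "Gene", "Disease", "Var", "Disease", "Gene"]
annotator_b = ["Gene", "Disease", "Disease", "Var", "Disease", "Gene"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few labels dominate the annotation.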
The pioneering work was the corpus of DDIs used in DDI 2011 [
Corpus design and its applications gradually came to play a substantial role in drug-related knowledge discovery. In 2016, for the purpose of oncology knowledge discovery, Lee et al. [
Adverse drug reactions (ADRs), or drug side effects, have also attracted the attention of corpus designers. In that regard, Fang et al. [
Another focus of drug-related corpus construction is drug repurposing. Until now, corpora dedicated to drug repurposing have been rare. Recent progress came from Wang et al.’s work [
The development of drug-oriented corpus design shows the clear trends below.
(1) DDIs were a key focus in corpus design, and the DDI corpus has long been a tradition in drug-related corpus construction.
(2) Disease-oriented corpora covered drug-related knowledge curation, which directly served specific diseases and focused on tumors as targets.
(3) Drug-related ADR or side effect information was a focus in corpus design; it served the study of drug effects and also drew wider attention in medical and clinical applications.
(4) The mutation-centric corpus was a novel addition to drug-related corpora, aimed at the application of drug repurposing.
Unlike traditional text data, heterogeneous data generally comprises non-scientific text, such as social media, and various omics data, including genomic or proteomic data. While non-scientific text has enriched research on social concerns such as drug abuse, drug misuse, and drug safety, the various omics data achieved success through collaboration between the BioNLP and bioinformatics communities.
Just like Twitter served well for drug prescription and drug abuse [
With the emergence of multi-omics data, the integration of text data with genome or proteome data has attracted attention from a cross-disciplinary view, for the purpose of discovering drug-gene links. An early attempt at linking chemicals to candidate genes was made in the late 2000s by Li et al. [
In most cases, multi-omics data integration led to indirect link discovery between drugs and their targeted proteins or candidate loci. Zhang et al. [
To conclude, the availability of heterogeneous data has propelled drug-related knowledge discovery in both the social and bioinformatics domains.
(1) Social media data became an especially important resource for collecting public opinion, helping to address drug-related topics such as drug safety, drug usage, and drug side effects.
(2) Integrating text data with multi-omics data became a trend in drug-gene linking and therapeutic target discovery, and large-scale text data came to be regarded as one more member of the omics data family from the view of the bioinformatics community.
Matrix factorization, or decomposition, is an important technique for extracting information from a matrix or a tensor [
Compared with the large volume and varied patterns of BioNLP research on drug-related knowledge discovery, research on matrix or tensor decomposition is comparatively sparse and more topic-specific. In general, the adaptable data structure makes it possible to represent higher-order links, while the low-rank approximation makes it suitable for novel link discovery. A comprehensive review of mathematical illustration of the matrix decomposition (also known as matrix factorization) by Wang and Zhang’s work [
Matrix decomposition approximates a matrix as a sum of lower-rank matrices, and thereby models the data with a small number of factors [
In 2013, Zheng et al. [
Similarly, Liu et al.’s work [
Tensor decomposition appeared as early as 1927 [
It was a natural idea to map various kinds of drug-related information onto the axes of a tensor and thereby obtain an abstract knowledge structure. Khan et al. [
The above methods mainly filled the tensor axes with various drug-related domain data, such as gene expression or chemical features, and then mined novel links from the decomposed tensor. Meanwhile, a hybrid strategy of BioNLP and tensor decomposition came from Zhou et al. [
Across the above studies, matrix or tensor decomposition methods enabled investigators to input multiple data sources and thus provide more comprehensive information for prediction, which may improve prediction accuracy. The research trends of matrix or tensor decomposition in drug-related knowledge discovery are listed below.
(1) Matrices and tensors are natural data structures for holding multiple arrays of drug-related entries. A pair of linked entities, such as a drug-target or drug-drug pair, is mapped to a matrix element, while three linked entities, such as “drug,” “user,” and “label” in drug recommendation, are mapped to a cell in a tensor. Furthermore, higher-order links are mapped to higher-order tensors.
(2) Generally, novel links are inferred from the newly nonzero cells in the decomposed matrix or tensor. Methods differ according to the chosen decomposition algorithm. For example, a new link is inferred from the core tensor after decomposition in a RESCAL-based tensor decomposition, while a nonzero cell in the approximated tensor counts as a novel link in a CP decomposition.
(3) Three way tensors were the most popular choice in the knowledge inference applications. As shown in
(4) Knowledge inference algorithms, such as joint decomposition of matrices and tensors, bring the idea of data fusion into the decomposition strategy and make it possible to perform drug-related knowledge discovery by incorporating various kinds of heterogeneous data.
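The idea in point (2), that a low-rank approximation assigns nonzero scores to previously empty cells, can be sketched on a toy drug-target matrix. This is a minimal illustration only: it uses a rank-1 approximation computed by power iteration, not any of the specific algorithms reviewed above, and the drug names, target names, and matrix entries are all hypothetical.

```python
import math


def rank1_scores(matrix, iterations=200):
    """Rank-1 approximation of a nonzero matrix, computed by power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # initial right vector
    for _ in range(iterations):
        # u <- A v, normalized to unit length
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        # v <- A^T u; its norm converges to the largest singular value
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return [[sigma * u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical known links: 1 = reported drug-target link, 0 = unknown.
drugs = ["drug_A", "drug_B", "drug_C"]
targets = ["target_1", "target_2", "target_3"]
known = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

scores = rank1_scores(known)

# Unknown cells that receive a nonzero score are candidate novel links.
candidates = sorted(
    (
        (scores[i][j], drugs[i], targets[j])
        for i in range(len(drugs))
        for j in range(len(targets))
        if known[i][j] == 0
    ),
    reverse=True,
)
for score, drug, target in candidates:
    print(f"{drug} - {target}: {score:.3f}")
```

In practice a higher rank, regularization, and held-out evaluation would be needed before any scored cell is treated as a plausible link.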
The goal of drug-related knowledge discovery is to find novel knowledge about drugs and to use the newly identified drugs for disease therapy. In this review, we focused on BioNLP and on tensor or matrix decomposition methods for predicting novel alternative therapeutic indications.
Recent progress in drug-related knowledge discovery has led to several research trends:
(1) Well-annotated corpora are the core gold-standard datasets. Annotated corpora are crucial to BioNLP: they help to retrieve and extract information from biomedical text, and they provide standard data for repeatable training and evaluation of BioNLP methods.
(2) In the BioNLP community, NER tasks are being replaced by more complicated knowledge curation tasks. Information extracted from text by BioNLP can serve as the raw data from which prediction models find novel knowledge. With the recent development of PubTator, which aims to curate all of PubMed, NER and term normalization are largely solved.
(3) Applying BioNLP to drug-related knowledge discovery requires deeper integration of multi-omics data. Cross-disciplinary collaboration among the BioNLP, MedNLP, and bioinformatics communities is a promising approach.
(4) Knowledge inference based on tensor or matrix decomposition is regarded as a reliable prediction model. Integrating algorithms and theorems developed for knowledge graphs is a promising approach to various drug-related knowledge discovery problems.
To encourage cross-disciplinary collaboration across drug-related knowledge discovery, shared tasks have long been a stage for gathering researchers with different backgrounds, e.g., the series of BioNLP Shared Task (BioNLP-ST) workshops [
Aiming to gather text mining approaches among the BioNLP community to propel drug-oriented knowledge discovery, BioNLP Open Shared Task workshop (
The AGAC track provides the AGAC corpus and aims to extract mutation-disease knowledge from PubMed. The mutation-disease knowledge in this track links gene, mutation, and function change to disease; it not only captures the relationship between mutation and disease but also indicates the functional change caused by the mutation, i.e., gain of function (GOF) or loss of function (LOF). One application of this track is to improve the efficiency of drug discovery, since matching drugs with their target mutated genes must consider the correspondence between the function change of the mutated gene and the pharmacological activity of the drug.
The AGAC track contains three tasks.
(1) Trigger word NER: This task requires participants to recognize trigger words in PubMed abstracts and annotate them with their corresponding AGAC labels or entities (Var, MPA, Interaction, Pathway, CPA, Reg, PosReg, NegReg, Disease, Gene, Protein, and Enzyme).
(2) Thematic role identification: Identify the AGAC thematic roles (e.g., Theme Of, Cause Of) between trigger words.
(3) Gene-function mutation-disease link discovery: Extract the gene-(mutation)-function change-biology function or disease link. For example, given the sentence “Mutations in SHP-2 phosphatase that cause hyperactivation of its catalytic activity have been identified in human leukemias, particularly juvenile myelomonocytic leukemia,” participants need to extract (SHP-2–GOF–juvenile myelomonocytic leukemia).
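The output expected in task (3) can be pictured as a small record type. The sketch below is only an illustration of the triple from the example sentence; the class name, field names, and validation rule are hypothetical and are not part of the official AGAC data format, and only the two function-change labels named in the text (GOF and LOF) are assumed.

```python
from dataclasses import dataclass

# The two function-change labels named in the text.
FUNCTION_CHANGES = {"GOF", "LOF"}


@dataclass(frozen=True)
class MutationDiseaseLink:
    """Hypothetical container for one gene / function change / disease link."""

    gene: str
    function_change: str
    disease: str

    def __post_init__(self):
        # Reject labels outside the assumed GOF/LOF vocabulary.
        if self.function_change not in FUNCTION_CHANGES:
            raise ValueError(f"unknown function change: {self.function_change!r}")

    def as_tuple(self):
        return (self.gene, self.function_change, self.disease)


# The triple from the example sentence: hyperactivation is read as gain of function.
link = MutationDiseaseLink("SHP-2", "GOF", "juvenile myelomonocytic leukemia")
print(link.as_tuple())
```

Because the record is frozen and hashable, predicted and gold-standard links can be compared directly as sets.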
The baseline methods for tasks 1 and 2 were reported in Zhou et al.’s work [
Conceptualization: JX. Formal analysis: MG, YW. Writing - original draft: MG. Writing - review & editing: JX.
No potential conflict of interest relevant to this article was reported.
The authors would like to express their gratitude to Dr. Kevin Bretonnel Cohen for many interesting discussions and helpful suggestions about the paper, as well as to Dr. Jin-Dong Kim for many illuminating discussions during the BLAH5 workshop. Gratitude is also expressed to Mr. Kaiyin Zhou, Ms. Xuan Qin, Ms. Yuxin Ren, Ms. Shanghui Nie, and everyone who attended the AGAC discussion in the HZAU BioNLP seminar. This work is funded by the Fundamental Research Funds for the Central Universities of China (Project No. 2662018PY096).
Structure of a matrix and a three way tensor.
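The two structures named in this caption can be made concrete with a minimal sketch that maps paired entries to matrix cells and three linked entities to tensor cells. The helper names and all drug, target, user, and label values are hypothetical and serve only as an illustration.

```python
def build_matrix(pairs, rows, cols):
    """Binary matrix with a 1 in each cell named by a (row, col) pair."""
    row_index = {name: i for i, name in enumerate(rows)}
    col_index = {name: j for j, name in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for row, col in pairs:
        matrix[row_index[row]][col_index[col]] = 1
    return matrix


def build_tensor(triples, axis0, axis1, axis2):
    """Binary three-way tensor with a 1 in each cell named by a triple."""
    index0 = {name: i for i, name in enumerate(axis0)}
    index1 = {name: j for j, name in enumerate(axis1)}
    index2 = {name: k for k, name in enumerate(axis2)}
    tensor = [[[0] * len(axis2) for _ in axis1] for _ in axis0]
    for a, b, c in triples:
        tensor[index0[a]][index1[b]][index2[c]] = 1
    return tensor


drugs = ["drug_A", "drug_B"]
targets = ["target_1", "target_2", "target_3"]
users = ["user_1", "user_2"]
labels = ["effective", "side_effect"]

# Paired entries (drug, target) fill a 2 x 3 matrix.
matrix = build_matrix([("drug_A", "target_1"), ("drug_B", "target_3")], drugs, targets)

# Three linked entities (drug, user, label) fill a 2 x 2 x 2 tensor.
tensor = build_tensor(
    [("drug_A", "user_1", "effective"), ("drug_B", "user_2", "side_effect")],
    drugs, users, labels,
)
print(matrix)
print(tensor)
```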