Perspectives on Clinical Informatics: Integrating Large-Scale Clinical, Genomic, and Health Information for Clinical Care

In Young Choi; Tae-Min Kim; Myung Shin Kim; Seong K. Mun; Yeun-Jun Chung

doi:10.5808/GI.2013.11.4.186

Abstract

Advances in electronic medical records (EMRs) and bioinformatics (BI) represent two significant trends in healthcare. The widespread adoption of EMR systems and the completion of the Human Genome Project have driven the development of technologies for data acquisition, analysis, and visualization in two different domains. The massive amount of data from both the clinical and biological domains is expected to enable personalized, preventive, and predictive healthcare services in the near future. The integrated use of EMR and BI data needs to consider four key informatics areas: data modeling, analytics, standardization, and privacy. Bioclinical data warehouses integrating heterogeneous patient-related clinical or omics data should be considered. A representative standardization effort, the Clinical Bioinformatics Ontology (CBO), aims to provide uniquely identified concepts that cover molecular pathology terminologies. Because individual genome data can readily be used to predict current and future health status, different safeguards to ensure confidentiality should be considered. In this paper, we focus on the informatics aspects of integrating the EMR and BI communities by identifying opportunities, challenges, and approaches to provide the best possible care for our patients and the population.

Keywords: clinical data warehouse, database, electronic health records, medical informatics

Introduction

The development of electronic medical record (EMR) systems began as a means to document clinical activities for in-patients and out-patients [1]. These systems have since evolved into the primary front-line clinical tool that medical professionals use for patient care. The completion of the Human Genome Project opened the era of research in genomics and proteomics, and genome research provides keys to understanding the mechanisms of disease. In clinical informatics, the widespread adoption of EMR systems has generated large amounts of heterogeneous clinical data, some structured and some unstructured. In addition, the explosive growth of health-related content from online communities, mobile applications, and electronic personal health records has increased the availability of non-traditional data on individual activities and lifestyle [2]. In genetics, since the completion of the Human Genome Project in 2003, the acquisition, analysis, and presentation of whole-genome data have become steadily faster, cheaper, and more reliable [3]. Such dramatic technological advances affect the development of new prevention, diagnosis, and treatment patterns for routine clinical care. The massive amount of heterogeneous data from these two domains is expected to enable personalized, preventive, and predictive healthcare services in the near future [4].

The integrated use of EMR and bioinformatics data is beginning to change the research paradigm, namely through the rapid introduction of new concepts at the point of care. Dr. Wang defined clinical bioinformatics (CBI) as "the clinical application of bioinformatics-associated sciences and technologies to understand molecular mechanisms and potential therapies for human disease" [5]. CBI aims to address the challenge of integrating genomic and clinical data to accelerate the translation of knowledge into effective treatment plans and personalized prescriptions. It is intended to assist clinicians in various ways, including new biomarker discovery, identification of genotype and phenotype correlations, and pharmacogenomics at the point of care.

Biomedical informatics is a more popular term. It is defined as an emerging, multidisciplinary field that integrates the computational methods and diverse technologies used in life science research, such as genomics, proteomics, systems biology, and computer science, with healthcare applications, such as electronic health records (EHRs) [6]. The adoption of EMRs makes it possible to conduct comprehensive phenotypic-genotypic association studies by combining the genotypes obtained from whole-genome sequencing of a given cohort with the phenome data of the same population available in the EMR database, as shown in Fig. 1.
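The linkage of genotype and phenome data described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each data source is a dictionary keyed by a shared patient identifier and that both the variant and the phenotype are binary; the function names `link_cohort` and `odds_ratio` are hypothetical. An actual association study would add significance testing and adjustment for confounders.

```python
from collections import Counter


def link_cohort(genotypes, phenotypes):
    """Pair genotype and phenotype for patients present in both sources.

    genotypes:  {patient_id: carries_variant (bool)}   (hypothetical layout)
    phenotypes: {patient_id: has_phenotype (bool)}     (hypothetical layout)
    """
    shared = genotypes.keys() & phenotypes.keys()
    return [(genotypes[pid], phenotypes[pid]) for pid in sorted(shared)]


def odds_ratio(pairs):
    """Odds ratio of the phenotype in variant carriers versus non-carriers."""
    counts = Counter(pairs)
    a = counts[(True, True)]    # carrier, phenotype present
    b = counts[(True, False)]   # carrier, phenotype absent
    c = counts[(False, True)]   # non-carrier, phenotype present
    d = counts[(False, False)]  # non-carrier, phenotype absent
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (a * d) / (b * c)
```

Only patients who appear in both the sequencing cohort and the EMR contribute to the 2x2 table, which is why the join on a shared identifier comes first.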

The major challenge in enabling such convergent research is to provide easy storage, user-friendly visualization, speedy analysis, knowledge generation, and presentation of clinically relevant information at the point of care. Relevant information should be extracted and linked with medical records in a clinically applicable manner. The transition from discovery research to clinical implementation can only be accomplished through stringent data management, analysis, interpretation, and quantification in a multidisciplinary research environment. In addition, since genetic information can easily identify an individual, ethical issues, such as informed consent and stewardship over this database, should also be considered as the data grow. The communities of EMRs and bioinformatics (BI) have different histories. While the EMR community has focused on clinical activities and clinical workflows, the BI community originated from the biological research community, which included physicists, computer scientists, statisticians, and clinical researchers. What are the optimal ways to integrate the tremendous advances in BI into routine clinical work? In this paper, we focus primarily on the informatics aspects of this large question by identifying opportunities, challenges, and approaches toward the ultimate goal of providing the best possible care for our patients and the population.

Recent Projects Using Bio-enabled EHR

A number of research projects using large-scale health record datasets have been conducted in various communities around the globe. For example, the NSF BIGDATA program solicitation (http://www.nsf.gov), which is partially funded by the National Institutes of Health (NIH), includes large-scale data collection and analysis. This program aims to advance the core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large, diverse, distributed, and heterogeneous datasets.

INBIOMEDvision (http://www.inbiomedvision.eu) is a two-year initiative funded by the European Commission 7th Framework Program of Information and Communication Technologies (ICT) that aims to bridge the communities of bioinformatics and medical informatics. As an FP7 coordination and support action, it promotes biomedical informatics through permanent monitoring of the scientific state of the art and existing activities in the field, prospective analyses of emerging challenges and opportunities, and dissemination of knowledge [7].

The Electronic Medical Records and Genomics (eMERGE) Network is a national consortium organized by the National Human Genome Research Institute (NHGRI) to develop, disseminate, and apply approaches to research that combine DNA biorepositories with EMR systems for large-scale, high-throughput genetic research, with the ultimate goal of returning genomic testing results to patients in a clinical care setting. The network is currently exploring more than a dozen phenotypes (with 13 additional electronic algorithms having already been published). Various models for returning clinical results have been implemented or are planned for pilot studies at sites across the network. Themes of bioinformatics, genomic medicine, privacy, and community engagement are of particular relevance to eMERGE [8].

In addition to nationwide collaborative projects, individual research projects have used large-scale EHRs. Hanauer et al. [9] used large-scale, longitudinal EHR data to conduct research on associations of medical diagnoses and to explore patterns of specific disease progression. Lin et al. [10] proposed symptom-disease-treatment (SDT) association rule mining and applied it to a comprehensive EHR of approximately 2.1 million records from a major hospital. Based on selected International Classification of Diseases (ICD-9) codes, they were able to identify clinically relevant and accurate SDT associations from patient records in seven distinct diseases, ranging from cancers to chronic and infectious diseases.
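For readers unfamiliar with association rule mining, the following Python sketch shows the two standard measures, support and confidence, that such methods rank rules by. It is a generic illustration and not the specific SDT algorithm of Lin et al.; the record layout (one set of coded items per patient record) and the function name `rule_metrics` are assumptions made here.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    records: list of sets, each holding the coded items of one patient record
             (hypothetical layout, e.g. {"symptom:cough", "dx:pneumonia"}).
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_records = len(records)
    # Records that contain every item of the antecedent.
    n_antecedent = sum(1 for r in records if antecedent <= r)
    # Records that contain every item of both sides of the rule.
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / n_records if n_records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence
```

Support measures how often the whole pattern occurs in the record set, while confidence measures how often the consequent follows when the antecedent is present; a mining procedure typically keeps only rules above chosen thresholds for both.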

Challenges for Genome-Enabled EMR

Research combining biology and clinical information must address "the storage, retrieval, analysis, and dissemination of molecular information in a clinical setting," as suggested by the genomics working group initiated by the American Medical Informatics Association (AMIA) [11]. The AMIA proposed the following specific areas: 1) development of a database structure to unify clinical and genomic data, 2) connecting biology information with patient health records, 3) development of a genome-enabled EMR system, 4) linking clinical trial and drug discovery information, 5) supporting the development of benchmark clinical/molecular datasets, 6) developing clinical decision-support tools utilizing molecular information, and 7) visualizing and modeling the molecular basis of disease.

The major challenge is how to integrate the heterogeneous data into one database system. Should there be a single database, or should one consider a federated model? Furthermore, one should consider the various use cases of various users, which would determine the overall system architecture. Moving large amounts of genomic data is not practical: the expected raw sequencing data for one person is approximately 4 terabytes. How and where, then, should the high-intensity computation be managed? The integrated database can have a potential impact on the prevention, diagnosis, and treatment of disease. To realize this potential, it is important to connect genomic data with clinical information. Genetic test results are already used to assess risk in breast cancer patients, to anticipate potential adverse drug reactions based on an individual patient's metabolism, and to identify treatment plans for cancers. Genetic test results have suggested a diagnosis for patients with neuropathy, inflammatory bowel disease, and Proteus syndrome and have guided therapeutic care for patients with arterial calcifications, movement disorders, and Miller syndrome [12-17]. The number of applications of genomics in diagnosing diseases and guiding treatment procedures in the clinic will continue to increase.
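The data-movement concern raised above can be made concrete with a back-of-the-envelope calculation. The sketch below takes the figure of approximately 4 terabytes of raw sequencing data per person from the text; the cohort size and the network bandwidth are hypothetical inputs chosen by the reader, and decimal units (1 TB = 10^12 bytes) and a fully utilized link are assumed.

```python
TB = 10**12  # bytes in one decimal terabyte (assumed unit convention)


def cohort_bytes(n_people, per_person_tb=4):
    """Raw sequencing volume for a cohort, at about 4 TB per person."""
    return n_people * per_person_tb * TB


def transfer_days(total_bytes, link_gbit_per_s):
    """Days needed to move the data over a fully utilized network link."""
    bytes_per_second = link_gbit_per_s * 10**9 / 8
    return total_bytes / bytes_per_second / 86400


# Hypothetical scenario: a 1,000-person cohort moved over a 1 Gbit/s link.
volume = cohort_bytes(1000)
days = transfer_days(volume, 1)
```

Under these assumptions the cohort amounts to 4 petabytes and the transfer would take roughly a year, which is one way to see why a federated model that brings computation to the data may be preferable to copying raw data into a single database.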

Key Informatics Issues to Consider

The integrated use of EMR and BI data needs to consider four key informatics areas: data modeling, analytics, standardization, and privacy.

Data modeling and data warehouse

Data modeling and data warehousing are two key concepts in a sound systems architecture for medicine and bioinformatics. The healthcare community has developed large clinical databases known as clinical data warehouses (CDWs) [18]. First-generation CDWs are mostly stored in commercial relational database management systems and collect structured content. Extraction, transformation, and loading (ETL) is essential for converting and integrating distributed healthcare data. Ad hoc reporting tools based on online analytical processing (OLAP) technology are used to obtain simple, intuitive analysis results in clinical informatics.
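As a minimal sketch of the extraction, transformation, and loading step described above, the following Python example uses the standard `sqlite3` module with an in-memory database as a stand-in for a warehouse. The two source layouts, the table name `lab_fact`, and the test-name mapping are all hypothetical and exist only to show the pattern; a production CDW would involve far more validation and a richer schema.

```python
import sqlite3

# Hypothetical exports from two source systems, each with its own layout.
SOURCE_A = [{"pid": "001", "test": "Na", "value": "140"}]
SOURCE_B = [{"patient": "002", "analyte": "SODIUM", "result": 138.0}]

# Hypothetical mapping from local test names to one warehouse test name.
TEST_NAMES = {"Na": "sodium", "SODIUM": "sodium"}


def transform(row):
    """Convert a row from either source layout into (patient_id, test, value)."""
    if "pid" in row:  # layout used by SOURCE_A
        return (row["pid"], TEST_NAMES[row["test"]], float(row["value"]))
    return (row["patient"], TEST_NAMES[row["analyte"]], float(row["result"]))


def load(conn, rows):
    """Create the fact table if needed and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lab_fact "
        "(patient_id TEXT, test TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO lab_fact VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, sources):
    """Extract rows from every source, transform them, and load them."""
    rows = [transform(row) for source in sources for row in source]
    load(conn, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = run_etl(conn, [SOURCE_A, SOURCE_B])
# A simple aggregate of the kind an ad hoc reporting tool might issue.
mean_sodium = conn.execute(
    "SELECT AVG(value) FROM lab_fact WHERE test = 'sodium'"
).fetchone()[0]
```

Once both sources share one schema and one test name, a single aggregate query can span data that originally lived in separate departmental systems.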

Bioclinical data warehouses that integrate heterogeneous patient-related clinical or omics data should be considered. Applications that analyze bioclinical data warehouses will include genetic epidemiology and the evaluation of decision-support systems before they are deployed in production.

In less than a decade, the Human Genome Project has generated a large amount of biological data for use in diagnostics, prognostics, and therapeutics [19]. The central products that use a standardized data model are GenBank [20], SWISS-PROT [21], Exon-Intron [22], and IMGT [23].

However, standardized data models for integrated bioclinical data warehouses have not been developed yet. This area will be important to allow researchers to rapidly spread information throughout the world and inspire thousands of research projects.

Experience in designing storage for digital radiological imaging, in the form of picture archiving and communication systems (PACS), may provide some guidance for bioinformatics data warehouses. In PACS, there are multiple image stores that are not necessarily part of the EMR; however, the diagnostic reports are part of the EMR. The PACS images are then made available to physicians as needed.
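The PACS-style separation described above, in which the interpreted report lives in the EMR while the bulk data stay in a separate store and are fetched on demand, can be sketched as a simple data structure. The class names `BulkStore` and `EmrEntry` and the key format are hypothetical, and an in-memory dictionary stands in for what would be a dedicated storage system.

```python
from dataclasses import dataclass, field


@dataclass
class BulkStore:
    """Stand-in for bulk storage kept outside the EMR (as PACS is for images)."""
    _blobs: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def fetch(self, key: str) -> bytes:
        return self._blobs[key]


@dataclass
class EmrEntry:
    """EMR-side record: the report is stored here, the raw data only by reference."""
    patient_id: str
    report: str        # interpreted result, part of the EMR
    raw_data_key: str  # pointer into the external bulk store


def open_raw_data(entry: EmrEntry, store: BulkStore) -> bytes:
    """Retrieve the bulk data behind an EMR entry only when it is needed."""
    return store.fetch(entry.raw_data_key)
```

The design choice mirrored here is that the EMR stays small and clinically focused, while the pointer preserves access to the underlying data for reanalysis.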

Analytics

The analytics technology commonly used for these systems consists mainly of traditional data mining techniques developed in the 1980s. In the 1990s, the popular term for analysis technology in the business and computer science communities was business intelligence. Recently, technologies that provide advanced and unique data storage, management, analysis, and visualization have become important in applications involving very large and complex databases.

Standardization

The lack of standardized vocabularies for clinical informatics has hampered the development of automated clinical decision support systems. For example, finding the laboratory terms that represent serum sodium across different database systems is troublesome in multi-center data integration. The National Library of Medicine has developed the Unified Medical Language System (UMLS) to enable these different vocabularies to interoperate by developing a vocabulary that links them, at least at a basic level [24]. The same problems are known in bioinformatics. DNA sequences have different names and are joined in some databases only with varying levels of confidence. Codifying molecular diagnostic or cytogenetic results with existing medical vocabularies is difficult because these vocabularies lack sufficient terms for molecular findings. For example, the Systematized Nomenclature of Medicine (SNOMED) has minimal codes related to the description of molecular diagnostic findings. The Logical Observation Identifiers Names and Codes (LOINC) vocabulary has recently added a significant number of molecular pathology terms; however, it lacks the rich context-defining relationships provided by an ontology [24]. The Clinical Bioinformatics Ontology (CBO) addresses this gap by providing uniquely identified concepts related to clinically significant molecular findings. The CBO consists of nearly 7,000 concepts, each of which is associated with a globally unique identifier, linked by more than 15,500 relationships [25-27]. As efforts to integrate the two domains increase, resolving vocabulary issues will be essential.
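The serum sodium problem mentioned above can be illustrated with a small Python sketch in which local laboratory codes from different sites are mapped onto one shared concept identifier before pooling. All site names, local codes, and concept identifiers below are hypothetical placeholders and are not actual UMLS, LOINC, SNOMED, or CBO codes.

```python
# Hypothetical mapping table: (site, local code) -> shared concept identifier.
LOCAL_TO_CONCEPT = {
    ("site_a", "NA_SER"): "concept:serum_sodium",
    ("site_b", "Sodium, serum"): "concept:serum_sodium",
    ("site_b", "K+"): "concept:serum_potassium",
}


def normalize(site, local_code):
    """Return the shared concept for a local code, or None if it is unmapped."""
    return LOCAL_TO_CONCEPT.get((site, local_code))


def pool_by_concept(results):
    """Group (site, local_code, value) results under their shared concept.

    Results whose local code has no mapping are collected under "unmapped"
    so that they can be reviewed rather than silently dropped.
    """
    pooled = {}
    for site, local_code, value in results:
        concept = normalize(site, local_code)
        key = concept if concept is not None else "unmapped"
        pooled.setdefault(key, []).append(value)
    return pooled
```

Keeping unmapped codes visible reflects the practical point in the text: the mapping table is the hard part, and gaps in it need curation by someone who knows the local systems.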

Privacy

Individual genomic data can readily identify a person and can be used to predict current and future health status. Extracting knowledge from large health datasets therefore carries a significant risk of privacy breaches, and researchers need to consider Health Insurance Portability and Accountability Act (HIPAA) and Institutional Review Board (IRB) requirements when building a privacy-preserving and trustworthy database infrastructure and conducting research [28]. When personal genetic data are incorporated into EMR systems, different safeguards to ensure confidentiality will be required. A de-identified bio-data warehouse combining traditional clinical and genomic information will be essential for conducting translational research.
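One building block of a de-identified warehouse is replacing patient identifiers with pseudonyms and dropping direct identifiers before records are loaded. The Python sketch below uses a keyed hash (HMAC-SHA256 from the standard library) for this purpose. It is an illustration only: the field names and the short list of direct identifiers are hypothetical and incomplete, the key would have to be managed securely, and this step alone does not satisfy HIPAA or IRB requirements, particularly because genomic data can themselves identify a person.

```python
import hashlib
import hmac

# Illustrative, deliberately incomplete list of direct identifier fields.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def pseudonym(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a patient ID with a keyed hash."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and replace the patient ID with a pseudonym."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    cleaned["pseudo_id"] = pseudonym(record["patient_id"], secret_key)
    return cleaned
```

Because the same patient ID and key always yield the same pseudonym, clinical and genomic records for one patient can still be linked inside the warehouse without exposing the original identifier.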

Era of Open Global Collaboration

One major difference between the EMR community and BI community is the degree of multidisciplinary open collaboration. In EMRs, multidisciplinary collaborative efforts have been limited. Most of the EMR systems in use today are based on proprietary software that hampers data exchange with other systems. In the BI community, global collaboration, aided by wide use of open source software and development methodologies, has been a key success factor. The BI community has been global in scope from the beginning, and information sharing and free exchange of software tools have ushered in the era of open source software in health informatics. There is a great need for the EMR community to adopt more open source software methodologies that would allow rapid global collaboration (http://www.osehra.org).

Conclusion

Genomic technologies hold the potential to improve the diagnosis and treatment of inherited and complex diseases, including cancer, and to facilitate the move towards personalized predictive medicine. The higher throughput and rapidly falling costs of next-generation sequencing have resulted in voluminous genomic data and downstream computational challenges. Thus, the shift from this powerful discovery research to clinical implementation can only be accomplished through careful integration with EMRs, the frontline patient care tool. The most prominent reason to integrate clinical and biological information under the same system is to provide opportunities for bi-directional exchange of data, technology, and knowledge between two disciplines with different histories and cultures. Additionally, open global cooperation will provide opportunities to make rapid progress in understanding, treating, and preventing human diseases.