Hybrid capture-based targeted sequencing is being used increasingly for genomic variant profiling in tumor patients. Unique molecular index (UMI) technology has recently been developed and helps to increase the accuracy of variant calling by minimizing polymerase chain reaction biases and sequencing errors. However, UMI-adopted targeted sequencing data analysis is slightly different from the methods for other types of omics data, and its pipeline for variant calling is still being optimized in various study groups for their own purposes. Due to this provincial usage of tools, our group built an analysis pipeline for global application to many studies of targeted sequencing generated with different methods. First, we generated hybrid capture-based data using genomic DNA extracted from tumor tissues of colorectal cancer patients. Sequencing libraries were prepared and pooled together, and an 8-plexed capture library was processed to the enrichment step before 150-bp paired-end sequencing with Illumina HiSeq series. For the analysis, we evaluated several published tools. We focused mainly on the compatibility of the input and output of each tool. Finally, our laboratory built an analysis pipeline specialized for UMI-adopted data. Through this pipeline, we were able to estimate even on-target rates and filtered consensus reads for more accurate variant calling. These results suggest the potential of our analysis pipeline in the precise examination of the quality and efficiency of conducted experiments.
The development of next-generation sequencing (NGS) has brought remarkable growth in our understanding of human genome variants through comprehensive characterization. On top of that, various efforts have been made to find associations of these understandings with diseases, including cancers [
Particularly, detecting various mutations, including somatic mutations, is essential to comprehend the cancer genome [
However, the next predicament is the analysis of the targeted sequencing data, since the variants revealed in the data are hard to discriminate from false-positive errors. In detail, innate sequencing errors and early cycle polymerase chain reaction (PCR) biases during library amplification or target enrichment could be considered super or rare mutations during the variant calling process. To overcome this erroneous algorithm, unique molecular index (UMI) technology was recently developed [
In this study, to build a versatile analysis pipeline to apply to various types of hybrid capture-based targeted sequencing, our laboratory evaluated various types of analysis tools, like Fulcrum genomics (fgbio,
Tumor tissues were dissected from 6 different colorectal cancer patients. Tumor samples were labeled from 1T to 7T, and sample 4T was omitted. NA24385 of the HapMap project and AccuRef Quan-Plex NGS Reference Standard Genomic DNA (cat# ARF-1001G-1; AccuRef, Milpitas, CA, USA) were used as positive controls.
For hybrid capture, all coding sequences of 46 genes and non-coding sequences of some of those genes were targeted. Pre-designed and customized probe sets were manufactured by Integrated DNA Technologies (IDT, Coralville, IA, USA).
To analyze UMI-adopted targeted sequencing data, we generated targeted sequencing libraries. During library generation, UMI sequences were integrated, along with the P5 sequencing adapter. Eight libraries were pooled together into one 1.7-mL microtube, and this 8-plex pooled library was then hybridized using IDT xGen LockDown pre-designed/custom probes. Targeted sequencing data of hybrid-capture library were generated using HiSeq series by 150-bp paired-end sequencing.
Using BclToFastq, the .bcl file was processed and divided into fastq files of read 1 (R1), read 2 (R2), and UMI. To generated unmapped bam files, R1.fastq and R2.fastq files were processed using FastqToBam (Fulcrum Genomics/fgbio, v0.7.0) (
UmiAwareMarkDuplicateWithMateCigar (Picard) counted duplicates among the raw reads to estimate duplicate rates. As this tool is still under development, we recommend a regular check with the developers on further validation. To calculate on-target rates (%), CollectHsMetrics (Picard) is used to generate HsMetrics. By expanding 100 bp of chromosome coordinates (start/end site) in target.bed, on-target %, including flanking regions, was calculated using the same tool. To calculate the read coverage, the following equation was used: Coverage = Read length(bp) × The number of reads/Genome size (bp). According to the manufacturer, the genome size refers to the target size of the probes.
First, total read counts were counted by either Picard tools or fgbio tools. An 8-plexed hybrid capture library generated 100 Gbp data, and approximately 15 Gbp data were generated from each sample. Considering the fact that the target genome size is 0.3 Mbp, the read coverages were calculated as 50,000× to 70,000×, as shown in
Before removing sequence duplicates, we estimated duplicate read counts via UmiAwareMarkDuplicateWithMateCigar (Picard). We could measure duplicate rates of >70% of the reads (
We then estimated the number of consensus reads to be ready for more accurate variant calling. Through this filtering, we were able to get the consensus coverage down to one-sixth of the raw coverage (
In this study, we built an analysis pipeline for targeted sequencing data generated based on the hybrid-capture method. As UMI-adopted targeted sequencing data are notorious for their novelty and complexity, we mainly focused on finding tools and optimizing methods for the analysis.
Our data show an on-target rate of 37%, and there is little variation among samples. This may suggest that almost equal amounts of libraries of each sample were used for hybridization. However, because 37% is relevantly low for hybridization efficiency, the estimation of the on-target rates shows that still there is room for improvement.
The reason why on-target rate is important is that by estimating on-target rates and further upgrades of protocols, the quality of targeted hybrid-capture sequencing could be improved. In detail, protocol modifications during hybridization steps could possibly bring about an increase of on-target rates. For example, a slightly excessive amount of input DNA or target probes can increase the off-target effects during hybridization. Furthermore, inconsistent temperature or slightly higher/lower temperature than the proper temperature could bring about larger off-target effects than expected.
With respect to the pipeline, when using CallMolecular ConsensusReads, insertion and deletion errors are not considered in the consensus model. Therefore, realignment steps using other methods, such as IndelRealigner (GATK, v4.0.2.1), should be integrated for better and more precise analysis for identification of short indels. Furthermore, even though we optimized the tools and customized python codes for analyzing UMI data, there are still many tools that could be used in one analysis pipeline. We are still trying to minimize irrelevant steps to simplify the process.
In summary, we have built an analysis pipeline specialized for UMI-adopted hybrid-capture-based data. Given the fact that the precision medicine era has been coming lately and that targeted sequencing and UMI technology help to comprehensively understand the genomewide status of cancer patients, this report suggests that the quality of the experiment can be examined precisely and efficiently by this pipeline, and our laboratory sees its positive potential in being widely used for studies in various clinical fields.
This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF), funded by the Ministry of Science & ICT (grant number: NRF-2017M3A9A7050614).
No potential conflicts of interest relevant to this article was reported.
Conceptualization: MJK, YJK
Data curation: MJK
Formal analysis: MJK, SCK
Funding acquisition: YJK
Methodology: MJK, SCK, YJK
Writing – original draft: MJK
Writing – review & editing: MJK, YJK
Analysis pipeline. Tools and methods used in different file generation (fastq to input files for variant calling) are shown in the flow chart.
Grouped histogram of on-target rate (%). On-target rate was expressed using the stacked histogram. The x-axis shows 6 tumor samples and one positive control. The upper part of each bar shows on-target % when the target region is enlarged with additional 50-bp flanking parts on both sides of each region.
Duplicate rate. Duplicate read ratio of each sample. Two positive controls are shown in the first two samples. The average duplicate rate is 0.73.
Comparison of sequencing coverage from raw reads to filtered reads. The read coverage calculated from the read counts are shown with bar plots.
Read counts and on-target rates (%)
Samples | Total reads | Raw coverage | On-target (%) | On-target (%)(FLANK) | On-target coverage | On-target coverage (FLANK) |
---|---|---|---|---|---|---|
NA24385 | 111,609,098 | 55,805 | 36.3 | 49 | 20,257 | 27,344 |
AccuRef | 105,917,538 | 52,959 | 37.2 | 49.9 | 19,701 | 26,427 |
1T | 101,998,476 | 50,999 | 36.3 | 48.8 | 18,513 | 24,888 |
2T | 115,406,592 | 57,703 | 36.2 | 48.7 | 20,888 | 28,101 |
3T | 139,988,838 | 69,994 | 37 | 49.7 | 25,898 | 34,787 |
5T | 120,756,350 | 60,378 | 37 | 49.7 | 22,340 | 30,008 |
6T | 110,765,272 | 55,383 | 37 | 49.7 | 20,492 | 27,525 |
7T | 109,743,358 | 54,872 | 37 | 49.7 | 20,303 | 27,271 |
Total raw reads and calculated coverage of raw data are shown in the first and second columns of the table, respectively. After examining on-target (%) with or without flanking regions, on-target coverage values were calculated in the last two columns. NA24385 and AccuRef were used as positive control samples.