Beta-Meta: a meta-analysis application considering heterogeneity among genome-wide association studies
Abstract
Many packages for the meta-analysis of genome-wide association studies (GWAS) have been developed to discover genetic variants. Although variations across studies must be considered, few currently accessible packages estimate between-study heterogeneity. Thus, we propose a Python-based application called Beta-Meta, which can easily process a meta-analysis by automatically selecting between a fixed effects and a random effects model based on heterogeneity. Beta-Meta implements flexible input data manipulation to allow multiple meta-analyses of different genotype-phenotype associations in a single process. It provides a step-by-step meta-analysis of GWAS for each association in the following order: heterogeneity test, two different calculations of an effect size and a p-value based on heterogeneity, and the Benjamini-Hochberg p-value adjustment. These methods enable users to validate the results of individual studies with greater statistical power and better estimation precision. We elaborate on these methods and illustrate them with examples from several studies of infertility-related disorders.
Introduction
Genome-wide association studies (GWAS) of diseases and traits have increasingly been used to identify single nucleotide polymorphisms (SNPs). Although GWAS have tested hundreds of thousands of genetic variants to discover genotype-phenotype associations, they have a few limitations. Variants discovered in individual GWAS explain only a small proportion of heritability, and their genetic effect sizes are mostly small and require a substantial sample size to identify [1,2]. Moreover, some studies examining the same genotype-phenotype association yield inconsistent results such as variant effect sizes in opposite directions [3,4]. To overcome these limitations, a meta-analysis of GWAS has been used extensively since it can improve the statistical power by combining data across any number of independent studies and can clarify heterogeneity among their results [5].
As meta-analysis has become a popular tool for aggregating data from multiple sources, several studies have revised analytical strategies from previous well-known studies [6-9]. A weighted average of the effect sizes can be calculated under a fixed effects model or a random effects model, but the fixed effects model can lead to false-positive results when there is heterogeneity between studies [9,10]. Even though it is important to use appropriate approaches for meta-analyses, few available tools provide a step-by-step calculation that runs both the fixed effects model and the random effects model [10]. Therefore, for those who find it difficult to conduct a meta-analysis, we have developed a flexible data processing tool that adopts the revised methods for assessing heterogeneity between studies and uses the Benjamini-Hochberg (BH) procedure to calculate adjusted p-values [11]. In addition to these methods, Beta-Meta has several convenient features, such as automatic selection between the two models depending on the quantified heterogeneity. It also processes data flexibly: it can perform a varying number of meta-analyses simultaneously, and it flips strands automatically when the strand orientation differs between studies. We have also included the haploR package [12], which detects alternative SNPs by estimating their correlations.
Since it is crucial to increase statistical power in order to identify significant variants, especially in studies with small sample sizes, we demonstrate Beta-Meta using studies of diseases related to infertility, most of which have relatively small sample sizes [4,13-34].
Methods
Fig. 1 depicts the four steps of Beta-Meta: input data manipulation, heterogeneity test, weighted effect size calculation under the fixed and random effects models, and output data of summary statistics after the BH adjustment.
Linkage disequilibrium calculation
Meta-analysis can improve signal detection when we account for not only between-study heterogeneity but also differences in linkage disequilibrium (LD) between ethnicities [35]; in addition, several trans-ethnic meta-analyses have identified previously unknown susceptibility genes [35-37]. As it is important to consider differences in LD, we use the haploR package [12], which queries the HaploReg database [38] and returns alternative SNPs in LD. By calculating pairwise metrics of LD in each continental population, LD structures between ethnicities can be discovered, and hence alternative SNPs can be used for the subsequent meta-analysis [38]. This step is optional; users may skip it and start a meta-analysis when the summary statistics of their target SNPs have already been obtained.
Input data manipulation
After surveying the studies of interest (infertility-related disorders in this paper), we created a table of input data in Excel (Supplementary Table 1). Beta-Meta can read an Excel file as input data, which must include phenotypes, SNPs, effect and non-effect alleles, effect sizes, and p-values. For the effect sizes and their levels of significance, either the beta coefficient and its standard error or the odds ratio (OR) and its confidence interval can be used. As Beta-Meta calculates each SNP-phenotype association separately, a single input file may include as many phenotypes as desired.
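The separate handling of each SNP-phenotype association can be pictured as a grouping step. The following is only a minimal sketch: the field names (`phenotype`, `snp`, `study`, `beta`, `se`) and the rows are hypothetical placeholders, not the actual column names or data of the Beta-Meta input file.

```python
from collections import defaultdict

# Hypothetical rows standing in for an input table; names and values are illustrative only.
rows = [
    {"phenotype": "trait_A", "snp": "rs0000001", "study": "study_1", "beta": 0.20, "se": 0.05},
    {"phenotype": "trait_A", "snp": "rs0000001", "study": "study_2", "beta": 0.15, "se": 0.07},
    {"phenotype": "trait_B", "snp": "rs0000002", "study": "study_3", "beta": -0.10, "se": 0.04},
]


def group_by_association(rows):
    """Collect the per-study rows belonging to each (phenotype, SNP) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["phenotype"], row["snp"])].append(row)
    return dict(groups)


groups = group_by_association(rows)
for key, studies in groups.items():
    # Each group would then be meta-analyzed independently of the others.
    print(key, len(studies))
```

Each group can then be passed through the remaining steps on its own, which is why one file can hold any number of phenotypes.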
When the OR and its confidence interval are used as input data, they are converted into the beta coefficient and the standard error, respectively. The normalized effect of the $i$-th study, $\beta_i$, is the natural logarithm of the OR, $\beta_i = \ln(\mathrm{OR}_i)$ for $i = 1, \ldots, k$, where $k$ is the number of individual studies, each of which is designed to examine the same SNP-phenotype association [9].
The standard error $s_i$ is calculated from the 95% confidence interval of the OR.
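As an illustration of this conversion, the sketch below assumes the usual relationship in which the 95% confidence interval spans $\beta_i \pm 1.96\, s_i$ on the log scale, so that $s_i = (\ln U_i - \ln L_i) / (2 \times 1.96)$ with $L_i$ and $U_i$ the lower and upper limits. The function name and the example numbers are hypothetical, and the exact expression used by Beta-Meta may differ.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def or_to_beta_se(odds_ratio, ci_lower, ci_upper):
    """Convert an OR and its 95% CI into a log-odds effect size and its standard error."""
    beta = math.log(odds_ratio)
    # Assumption: the CI is symmetric on the log scale, beta +/- Z_975 * se.
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * Z_975)
    return beta, se


# Made-up example values, chosen so the interval is symmetric on the log scale.
beta, se = or_to_beta_se(1.5, 1.2, 1.875)
print(beta, se)
```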
When synthesizing datasets for meta-analysis, it is important to ensure uniformity in allele labels, and hence in the direction of the effect, because alleles are typically called on only one of the two DNA strands in sequencing experiments [39]. Beta-Meta automatically corrects the direction of the effect by using the dataset with the lowest p-value as a reference and aligning the other datasets to it. For example, when the effect and the non-effect alleles are inverted between independent studies (e.g., rs13405728 and rs1801133 as shown in Table 1), this can be resolved automatically by changing the sign of the normalized effect.
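The alignment described above can be sketched as follows. This is an assumption-laden illustration rather than the Beta-Meta implementation: the dictionary keys are hypothetical, and palindromic SNPs (A/T or C/G), whose strand cannot be inferred from the alleles alone, are simply treated as allele swaps here.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def align_to_reference(studies):
    """Align allele labels and effect signs to the study with the lowest p-value.

    Each study is a dict with the hypothetical keys
    'effect_allele', 'other_allele', 'beta', and 'p'.
    """
    ref = min(studies, key=lambda s: s["p"])
    ref_pair = (ref["effect_allele"], ref["other_allele"])
    aligned = []
    for s in studies:
        ea, oa, beta = s["effect_allele"], s["other_allele"], s["beta"]
        if (ea, oa) != ref_pair and (oa, ea) != ref_pair:
            # No match on this strand: try the complementary strand.
            ea, oa = COMPLEMENT[ea], COMPLEMENT[oa]
        if (oa, ea) == ref_pair:
            # Effect and non-effect alleles are inverted: swap them and flip the sign.
            ea, oa, beta = oa, ea, -beta
        elif (ea, oa) != ref_pair:
            raise ValueError("alleles cannot be matched to the reference study")
        aligned.append({**s, "effect_allele": ea, "other_allele": oa, "beta": beta})
    return aligned


# Made-up example: the first study has the lowest p-value and serves as the reference.
studies = [
    {"effect_allele": "A", "other_allele": "G", "beta": 0.30, "p": 1e-6},
    {"effect_allele": "G", "other_allele": "A", "beta": -0.20, "p": 0.01},  # inverted alleles
    {"effect_allele": "T", "other_allele": "C", "beta": 0.25, "p": 0.02},   # opposite strand
    {"effect_allele": "C", "other_allele": "T", "beta": -0.10, "p": 0.5},   # both
]
aligned = align_to_reference(studies)
```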
Heterogeneity analysis
In a meta-analysis, datasets generated by multiple groups using different methods are likely to exhibit some variability, also known as heterogeneity. Heterogeneity indicates that the observed effects in the datasets differ from each other more than would be expected from random error alone [40]. To check for heterogeneity, the weighted average of the effect sizes is calculated first.
Then, we calculate Cochran's Q statistic, $Q$, and Higgins' heterogeneity metric, $I^2$, for the heterogeneity test [6].
$I^2$ quantifies the degree of heterogeneity as a value between 0 and 100% [41]. As a greater value of $I^2$ indicates stronger heterogeneity, the weighted average of the effect sizes is calculated, based on $I^2$, using one of two models: the fixed effects model or the random effects model. The threshold value of $I^2$ for model selection is set to 50%.
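For reference, the heterogeneity test can be sketched with the standard textbook definitions, which are assumed here to match those used by Beta-Meta: inverse-variance weights $w_i = 1/s_i^2$, the weighted average $\bar{\beta} = \sum_i w_i \beta_i / \sum_i w_i$, $Q = \sum_i w_i (\beta_i - \bar{\beta})^2$, and $I^2 = \max\{0, (Q - (k-1))/Q\} \times 100\%$. The function name and example values are hypothetical.

```python
def heterogeneity(betas, ses):
    """Return Cochran's Q and Higgins' I2 (in percent) for one association."""
    weights = [1.0 / se**2 for se in ses]  # inverse-variance weights
    mean = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - mean) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1  # degrees of freedom, k - 1
    # I2 is truncated at 0 and defined as 0 when the effects are identical (Q = 0).
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2


# Two made-up studies with clearly different effects.
q, i2 = heterogeneity([0.1, 0.5], [0.1, 0.1])
print(q, i2)
```

With these two made-up studies, $I^2$ exceeds the 50% threshold, so the random effects model would be selected.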
Calculation of the weighted average of the effect sizes based on $I^2$
For $0 \le I^2 < 50\%$, we use the fixed effects model to calculate the weighted average of the effect sizes and its standard error [7].
For $50\% \le I^2 \le 100\%$, we use the random effects model [7,9].
The weights for the random effects model, $w_i^R$, additionally account for the between-study variance [7,9].
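The model selection can be sketched as below. The fixed effects part uses the standard inverse-variance formulas, with the standard error of the pooled effect equal to $1/\sqrt{\sum_i w_i}$. For the random effects part, the sketch assumes the commonly used weights $w_i^R = 1/(s_i^2 + \tau^2)$ with the DerSimonian-Laird estimate of the between-study variance $\tau^2$; that estimator is an assumption of this example, not something the text above specifies.

```python
import math


def pooled_effect(betas, ses, threshold=50.0):
    """Pool effect sizes, choosing the model from I2 (in percent)."""
    k = len(betas)
    w = [1.0 / se**2 for se in ses]
    sum_w = sum(w)
    beta_fixed = sum(wi * b for wi, b in zip(w, betas)) / sum_w
    q = sum(wi * (b - beta_fixed) ** 2 for wi, b in zip(w, betas))
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0

    if i2 < threshold:
        # Fixed effects model: inverse-variance weighted average.
        return {"model": "fixed", "beta": beta_fixed,
                "se": math.sqrt(1.0 / sum_w), "i2": i2}

    # Random effects model. Assumed estimator: DerSimonian-Laird tau^2.
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_r = [1.0 / (se**2 + tau2) for se in ses]
    sum_wr = sum(w_r)
    beta_random = sum(wi * b for wi, b in zip(w_r, betas)) / sum_wr
    return {"model": "random", "beta": beta_random,
            "se": math.sqrt(1.0 / sum_wr), "i2": i2, "tau2": tau2}


# Made-up inputs: one heterogeneous and one homogeneous pair of studies.
heterogeneous = pooled_effect([0.1, 0.5], [0.1, 0.1])
homogeneous = pooled_effect([0.20, 0.22], [0.1, 0.1])
print(heterogeneous["model"], homogeneous["model"])
```

Adding $\tau^2$ to every study's variance makes the weights more equal and widens the standard error, which is what guards against false positives under heterogeneity.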
Integrated p-value and the BH adjustment
The integrated p-value of the meta-analysis is obtained from the integrated Z-score, $Z$, through $\Phi$, the cumulative distribution function of the standard normal distribution [7].
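A minimal sketch of this step follows, assuming the usual two-sided test in which $Z$ is the pooled effect size divided by its standard error and $p = 2\Phi(-|Z|)$. The two-sided form and the function name are assumptions of this example.

```python
import math


def integrated_p_value(beta, se):
    """Return the integrated Z-score and its two-sided p-value."""
    z = beta / se
    # 2 * Phi(-|z|) equals erfc(|z| / sqrt(2)); erfc stays accurate for very small p-values.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Made-up pooled effect and standard error.
z, p = integrated_p_value(0.3, 0.2)
print(z, p)
```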
Finally, to reduce false-positive results, the integrated p-values are corrected by the BH adjustment method. The p-values of the SNPs are sorted in ascending order, $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$, and the adjusted p-values are obtained through the BH procedure [11], where $m$ is the number of different SNPs related to a specific phenotype and $j$ is the rank of a SNP's p-value in ascending order among the SNPs related to that phenotype.
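The BH adjustment can be sketched with its common step-up form, $p^{\mathrm{adj}}_{(j)} = \min_{l \ge j} \min\{1, m\, p_{(l)} / l\}$, which is assumed here to correspond to the adjustment applied per phenotype. The function name and the example p-values are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps the adjusted values at 1
    # Walk from the largest p-value down so each value is the minimum over ranks >= j.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up integrated p-values for four SNPs of one phenotype.
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

The running minimum keeps the adjusted values monotone in the rank, so a SNP with a smaller raw p-value never receives a larger adjusted p-value.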
Results
Using Beta-Meta, we performed a sample test integrating multiple studies of infertility and obtained a table containing all of the summary statistics described above (Supplementary Table 2) and a forest plot of the combined effect sizes (Supplementary Fig. 1). The conventional genome-wide significance p-value threshold of $5 \times 10^{-8}$ was used to identify significant SNP markers. Of the total 26 SNP-phenotype associations from the 23 studies we investigated (Supplementary Table 1), the only significant association was the one between rs10965232 and endometriosis from Uno et al. [14], with a p-value of $5.57 \times 10^{-12}$ (Table 1). After performing the meta-analysis, we found three more significantly associated SNPs: rs13405728, rs1801133, and rs10842262, as displayed in Table 2.
To check the accuracy of Beta-Meta, we compared its meta-analysis results (Supplementary Table 2) with those of METAL [8] (Supplementary Table 3), which is one of the most widely used meta-analysis packages but does not have a random effects option. The significantly associated SNPs identified by METAL and those found by Beta-Meta were the same, which supports the accuracy of the Beta-Meta calculations. In addition, Beta-Meta is convenient because it selects the appropriate model automatically based on heterogeneity when calculating the summary statistics.
Discussion
The Beta-Meta application can be used as an easy-to-use meta-analysis tool by researchers with a limited background in statistics. It allows them to manipulate and analyze their own datasets on a personal computer, as it is written in Python and can be run as an executable file in MS Windows.
As shown above, Beta-Meta increases the power to detect weak signals, identifying significant variants that were not significantly associated in single studies. Furthermore, it calculates the effect sizes and the p-values by selecting the appropriate model based on heterogeneity and applying the BH adjustment. These features can contribute to time-efficient handling of the growing number of aggregated GWAS, especially for those involved in the field of genetic testing. Because it is difficult to obtain a large number of datasets and validate genotype-phenotype associations experimentally within a limited budget, meta-analysis is still in demand for discovering SNP markers for genetic testing.
In conclusion, the application presented here provides a conventional and yet convenient way to conduct a meta-analysis of GWAS. Beta-Meta is expected to facilitate various research projects, such as the discovery of novel SNP markers, the calculation of polygenic risk scores, and the acquisition of biological insights into complex diseases and traits.
Notes
Authors’ Contribution
Conceptualization: WL. Data curation: GK. Formal analysis: GK. Methodology: WL, GK. Software: GK. Supervision: DK. Visualization: JHP, GK. Writing – original draft: YL, GK, WL. Writing – review & editing: DK.
Conflicts of Interest
No potential conflict of interest relevant to this article was reported.
Availability
Beta-Meta is written in Python 3.9.7 and is available at https://github.com/Gyungbu/beta_meta.
Acknowledgements
This work was supported by a National IT Industry Promotion Agency (NIPA) grant funded by the Korea government (MSIT) (No. S0252-21-1001, Development of AI Precision Medical Solution (Doctor Answer 2.0)).
Supplementary Materials
Supplementary data can be found with this article online at http://www.genominfo.org.