Genetic associations have been quantified using a number of statistical measures. Entropy-based mutual information may be one of the more direct ways of estimating the association, in the sense that it does not depend on the parametrization. For this purpose, both the entropy and conditional entropy of the phenotype distribution should be obtained. Quantitative traits, however, do not usually allow an exact evaluation of entropy. The estimation of entropy needs a probability density function, which can be approximated by kernel density estimation. We have investigated the proper sequence of procedures for combining the kernel density estimation and entropy estimation with a probability density function in order to calculate mutual information. Genotypes and their interactions were constructed to set the conditions for conditional entropy. Extensive simulation data created using three types of generating functions were analyzed using two different kernels as well as two types of multifactor dimensionality reduction and another probability density approximation method called m-spacing. The statistical power in terms of correct detection rates was compared. Using kernels was found to be most useful when the trait distributions were more complex than simple normal or gamma distributions. A full-scale genomic dataset was explored to identify associations using the 2-h oral glucose tolerance test results and γ-glutamyl transpeptidase levels as phenotypes. Clearly distinguishable single-nucleotide polymorphisms (SNPs) and interacting SNP pairs associated with these phenotypes were found and listed with empirical p-values.

Over the past decades, genetic association studies have been conducted to identify genetic variants associated with various traits or diseases [

Multifactor dimensionality reduction (MDR) has been successfully used as a genomic association measurement method [

Entropy-based methods of analyzing genomic associations have emerged as another stream of research [

Here, we propose another way of analyzing genomic associations for quantitative traits based on the kernel density estimation (KDE). KDE estimates a distribution function by summing kernels over the domain, or the observed data points. Kernels are designed to be normalized and non-negative functions, symmetric around each data point [

MI between the genotype and the quantitative phenotype is investigated to establish a genomic association. Measuring MI requires estimating the entropy and conditional entropy. To estimate them for a quantitative trait, the probability density function (pdf) needs to be estimated first. KDE has been adopted to estimate the pdf for distributions with or without a boundary effect. _{T}

When the probability density function,

MI is defined as the difference between the entropy of one set and that conditioned by the other set, where two sets are interchangeable. MI can quantify the association between two sets [

, where

To estimate the entropy in (_{i}

The estimation of entropy now becomes equivalent to the estimation of

Here _{j}

Some phenotype distributions have distinct boundaries. For example, let us examine the phenotype of γ-GTP levels, as shown in

Here, the kernel is symmetric in ln

Transforming back to _{j}

Several types of kernel functions have been proposed that satisfy the symmetric and non-negative conditions imposed for our purpose [

The indicator function, 1(..), is used. Meanwhile, the Gaussian kernel has about 5% lower efficiency, which means it requires 5% more data points to achieve the same error level as the Epanechnikov kernel. However, the Gaussian kernel is widely used because of its mathematical convenience. It has the form given below.

The Epanechnikov kernel is also more advantageous for computation due to its relative simplicity, which becomes an important factor with genomic data containing an extensive set of genotypes [

As can be seen in

The bandwidth in _{2}=2.34, 1.06 for the kernels in

Combining

For computing the conditional entropy, the phenotype set needs to be divided according to the corresponding genotypes, represented as {^{d}

We denote the

If statistical significance of the obtained MI is required, the p-value is estimated by random permutation of the trait values among samples to make the resultant dataset satisfy the null hypothesis. The maximum MI value of all genotype combinations from this dataset would form a single point of the null distribution of MI constructed by repeated random permutations. Counting the number of points in this null distribution that are larger than or equal to the observed MI would give the desired empirical p-value.

The application of KDE to a genomic association study and its performance were examined with a simulated dataset. Simulated data were generated based on the Velez models [_{ij}

Another was a gamma distribution, shown below.

It should be noted how the penetrance, _{ij}_{ij}

The high-risk term in _{ij}

In

KDE is also a non-parametric method, as is m-spacing. With a symmetric distributions in _{ij}

To examine the type I error rate, the same process used to build the simulation dataset was adopted to construct the null dataset, except that no causal pairs were intended. With the null dataset, the empirical p-value was evaluated by permuting the phenotype part 1,000 times. The p-value evaluation was repeated with the entire null dataset. Counting the number of instances in which the p-value obtained turned out to be smaller than the significance level, which was taken as 0.05, indicates the type I error.

A genome-wide dataset from the KARE project [

The γ-GTP distribution was found to be skewed enough to be regarded as the gamma distribution. Therefore, an analysis was performed by KDE with the boundary effect considered.

We investigated genomic associations with quantitative traits, including genomic interactions. Entropy-based MI can measure the association strength if the entropy of the trait could be estimated both by itself and as conditioned on the genotypes. We estimated entropy through KDE.

We explored and compared two types of kernel functions for KDE. The Epanechnikov kernel involves a far lower computational burden than the Gaussian kernel, but it was found to be as powerful as the Gaussian kernel for the genomic association task. There are several other kernels whose efficiencies lie between the Epanechnikov and Gaussian kernels, but under the non-negativity and symmetry constraint, their shapes are quite similar, especially in that their extents are limited by the indicator function, unlike the Gaussian kernel. Therefore, the two kernels investigated may lie at two extremes in terms of efficiency and how they are defined. Other kernels are expected to provide similar results.

When the dataset is made from a skewed distribution with a crowded boundary, using a symmetric kernel inherently leads to an extended tail outside the supported range. A consequence is an incorrect estimation of the association. The real data for γ-GTP, which we reported in the present analysis, may not be correctly analyzed with a usual symmetric kernel. We suggested defining a transformed argument in the kernel to confine the sum of the kernel functions within the supported range. Through these tactics, the hit ratios were found to be stable and superior to those from other methods.

The proposed method can be extended to multivariate phenotype traits, while m-spacing is intrinsically a univariate method. Multivariate traits should be the natural extension of this paper. When the real data are expected to be more complex, beyond a dichotomous classification, our method in this paper would therefore be a legitimate candidate. Phenotypes with more than one threshold can be found, one of which is the OGTT-2h phenotype analyzed here. Simultaneous associations of SNPs were found with the phenotypes that have been suggested to have OGTT-2h-related traits as a prognostic factor. Therefore, these SNP findings may provide additional evidence for the reported pathways. This might be a benefit of analyzing quantitative traits in their original form.

Conceptualization: MP. Data curation: TP. Formal analysis: JY. Funding acquisition: MP. Methodology: JY, TP, MP. Writing - original draft: JY. Writing - review & editing: JY, TP, MP.

Taesung Park serves as an editor of the Genomics and Informatics, but has no role in the decision to publish this article. All remaining authors have declared no conflicts of interest.

This research was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2021R1A2C1007788).

Flow charts for the kernel density estimation (KDE) (A) and the mutual information (MI) (B). Entropy, H, can be estimated from a dataset, {Xi}, sampled from a distribution of density f. Transformed kernel should be used when the boundary effect is not negligible. MI can be obtained by applying KDE to phenotype (P) and genotype (G) data and then combining the results.

Estimation of density for the distributions with boundary. Gamma distribution, which has boundary at 0, is shown (A). Histogram of γ-glutamyl transpeptidase (γ-GTP) that follows such distributions is plotted (B). Kernels in x-space would always estimate a tailed density outside the boundary as shown (C), which should lead to the estimation of entropy for the more diffused distribution. Corrected density (D) fits better without crossing the boundary.

Demonstration of the association strength of a simulated genomic data obtained by kernel density estimation. Length of the vertical line between the paired points of H(P) and H(P|G) represents the association strength measure by mutual information.

(A–C) Comparison of the hit ratios. Correct detection rates of the causal pair are compared with other method with respect to the heritability. Plots are separated by the generation schemes of the simulation data examined. KDE, kernel density estimation; QMDR, quantitative multifactor dimensionality reduction; GMDR, generalized multifactor dimensionality reduction.

Scree plots of the associations. Top associated main effects of a single nucleotide polymorphism (SNP) (A, B) and 2-order interacting SNPs (C, D), for the phenotypes of 2-hour oral glucose tolerance test (OGTT-2h) and γ-glutamyl transpeptidase (γ-GTP), respectively. MI, mutual information.

Type I error estimation with a significance level (

Type I error rate (%) | Normal | Gamma | Mixed |
---|---|---|---|

MAF | |||

0.4 | 5.0 | 5.3 | 4.9 |

0.2 | 4.7 | 5.7 | 5.0 |

Heritability | |||

0.4 | 4.8 | 4.8 | 5.3 |

0.3 | 4.9 | 4.9 | 4.6 |

0.2 | 4.6 | 5.6 | 5.0 |

0.1 | 4.7 | 5.0 | 5.1 |

0.05 | 5.0 | 5.7 | 4.9 |

0.02 | 4.7 | 5.8 | 5.2 |

0.01 | 5.0 | 5.8 | 4.6 |

Overall | 4.8 | 5.4 | 5.0 |

MAF, minor allele frequency.

Main effect found by KDE for OGTT-2h with KARE samples

Rs ID | Chromosome | MI | p-value | Reference |
---|---|---|---|---|

rs1559347 | 16 | 0.0069 | 2 × 10^{−5} |
- |

rs2055918 | 4 | 0.0066 | 3 × 10^{−5} |
- |

rs30500 | 5 | 0.0064 | 4 × 10^{−5} |
[ |

rs12983584 | 19 | 0.0062 | 4 × 10^{−5} |
- |

rs4338946 | 2 | 0.0061 | 4 × 10^{−5} |
- |

rs10968001 | 9 | 0.0059 | 4 × 10^{−5} |
- |

rs6919172 | 6 | 0.0058 | 4 × 10^{−5} |
- |

rs2227311 | 13 | 0.0057 | 4 × 10^{−5} |
[ |

rs7468639 | 9 | 0.0057 | 4 × 10^{−5} |
- |

rs16898812 | 5 | 0.0057 | 4 × 10^{−5} |
- |

rs3780603 | 9 | 0.0055 | 4 × 10^{−5} |
[ |

rs41417552 | 5 | 0.0055 | 4 × 10^{−5} |
[ |

KDE, kernel density estimation; OGTT-2h, 2-hour oral glucose tolerance test; KARE, Korean Association Resource; MI, mutual information.

Interactions found by KDE for OGTT-2h with KARE samples

Rs ID pair | Chromosome | MI | p-value |
---|---|---|---|

(rs30500, rs1559347) | (5,16) | 0.0153 | 1 × 10^{−5} |

(rs16898812, rs30500) | (5,5) | 0.0140 | 1 × 10^{−5} |

(rs2055918, rs30500) | (4,5) | 0.0138 | 1 × 10^{−5} |

KDE, kernel density estimation; OGTT-2h, 2-hour oral glucose tolerance test; KARE, Korean Association Resource; MI, mutual information.

Main effect found by KDE for γ-GTP with KARE samples

Rs ID | Chromosome | MI | p-value | Reference |
---|---|---|---|---|

rs6990124 | 8 | 0.0309 | 1 × 10^{−5} |
- |

rs2074356 | 12 | 0.0120 | 3 × 10^{−5} |
[ |

rs11066280 | 12 | 0.0117 | 4 × 10^{−5} |
[ |

rs4604857 | 11 | 0.0105 | 1.1 × 10^{−4} |
- |

rs16872439 | 8 | 0.0097 | 2.3 × 10^{−4} |
- |

rs9522473 | 13 | 0.0096 | 2.4 × 10^{−4} |
- |

rs2227311 | 13 | 0.0091 | 3.7 × 10^{−4} |
- |

rs16875527 | 4 | 0.0083 | 7.8 × 10^{−4} |
- |

rs663661 | 10 | 0.0081 | 8.7 × 10^{−4} |
- |

rs398182 | 22 | 0.0080 | 9.5 × 10^{−4} |
- |

rs12229654 | 12 | 0.0075 | 1.42 × 10^{−3} |
[ |

KDE, kernel density estimation; γ-GTP, γ-glutamyl transpeptidase; KARE, Korean Association Resource; MI, mutual information.

Interactions found by KDE for γ-GTP with KARE samples

rs ID pair | Chromosome | MI | p-value |
---|---|---|---|

(rs2211730, rs6990124) | (8,8) | 0.0392 | 1 × 10^{−5} |

(rs314743, rs6990124) | (5,8) | 0.0389 | 1 × 10^{−5} |

(rs6990124, rs11103291) | (8,9) | 0.0389 | 1 × 10^{−5} |

KDE, kernel density estimation; γ-GTP, γ-glutamyl transpeptidase; KARE, Korean Association Resource; MI, mutual information.