### Introduction

### Methods

$J_T$ with the kernel when the variable is transformed. Fig. 1B visualizes when to apply KDE to genomic data to obtain MI.

### Definition of entropy and MI

When the probability density function (pdf), $f(x)$, is known, the entropy, $H$, is defined in the integral form of the pdf as below, which is also called the differential entropy [22].
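For reference, the standard textbook form of the differential entropy is sketched below. The natural logarithm is an assumption (the base only changes the unit), and the Gaussian case is added here as a worked example rather than taken from the original text.

```latex
% Differential entropy of a density f (natural logarithm assumed)
H = -\int f(x)\,\ln f(x)\,dx

% Worked example: for a normal density f = N(\mu, \sigma^{2})
H = \tfrac{1}{2}\,\ln\!\left(2\pi e\,\sigma^{2}\right)
```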

$H(P)$ is the amount of information contained in the phenotype distribution [10]. The conditional entropy $H(P \mid G)$ measures the amount of information still necessary to describe the phenotype distribution when the genotype is known. Equivalently, it is the amount of information that the phenotype distribution does not share with the genotype. Therefore, MI in Eq. (2) quantifies the amount of information that the phenotype and genotype distributions share. The more information they share, the more strongly the genetic information contributes to the phenotype.
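The verbal description above matches the usual decomposition of MI into an entropy and a conditional entropy. A sketch in standard notation follows; the symbol $I(P;G)$ and the genotype frequency $p(g)$ are notational assumptions introduced here.

```latex
I(P;G) = H(P) - H(P \mid G), \qquad
H(P \mid G) = \sum_{g} p(g)\, H(P \mid G = g)
```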

### Entropy by KDE

$f(x)$ from the data [22]. Let $\{X_i\}$ denote the set of random samples drawn from a distribution with density $f$. Then, the entropy $H$ can be estimated as follows.
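A common sample-based estimator replaces the integral by the sample mean of $-\ln f(X_i)$. The minimal sketch below assumes that form and, for illustration only, uses a density that is known exactly (a standard normal), so the estimate can be compared with the analytic entropy. The function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def entropy_from_density(samples, pdf):
    """Sample-mean estimate of H = -E[ln f(X)] (assumed estimator form)."""
    return -sum(math.log(pdf(x)) for x in samples) / len(samples)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
h_est = entropy_from_density(samples, normal_pdf)
h_true = 0.5 * math.log(2.0 * math.pi * math.e)  # analytic entropy of N(0, 1)
print(f"estimated {h_est:.3f}, analytic {h_true:.3f}")
```

In practice the density is not known, which is why an estimate of $f$ has to be plugged in.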

$f$ (i.e., a pdf). For that purpose, KDE can be used to estimate $f$. A simple and known function $K$, called a kernel, may be defined around each data point and summed for the estimation of a pdf, as shown below [23].

$K(u)$ should be non-negative and symmetric for our purpose, while satisfying the normalization condition. Requiring the kernel to be normalized over the range of its argument also ensures that the estimated pdf is normalized [20]. It should be noted that at an arbitrary point $x$, the pdf is determined by the sum of $n$ individual kernel functions whose centers are at $x = X_j$. The width of a kernel function is controlled by the bandwidth $h$.
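The kernel sum described above can be sketched as follows, assuming the usual form $\hat{f}(x) = \frac{1}{nh}\sum_j K\!\left(\frac{x - X_j}{h}\right)$. The data values, the bandwidth, and the helper names are illustrative assumptions; the numerical integral only checks that normalized kernels give a normalized density estimate.

```python
import math

def gaussian_kernel(u):
    """Standard normal kernel: non-negative, symmetric, integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def epanechnikov_kernel(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(x, data, h, kernel=gaussian_kernel):
    """f_hat(x) = 1 / (n * h) * sum_j K((x - X_j) / h)."""
    return sum(kernel((x - xj) / h) for xj in data) / (len(data) * h)

def integrate(f, lo, hi, steps=4000):
    """Midpoint-rule integral of f over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * dx) for k in range(steps)) * dx

data = [-1.2, -0.4, 0.1, 0.3, 1.5]  # illustrative sample
h = 0.5                             # illustrative bandwidth
area_gauss = integrate(lambda x: kde(x, data, h, gaussian_kernel), -6.0, 6.0)
area_epan = integrate(lambda x: kde(x, data, h, epanechnikov_kernel), -6.0, 6.0)
print(f"area (Gaussian) = {area_gauss:.4f}, area (Epanechnikov) = {area_epan:.4f}")
```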

### Kernels for a distribution with a boundary

$x$ space, whose range is $(-\infty, \infty)$. The different Jacobians in Eqs. (4) and (5) should be noted; they follow directly from the normalization condition on the kernel function, which is defined as below.

$x$ space, the estimated pdf fits better, as shown in Fig. 2D. The kernels will not be symmetric in $x$ space, and their shape will depend on the data point $X_j$ around which each kernel is centered.
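The effect of a variable transformation on the kernels can be illustrated with a small sketch. It assumes, purely for illustration, a logarithmic transform $y = \ln x$ for data bounded below by zero; the transform used in Eqs. (4) and (5) is not reproduced in this text and may differ. The Jacobian $|dy/dx| = 1/x$ keeps the back-transformed estimate normalized, whereas an ordinary KDE places part of its mass below the boundary.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_plain(x, data, h):
    """Ordinary KDE in x space; kernels may spill below the boundary at 0."""
    return sum(gaussian_kernel((x - xj) / h) for xj in data) / (len(data) * h)

def kde_log(x, data, h):
    """KDE built in y = ln(x) (assumed transform), mapped back with Jacobian 1/x."""
    if x <= 0.0:
        return 0.0
    y = math.log(x)
    total = sum(gaussian_kernel((y - math.log(xj)) / h) for xj in data)
    return total / (len(data) * h * x)

def integrate(f, lo, hi, steps):
    """Midpoint-rule integral of f over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * dx) for k in range(steps)) * dx

data = [0.05, 0.1, 0.2, 0.4, 0.9, 1.7]  # illustrative positive-valued sample
# Probability mass that the ordinary estimate assigns to x < 0.
leak = integrate(lambda x: kde_plain(x, data, 0.3), -3.0, 0.0, 3000)
# Total mass of the transformed estimate on x > 0.
mass_log = integrate(lambda x: kde_log(x, data, 0.5), 0.0, 30.0, 60000)
print(f"mass below 0 (plain) = {leak:.3f}, total mass (log KDE) = {mass_log:.3f}")
```

Each back-transformed kernel is a log-normal curve in $x$ space, which is why the kernels are asymmetric there and depend on the data point at which they are centered.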

### Choice of the kernel function

### Determination of bandwidth

The bandwidth $h$ must be determined to form the arguments of the kernel functions. It also acts as a weight factor for the sum of kernels at each point. The value of the bandwidth can be derived by setting the derivative of the asymptotic mean integrated squared error with respect to the bandwidth to zero [26]. However, the resulting expression contains a derivative of the pdf, which is unknown. An accepted substitute uses the sample standard deviation, $\hat{\sigma}$, and a constant specific to the kernel used [26]. Its expression is as follows, where $n$ is the number of data points.
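A minimal sketch of such a rule follows, assuming the widely used normal-reference form $h = c\,\hat{\sigma}\,n^{-1/5}$ with $c \approx 1.06$ for the Gaussian kernel and $c \approx 2.34$ for the Epanechnikov kernel. These constants are standard rule-of-thumb values and are an assumption here; the source expression is not reproduced in this text.

```python
import statistics

# Assumed kernel-specific constants of the normal-reference rule of thumb.
KERNEL_CONSTANTS = {"gaussian": 1.06, "epanechnikov": 2.34}

def rule_of_thumb_bandwidth(data, kernel="gaussian"):
    """h = c * sigma_hat * n^(-1/5), with c depending on the kernel."""
    n = len(data)
    sigma_hat = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    return KERNEL_CONSTANTS[kernel] * sigma_hat * n ** (-1.0 / 5.0)

data = [2.1, 2.9, 3.4, 3.8, 4.4, 5.0, 5.3, 6.1]  # illustrative sample
h_gauss = rule_of_thumb_bandwidth(data, "gaussian")
h_epan = rule_of_thumb_bandwidth(data, "epanechnikov")
print(f"h (Gaussian) = {h_gauss:.3f}, h (Epanechnikov) = {h_epan:.3f}")
```

The Epanechnikov bandwidth is larger because that kernel has compact support, so a wider window is needed for comparable smoothing.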

### MI by entropy and conditional entropy

$P$, can be estimated as follows.

##### (10)

$\{P \mid G = g\}$. Let $g$ indicate each genotype and $d$ be used for the order of genomic interaction. Because each SNP has three different forms (AA, Aa, and aa), $d$-order interacting SNPs should have $3^d$ possible genotypes. The conditional entropy can now be obtained by summing the above KDE calculations on each subset, weighted by the subset size, as below.

##### (11)

$i$-th phenotype data in subset $\{P \mid G = g\}$ by
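The subset-weighted calculation described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a Gaussian kernel, a rule-of-thumb bandwidth $h = 1.06\,\hat{\sigma}\,n^{-1/5}$ computed separately in each subset, and toy data for a single SNP. All function names are hypothetical.

```python
import math
import random
import statistics
from collections import defaultdict

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_entropy(values):
    """Entropy estimate -(1/n) * sum_i ln f_hat(X_i) with a Gaussian KDE."""
    n = len(values)
    h = 1.06 * statistics.stdev(values) * n ** (-0.2)  # assumed rule-of-thumb bandwidth
    total = 0.0
    for xi in values:
        f_hat = sum(gaussian_kernel((xi - xj) / h) for xj in values) / (n * h)
        total += math.log(f_hat)
    return -total / n

def mutual_information(phenotypes, genotypes):
    """MI = H(P) - sum_g (n_g / n) * H(P | G = g)."""
    subsets = defaultdict(list)
    for p, g in zip(phenotypes, genotypes):
        subsets[g].append(p)
    n = len(phenotypes)
    h_cond = sum(len(s) / n * kde_entropy(s) for s in subsets.values())
    return kde_entropy(phenotypes) - h_cond

rng = random.Random(0)
genotypes = [i % 3 for i in range(300)]               # toy genotype codes 0, 1, 2
assoc = [rng.gauss(1.5 * g, 0.5) for g in genotypes]  # phenotype depends on genotype
null = [rng.gauss(0.0, 1.0) for _ in genotypes]       # phenotype independent of genotype
mi_assoc = mutual_information(assoc, genotypes)
mi_null = mutual_information(null, genotypes)
print(f"associated MI = {mi_assoc:.3f}, unassociated MI = {mi_null:.3f}")
```

For $d$ interacting SNPs, the same function can be called with a tuple of $d$ genotype codes as the grouping key, which yields up to $3^d$ subsets. Each subset needs at least two distinct values for the bandwidth to be defined.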

### Estimation of p-values

### Results

### Generation of simulation data

$t_{ij}$, were tabulated for each of the nine possible genotype combinations of two interacting SNPs, along with specified values of the minor allele frequency (MAF) and heritability. To generate quantitative values, we took the penetrance as the mean of the distribution from which the trait value was sampled. Three types of distributions were considered. The first type was a normal distribution, as given below.

$t_{ij}$, was used in the distribution functions above, while $\sigma$ remained a free parameter. When the penetrance, $t_{ij}$, was larger or smaller than the overall average value, the samples for genotype $ij$ were classified as high or low risk, respectively. To simulate various situations, three distinct values of $\sigma$ (0.8, 1.0, and 1.2) were assigned independently to the high- and low-risk subgroups, establishing nine different cases. To further investigate the trait distribution, a third type of trait value sampling was done from a mixed form as shown below, with $\alpha$ set to 0.2.

$t_{ij}$ was found to be larger than the overall average (i.e., a high-risk case). There were also nine combinations of $\sigma$. The number of SNPs was taken as 20, with a single causal pair, and the number of samples as 400. The Velez model has seven heritability values, each of which has five different penetrance tables for two different MAFs. For each of those 70 models, along with the nine $\sigma$ combinations, 100 simulations were conducted, yielding 70×9×100 files for each of the three distribution schemes. In all, for each of the seven heritability values, we considered simulated datasets generated from 10 models (five penetrance tables and two MAFs), with nine variations in high- and low-risk samplings from the three types of distributions.
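The normal-distribution scheme can be sketched as below. Several details are assumptions made for illustration: the penetrance table is hypothetical (not one of the Velez tables), genotypes of the two causal SNPs are drawn independently under Hardy-Weinberg proportions with the same MAF, and the "overall average" is taken as the genotype-frequency-weighted mean penetrance.

```python
import random

def genotype_probs(maf):
    """Hardy-Weinberg frequencies for genotype codes 0, 1, 2 (minor-allele counts)."""
    q = maf
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]

def simulate_traits(penetrance, maf, sigma_high, sigma_low, n, rng):
    """Draw n samples (i, j, trait) with trait ~ N(t_ij, sigma^2)."""
    probs = genotype_probs(maf)
    # Frequency-weighted mean penetrance, used as the high/low-risk threshold (assumption).
    mean_t = sum(probs[i] * probs[j] * penetrance[i][j]
                 for i in range(3) for j in range(3))
    rows = []
    for _ in range(n):
        i = rng.choices([0, 1, 2], weights=probs)[0]
        j = rng.choices([0, 1, 2], weights=probs)[0]
        t = penetrance[i][j]
        sigma = sigma_high if t > mean_t else sigma_low
        rows.append((i, j, rng.gauss(t, sigma)))
    return rows, mean_t

# Hypothetical penetrance table, NOT one of the tables used in the paper.
PENETRANCE = [[0.1, 0.6, 0.1],
              [0.6, 0.1, 0.6],
              [0.1, 0.6, 0.1]]

rng = random.Random(1)
rows, mean_t = simulate_traits(PENETRANCE, maf=0.4, sigma_high=0.8,
                               sigma_low=1.2, n=400, rng=rng)
print(f"threshold = {mean_t:.4f}, first sample = {rows[0]}")
```

Running the function over the nine (`sigma_high`, `sigma_low`) pairs drawn from (0.8, 1.0, 1.2) would reproduce the nine cases described above for one penetrance table.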

### Demonstration of MI

$H(P)$ and $H(P \mid G)$ values, as defined in Eq. (2). The MI for the associated pair was considerably larger than the MI for the unassociated pair, so the two were clearly distinguishable. The Epanechnikov kernel yielded an MI closer to the analytic result, which should be very close to the true value, than the Gaussian kernel did.

### Comparison of hit ratios

Because $t_{ij}$ in that equation is the penetrance, which should be smaller than 1, the resultant gamma distribution would have a shape parameter, $k$, smaller than the scale parameter, $\theta$, in most cases for the $\sigma$ values used. This condition results in a strongly skewed gamma distribution, as intended, giving rise to the boundary effect. KDE, as designed, performed consistently and better than m-spacing, QMDR, and GMDR, with the exception mentioned above, regardless of the distribution shape.
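The relation between the penetrance, $\sigma$, and the gamma parameters can be made concrete with a short sketch. It assumes the gamma distribution is parameterized by matching moments, with mean $k\theta = t_{ij}$ and variance $k\theta^2 = \sigma^2$, so that $k = t_{ij}^2/\sigma^2$ and $\theta = \sigma^2/t_{ij}$. That parameterization is an assumption, since the generating equation is not reproduced in this text, and the penetrance values are hypothetical. Under it, $k < \theta$ whenever $t_{ij}^3 < \sigma^4$, and the skewness is $2/\sqrt{k}$.

```python
import math
import random

def gamma_params(mean, sigma):
    """Shape k and scale theta with k * theta = mean and k * theta^2 = sigma^2."""
    k = (mean / sigma) ** 2
    theta = sigma ** 2 / mean
    return k, theta

# Hypothetical penetrance values below 1, combined with the sigma values from the text.
for sigma in (0.8, 1.0, 1.2):
    for t in (0.1, 0.3, 0.5):
        k, theta = gamma_params(t, sigma)
        skewness = 2.0 / math.sqrt(k)
        print(f"t={t}, sigma={sigma}: k={k:.3f}, theta={theta:.3f}, "
              f"k<theta={k < theta}, skewness={skewness:.1f}")

# Sampling check: random.gammavariate takes (shape, scale).
rng = random.Random(0)
k, theta = gamma_params(0.3, 1.0)
draws = [rng.gammavariate(k, theta) for _ in range(50000)]
sample_mean = sum(draws) / len(draws)
```

With a small shape parameter, most of the probability mass sits next to the boundary at zero, which is the situation where untransformed kernels lose mass across the boundary.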