
Various methods of frequent pattern mining have been applied to genetic problems, specifically, to the combined association of two genotypes (a genotype pattern, or diplotype) at different DNA variants with disease. These methods can select genotype patterns that are more common in affected than in unaffected individuals, and the assessment of statistical significance for these selected patterns poses some unique problems, which are briefly outlined here.

Frequent itemset (or pattern) mining (FPM) is now a well-established field with a rich literature and availability of software [

It has long been known that many human traits follow a Mendelian mode of inheritance and are passed from parent to offspring in a dominant or recessive manner [

It is well recognized that DNA contains the information necessary for the development and functioning of organisms, and that this information is encoded in the sequence of its building blocks, nucleotides. For example, transcription factors (TFs) are proteins that can modulate the activity of genes by binding to specific nucleotide sequences (binding sites) in or near them [

More recently, the field of FPM has been developing rapidly [

We previously exploited FPM methodology in genome-wide disease association testing [

Many methods have been developed to make use of FPM approaches for detecting pairs of genotypes underlying digenic traits, or longer patterns of genotypes influencing polygenic traits. Here, we focus on practical ways of establishing whether genotype patterns found by such methods are statistically reliable.

Most datasets contain errors and possibly missing values. Various approaches to handling the latter problem have been discussed [

Classical FIM methods are designed to find

Last but not least, sets of genotypes from DNA variants can be used for individual identification. For example, support vector machines (SVM) and random forest (RF) methods have been applied to mitochondrial DNA for identifying relatives of individuals who died in accidents [

A well-known example of individual identification is that of the Kennewick Man, a human skeleton discovered in the American state of Washington, dated to have lived 9,000 years ago [

A recent example of relationship identification refers to Sitting Bull, the legendary Lakota Sioux leader [

Among many FPM methods, multifactor dimensionality reduction (MDR) has been widely used for detecting epistasis [

Since the introduction of MDR, many extensions of it have been proposed. For categorical traits, a log-linear model-based MDR was proposed [

A multivariate version of MDR has been proposed to treat multiple phenotypes simultaneously. For example, obesity can be measured through body mass index, weight, and hip ratio. Multivariate generalized MDR (GEE-GMDR) can simultaneously address these obesity-related phenotypes by constructing generalized estimating equation models [

Classical statistical methods like discriminant analysis can classify data into two or more groups based on a possibly large number of input variables but they do this in a predictable manner, that is, with the aid of a clearly defined model. Machine learning methods improve their performance with experience [

One of the earliest applications of ANNs in human genetics was to find patterns of genetic loci that would discriminate between two phenotypes, affected versus unaffected with a complex genetic trait [

An SVM is another machine learning algorithm that is generally applied for separating data into two groups based on input data related to these groups. This is accomplished by the construction of a hyperplane that best separates the two groups [

In the RF method, multiple classification trees are grown, which then “vote” on the best classification or prediction. An interesting comparison between logistic regression and the RF method was carried out in a retrospective study on 505 children who were receiving chemotherapy and had developed febrile neutropenia [

Another, much larger study applied RF analysis to compare 56 risk/protective factors for depression in a sample of 67,603 European older adults [

Consider a number N of individuals, each genotyped at a possibly large number of DNA variants. For each variant, a given individual has two alleles numbered 1 and 2 (or 0 for unknown), which are conveniently translated into three genotypes numbered 1 = (1, 1), 2 = (1, 2), and 3 = (2, 2), where (i, j) denotes the genotype consisting of alleles i and j.
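The allele-to-genotype translation above can be sketched in a few lines; the function name `encode_genotype` and the use of 0 for a missing genotype are illustrative assumptions, not part of the original description:

```python
def encode_genotype(allele1: int, allele2: int) -> int:
    """Translate an allele pair into a genotype code.

    Alleles are numbered 1 and 2 (0 = unknown). The three genotypes are
    1 = (1, 1), 2 = (1, 2), and 3 = (2, 2); allele order is irrelevant.
    Returns 0 if either allele is unknown (an assumed convention).
    """
    if allele1 == 0 or allele2 == 0:
        return 0  # missing genotype
    # (1,1) -> 1, (1,2) or (2,1) -> 2, (2,2) -> 3
    return allele1 + allele2 - 1

print(encode_genotype(1, 2))  # heterozygous genotype -> 2
```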

Any method for finding genotype patterns associated with disease will furnish a list of patterns, each with observed values for support, confidence, and chi-square, together with its associated nominal empirical significance level, that is, the probability of obtaining a chi-square as large as the observed one or larger just by chance, assuming no association between X and Y. We generally want to find results with an associated very small p-value.
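For a single pattern, the chi-square statistic and its nominal significance level can be obtained from the 2 × 2 table of case/control versus presence/absence of the pattern. A minimal sketch (function and argument names are assumptions; the closed form for the 1-df tail probability uses erfc):

```python
import math

def chi2_2x2(a_case, a_ctrl, n_case, n_ctrl):
    """Chi-square (1 df) for a 2x2 table: pattern present/absent vs. case/control.

    a_case, a_ctrl: numbers of cases and controls carrying the pattern;
    n_case, n_ctrl: total numbers of cases and controls.
    Returns (chi-square, nominal p-value).
    """
    table = [[a_case, n_case - a_case],
             [a_ctrl, n_ctrl - a_ctrl]]
    n = n_case + n_ctrl
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 df, the upper-tail probability equals erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p
```

For example, with 20 of 30 cases but only 10 of 30 controls carrying a pattern, `chi2_2x2(20, 10, 30, 30)` yields a chi-square of about 6.67 and a nominal p-value near 0.01.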

Here we want to shed light on questions on multiple testing in genotype pattern mining (GPM) for case-control association analysis of digenic traits. For statistical details, the reader is referred to published reviews [

Assume now that we have obtained a number, m, of p-values, p_i, one for each selected genotype pattern.

The Bonferroni method controls the so-called family-wise error rate (FWER) by declaring significant those results with p_i < α/m or, equivalently, by working with adjusted p-values p_i^B = min(m p_i, 1) that are valid for any dependence among p-values, although the Bonferroni correction tends to be conservative for strongly correlated p-values [
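The Bonferroni rule just described is straightforward to implement; the following is a minimal sketch (the function name is an assumption):

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni correction for m = len(pvals) tests.

    Returns the adjusted p-values p_i^B = min(m * p_i, 1) and the indices
    of results declared significant at FWER level alpha, i.e., p_i < alpha/m.
    """
    m = len(pvals)
    adjusted = [min(m * p, 1.0) for p in pvals]
    significant = [i for i, p in enumerate(pvals) if p < alpha / m]
    return adjusted, significant

adj, sig = bonferroni([0.0001, 0.004, 0.03, 0.2])
print(sig)  # only the two smallest p-values survive alpha/m = 0.0125
```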

Researchers often search for patterns with high confidence, and such patterns are highly likely to furnish large chi-squares. In other words, results are biased in favor of large chi-square values. The best remedy is to relax the selection procedure as follows. Observed confidence without any disease association is given by the proportion of cases in the data, that is, by c_0 = n_2/(n_1 + n_2), where n_1 and n_2 are the numbers of controls and cases, respectively. Thus, our search criteria should impose a minimum confidence of c_0 rather than the usual 80% or 90%, which will guarantee that the full range of p-values from 0 through 1 will be exhibited. In practice, this will often lead to very large numbers of patterns, which is the price we pay for applying a Bonferroni-type multiple-testing correction. As will be seen in our practical examples, we may not need to apply a minimum confidence of c_0 rigorously, and working with the Bonferroni correction turns out to be rather reasonable.

Various approaches have been taken to improve on the basic Bonferroni approach. For highly interrelated tests, an early suggestion by Tukey [

At this point it is worth mentioning that the practice of working with a minimum confidence will lead to one-sided statistical tests, that is, we only consider patterns more common in cases than controls. We may reverse the pattern search by looking for patterns more common in controls than cases, or apply other criteria for selecting patterns. But here we focus on the common practice of restricting patterns to those with high confidence.

As we have seen, for the Bonferroni method to provide valid results, we should consider the whole range of p-values from small to large, effectively comparing frequencies of small and large p-values. For example, without any effect of genotype patterns on disease (the null hypothesis), the proportion of patterns with p < 0.10 should be the same as that of patterns with p > 0.90, namely, 10%. Another type of comparison may be performed by creating null data on the computer, that is, data known not to contain disease association, and comparing these with the observed data. Null data may be obtained by randomly scrambling the labels “case” and “control”, which clearly removes any association between genotypes and phenotypes, whereas the genotype data are left untouched. A large number, Np, of permutation datasets is created on the computer, with each being subjected to the same analysis as the observed data, and the largest test statistic, t_max (here, chi-square), obtained for each such null dataset is recorded. Then, for each chi-square obtained in the observed data, the proportion of t_max values larger than or equal to the observed chi-square is an estimate of the associated multiplicity-corrected p-value.
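The label-scrambling scheme with a per-permutation maximum statistic can be sketched as follows; the function and variable names (`max_t_permutation`, `pattern_carriers`) are illustrative assumptions, not the authors' implementation:

```python
import math
import random

def chi2_stat(a_case, a_ctrl, n_case, n_ctrl):
    """Chi-square (1 df) for pattern present/absent vs. case/control."""
    n = n_case + n_ctrl
    carriers = a_case + a_ctrl
    chi2 = 0.0
    for obs, r, c in [(a_case, n_case, carriers),
                      (n_case - a_case, n_case, n - carriers),
                      (a_ctrl, n_ctrl, carriers),
                      (n_ctrl - a_ctrl, n_ctrl, n - carriers)]:
        expected = r * c / n
        chi2 += (obs - expected) ** 2 / expected
    return chi2

def max_t_permutation(labels, pattern_carriers, n_perm=1000, seed=7):
    """Multiplicity-corrected permutation p-values (max-statistic scheme).

    labels: 1 (case) / 0 (control) per individual.
    pattern_carriers: one set of individual indices per genotype pattern.
    For each of n_perm permutations the case/control labels are scrambled,
    the largest chi-square over all patterns is recorded, and each observed
    chi-square is compared against this null distribution.
    """
    rng = random.Random(seed)
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case

    def all_chi2(lab):
        stats = []
        for carr in pattern_carriers:
            a_case = sum(lab[i] for i in carr)
            stats.append(chi2_stat(a_case, len(carr) - a_case, n_case, n_ctrl))
        return stats

    observed = all_chi2(labels)
    null_max = []
    perm = labels[:]
    for _ in range(n_perm):
        rng.shuffle(perm)  # scramble "case"/"control" labels
        null_max.append(max(all_chi2(perm)))
    return [sum(m >= obs for m in null_max) / n_perm for obs in observed]
```

With 1,000 permutations, the smallest achievable p-value is 0.001, which is why the number of replicates limits the resolution of permutation analysis.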

The main advantage of permutation testing is that a potentially small number of highly selected genotype patterns (high values for minimum confidence) may be considered while their permutation-based p-values remain valid.

The methods discussed so far control the so-called FWER. An alternative approach controls the false discovery rate (FDR), that is, the expected proportion of false discoveries among the patterns declared discovered. The p-values are ranked from small (p_(1)) to large (p_(m)), and the largest rank r with p_(r) < (r/m)α is determined; the patterns corresponding to p_(1) through p_(r) are considered discovered [
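The step-up procedure just described can be sketched as follows, with the Benjamini-Yekutieli adjustment (dividing α by the harmonic sum, valid under arbitrary dependence) available as an option; the function name is an assumption:

```python
def fdr_threshold(pvals, alpha=0.05, dependent=False):
    """Benjamini-Hochberg step-up FDR procedure.

    Ranks the p-values, finds the largest rank r with p_(r) <= (r/m)*alpha,
    and returns the indices of the r 'discovered' patterns. With
    dependent=True, alpha is divided by sum(1/i, i=1..m)
    (Benjamini-Yekutieli, FDR-BY), valid for any dependence structure.
    """
    m = len(pvals)
    if dependent:
        alpha /= sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i])
    r = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            r = rank  # step-up: keep the largest qualifying rank
    return sorted(order[:r])

pv = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 1.0]
print(fdr_threshold(pv))                  # BH discoveries
print(fdr_threshold(pv, dependent=True))  # stricter FDR-BY discoveries
```

Note how FDR-BY is more conservative than plain Benjamini-Hochberg on the same p-values, mirroring the FDR-BY column of the results table below.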

The idea underlying the FDR has been formulated as a partition test [

For some datasets, the total number of variants may be too large for pattern mining approaches. Some authors then select a subset of variants, often those

As a case in point, we apply our GPM approach [

GPM analysis was carried out for digenic genotype patterns with a minimum support of 40. The “null” confidence was 96/223 = 43% but, to keep the number of patterns at a manageable level, we worked with a minimum confidence of 60%, which resulted in a total of 18,044,794 genotype patterns. Permutation analysis was carried out with 1,000 replicates.

Results for the Bonferroni correction are comparable to those for FDR-BY and furnished more patterns than permutation analysis. Thus, at least on the basis of this example, the Bonferroni correction is a reasonable way to go although, to be valid, it requires a rather low minimum confidence with a resulting large number of patterns. Despite the rather large total number of genotype patterns and the resulting strong penalty, the Bonferroni correction exhibited multiple significant genotype patterns. Permutation analysis provided a much smaller number of significant results. However, this need not hold in general; depending on the number of replicates, permutation analysis may provide somewhat different results.

Pairwise analyses may be carried out at the level of genotype, variant [

As seen in the results displayed in

Conceptualization: JO, TP. Data curation: JO. Formal analysis: JO. Funding acquisition: TP. Methodology: JO. Writing - original draft: JO. Writing - review & editing: TP.

Taesung Park serves as an editor of Genomics and Informatics but had no role in the decision to publish this article. Jurg Ott has declared no conflicts of interest.

The work of TP was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2022R1A2C1092497). The authors thank Boram Kim for editing services.

Layout of individuals for a given genotype pattern, X

| Phenotype, Y | With X | Without X | Sum |
|---|---|---|---|
| Affected, “case” | a_2 | n_2 − a_2 | n_2 |
| Unaffected, “control” | a_1 | n_1 − a_1 | n_1 |

Observed support is given by a_1 + a_2, and observed confidence by a_2/(a_1 + a_2).
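Assuming a_2 cases and a_1 controls carry pattern X (notation as in the table), support and confidence follow directly; a minimal sketch, with the function name as an assumption:

```python
def support_confidence(a_case, a_ctrl):
    """Support and confidence of a genotype pattern X.

    a_case (a_2): cases carrying X; a_ctrl (a_1): controls carrying X.
    Support is the number of carriers; confidence is the proportion of
    cases among carriers (0.0 if no one carries X, an assumed convention).
    """
    support = a_case + a_ctrl
    confidence = a_case / support if support else 0.0
    return support, confidence

print(support_confidence(30, 10))  # 40 carriers, confidence 0.75
```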

Number of significant (permutation, Bonferroni) or detected (FDR) genotype patterns for given overall significance level, α

| α | Permutation analysis | Bonferroni correction | FDR-BY |
|---|---|---|---|
| 0.001 | 0 | 2 | 2 |
| 0.01 | 0 | 9 | 3 |
| 0.02 | 2 | 11 | 11 |
| 0.03 | 2 | 16 | 13 |
| 0.04 | 2 | 18 | 61 |
| 0.05 | 2 | 19 | 11,905 |

FDR-BY refers to the false discovery rate by the Benjamini-Yekutieli method; the result for α = 0.05 seems unreasonable.