### Introduction

In case-control data, the two class labels are *cases* and *controls*. A straightforward implementation considers class labels just as additional items in the database; in a resulting association rule, *R*: *x* → *y*, the right-hand side, *y*, would then represent the class label while *x* is a pattern of interest [13]. The *confidence* of *R* is given by the conditional probability, *P*(*y*|*x*), and its support is defined as *P*(*x*), that is, the expected proportion of the *x* pattern in the database.

Based on the well-known *Apriori* algorithm [14,15], FPM seeks to find all patterns (sets) of items occurring with a minimum frequency (*support*) in a possibly large dataset. For example, supermarkets keep databases of items purchased by consumers. A set of items, X = “bread – butter – milk”, represents such a pattern of groceries, and the number of individuals purchasing X is referred to as the support for X. Also, given a pattern, X, it is of interest to find an item (or another pattern), Y, with a high conditional probability of occurrence. That is, the *confidence*, *k* = P(Y|X), indicates the likelihood that consumers purchasing X will also purchase Y. In many FPM implementations [1], searches are restricted to patterns with given minimum support and minimum confidence.

In case-control data, the confidence of a pattern X becomes *k* = P(Y = “case” | X). Clearly, the higher the confidence, the more predictive a pattern is for disease. If a pattern occurs only in cases and not in controls, then the observed confidence is equal to 100%.
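To make the support and confidence definitions concrete, here is a minimal sketch in Python; the data and function names are illustrative, not from the article:

```python
# Hypothetical illustration: support of a pattern and its confidence
# k = P(label | pattern) in a small case-control "transaction" database.

def support(pattern, transactions):
    """Proportion of transactions containing every item of the pattern."""
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)

def confidence(pattern, label, transactions, labels):
    """Among transactions containing the pattern, the proportion carrying
    the given class label (assumes the pattern occurs at least once)."""
    containing = [lab for t, lab in zip(transactions, labels) if pattern <= t]
    return sum(1 for lab in containing if lab == label) / len(containing)

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "butter", "milk"},
]
labels = ["case", "case", "control", "control"]

X = {"bread", "butter"}
s = support(X, transactions)                      # 3 of 4 transactions → 0.75
k = confidence(X, "case", transactions, labels)   # 2 of 3 carriers are cases
```

A full miner would enumerate candidate patterns Apriori-style and keep only those meeting the minimum support and confidence; the two functions above are the filter it would apply.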

### Selected Special Aspects of FPM

Data errors have motivated the development of *approximate frequent itemsets* (AFIs) [19], which obviate the need for deleting data but are computationally more demanding than frequent itemset mining (FIM) methods in error-free data. Newer methods for mining AFIs have been developed and show excellent computational properties [20].

By definition, FPM methods find *frequent* patterns. However, patterns with properties other than being frequent may be more interesting. Various measures of interestingness have been discussed [21] and two will be mentioned here. (1) One specific property of patterns is their statistical significance: in general, it will be of interest to know whether a frequent pattern may be frequent by chance or whether there is more to it than randomness. Thus, mining significant patterns is of importance [22]. This is often achieved with some form of permutation analysis [23-25]. (2) In microarray data, mining frequent gene regulation patterns is an important task, but resulting patterns should show high utility; a corresponding utility model has been proposed that considers both the gene-disease association and the degree of expression level [26]. A survey of high utility itemset mining has recently been published [27].

### Multifactor Dimensionality Reduction

### Machine Learning Approaches

### Statistical Significance and Discovery

A genotype (*i*, *j*) refers to the set of two alleles. For two variants, possibly located on different chromosomes, there will be nine possible genotype patterns (pairs of genotypes). For example, the pattern (3, 2) refers to genotype 3 at variant 1 and genotype 2 at variant 2.

Genotype patterns are mined subject to a given minimum support *s* and confidence *k*. For a given pattern, X, the numbers of individuals may be displayed in a 2 × 2 table as shown in Table 1. A common statistic to measure association of X and Y is Pearson’s chi-square, *X*^{2}, and its associated nominal empirical significance level, *p*, where *p* is the probability that chi-square is as large as *X*^{2} or larger just by chance, that is, assuming no association between X and Y. We generally want to find results with an associated very small *p*-value such that we are confident that the result could not have been obtained by chance alone and is in fact due to an effect of X on disease.
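As a minimal sketch, the chi-square statistic and its nominal 1-df p-value can be computed directly from the four cell counts; the layout (cases/controls in columns, pattern present/absent in rows) is assumed here, not taken from Table 1 itself:

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square X^2 and its nominal p-value for a 2 x 2 table with
    a = pattern & case,    b = pattern & control,
    c = no pattern & case, d = no pattern & control."""
    n = a + b + c + d
    x2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 df:
    # P(X^2 >= x2) = erfc(sqrt(x2 / 2))
    p = math.erfc(math.sqrt(x2 / 2.0))
    return x2, p

x2, p = chi_square_2x2(30, 10, 20, 40)   # hypothetical counts
```

For larger tables or continuity corrections one would normally reach for a statistics library, but the closed form above suffices for the 2 × 2 case considered here.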

### Bonferroni Correction

Assume we have obtained a number, *m*, of chi-square results and associated nominal (raw) p-values, *p*_{i}, *i* = 1, …, *m*, each obtained for a suitable genotype pattern (support and confidence larger than specified minimum values), where *m* can range from a few dozen to many millions. Because they are based on pairs of variants, with a given variant possibly participating in multiple pairs, these p-values are likely correlated.

The Bonferroni correction declares *p*-values significant only if *p*_{i} < α/*m*, where α is the overall rate of false positive results. This result can be rephrased in terms of Bonferroni-adjusted p-values, *p*_{i}^{B} = min(*m* × *p*_{i}, 1), which are valid for any dependence among p-values, although the Bonferroni correction tends to be conservative for strongly correlated p-values [61].
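The adjustment is a one-liner; this sketch (illustrative p-values, hypothetical function name) applies *p*_{i}^{B} = min(*m* × *p*_{i}, 1) and flags results below α:

```python
def bonferroni_adjust(pvalues, alpha=0.05):
    """Bonferroni-adjusted p-values p_i^B = min(m * p_i, 1); a result is
    declared significant if its adjusted p-value falls below alpha."""
    m = len(pvalues)
    adjusted = [min(m * p, 1.0) for p in pvalues]
    significant = [p_b < alpha for p_b in adjusted]
    return adjusted, significant

adj, sig = bonferroni_adjust([0.001, 0.01, 0.04, 0.2])
```

With *m* = 4 raw p-values, the first two survive the correction at α = 0.05 while the last two do not.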

In the absence of association, the expected confidence of a pattern is *k*_{0} = *N*_{2}/(*N*_{1} + *N*_{2}), where *N*_{1} and *N*_{2} are the numbers of controls and cases, respectively. Thus, our search criteria should impose a minimum confidence of *k*_{0} rather than the usual 80% or 90%, which will guarantee that the full range of p-values from 0 through 1 will be exhibited. In practice, this will often lead to very large numbers of patterns, which is the price we pay for applying a Bonferroni-type multiple-testing correction. As will be seen in our practical examples, we may not need to rigorously apply a minimum confidence of *k*_{0}, and working with the Bonferroni correction turns out to be rather reasonable.

It has been proposed to use *m*´ = √*m* instead of *m* in the Bonferroni correction. Depending on the specific data analyzed, it may be possible to identify some patterns as being unobservable, which can lead to a number *m*´ of tests considerably smaller than *m*; that is, the multiple-testing burden is alleviated to some degree [64]. A method directly estimating the correlation between tests, and thus deriving a number *m*´ of effectively independent tests, was developed some 20 years ago [65] but, despite its elegance, it does not seem to have been applied very often.

### Permutation Testing

In permutation testing, case and control labels are randomly permuted many times; each permutation furnishes a null dataset in which any association between genotype patterns and disease has been destroyed. The *largest* test statistic, *T*_{i} (here, chi-square), obtained for each such null dataset is recorded. Then, for each chi-square obtained in the observed data, the proportion of *T*_{i} values larger than or equal to the observed chi-square is an estimate of the *p*-value associated with that observed chi-square [66]. Permutation testing has been carried out in many areas of research [67]; specific improvements in the data mining area have recently been described [24].

An advantage of this approach is that the correlation structure among tests is automatically reflected in the *p*-values obtained. On the other hand, for each null permutation, the whole process of searching for genotype patterns must be repeated, and this should be done at least 1,000 times.
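The max-statistic scheme described above can be sketched as follows; the pattern-membership encoding, function names, and toy data are assumptions for illustration, not the article's implementation:

```python
import random

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2 x 2 table of counts (0 if a margin is empty)."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def max_t_pvalues(membership, labels, n_perm=1000, seed=1):
    """Permutation p-values based on the largest chi-square per null dataset.

    membership[j][i] -- True if individual i carries pattern j
    labels[i]        -- True for a case, False for a control
    """
    def stats(labs):
        out = []
        for row in membership:
            a = sum(1 for m, l in zip(row, labs) if m and l)          # pattern, case
            b = sum(1 for m, l in zip(row, labs) if m and not l)      # pattern, control
            c = sum(1 for m, l in zip(row, labs) if not m and l)      # no pattern, case
            d = sum(1 for m, l in zip(row, labs) if not m and not l)  # no pattern, control
            out.append(chi2_2x2(a, b, c, d))
        return out

    observed = stats(labels)
    rng = random.Random(seed)
    labs = list(labels)
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(labs)                  # null dataset: labels decoupled from patterns
        max_null.append(max(stats(labs)))  # record only the largest statistic
    # proportion of null maxima at least as large as each observed chi-square
    return [sum(t >= obs for t in max_null) / n_perm for obs in observed]

labels = [True] * 20 + [False] * 20        # 20 cases, 20 controls
strong = list(labels)                      # pattern carried by every case, no control
neutral = [i % 2 == 0 for i in range(40)]  # pattern unrelated to disease status
pvals = max_t_pvalues([strong, neutral], labels, n_perm=500)
```

Because the maximum is taken over all patterns in each null dataset, the resulting p-values are adjusted for multiple testing while implicitly honoring the correlation among patterns.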

### False Discovery Rate

Results whose multiplicity-adjusted p-values fall below α are called *statistically significant*. A different approach to this situation is to focus on significant results and ask, what proportion of them is false? This proportion is called the false discovery rate (FDR) [68], and results with FDR smaller than some limit like 0.01 are called *discovered*. Several methods for determining the FDR have been described; the most reliable, and most conservative, of these is the Benjamini-Yekutieli method [69]. As with the Bonferroni correction, we need to consider the whole range of p-values, that is, a potentially very large number, *m*, of genotype patterns with a minimum confidence of *k*_{0} = *N*_{2}/(*N*_{1} + *N*_{2}). The p-values are ranked from small (*p*_{1}) to large (*p*_{m}), and the largest rank, *r*, with *p*_{r} < *r* × α/[*m* × c(*m*)] is determined, where c(*m*) = 1 + 1/2 + 1/3 + … + 1/*m* is the *m*-th partial sum of the harmonic series. All p-values less than or equal to *p*_{r} are considered discovered [61].

Note that, in general, if a *p*-value is significant then it is also discovered [68].
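The Benjamini-Yekutieli rule can be implemented in a few lines; this is a sketch with an illustrative function name and example p-values:

```python
def benjamini_yekutieli(pvalues, alpha=0.01):
    """Indices of 'discovered' p-values under the Benjamini-Yekutieli rule:
    rank the p-values, find the largest rank r with
    p_(r) < r * alpha / (m * c(m)), c(m) = 1 + 1/2 + ... + 1/m,
    and discover the hypotheses at ranks 1..r."""
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))        # harmonic sum c(m)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    r = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] < rank * alpha / (m * c_m):
            r = rank                                    # keep the largest qualifying rank
    return sorted(order[:r])

discovered = benjamini_yekutieli([0.0001, 0.002, 0.03, 0.5], alpha=0.05)
```

Note the step-up logic: the loop keeps the *largest* qualifying rank, so a p-value above its own threshold can still be discovered if a larger p-value qualifies at a higher rank.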

### Results

A common strategy has been to restrict pattern searches to the *N* variants with the largest single-locus disease association [72]. However, this practice is fallacious, as a strong disease association of a variant has the effect that it will show up in many genotype patterns. In other words, a genotype pattern may show a strong disease association mostly because of main effects of one or the other of the two participating variants rather than an interaction effect [17]. If the aim is to uncover interaction effects, the recommended approach is to first *remove* individually significant variants and only then proceed to genotype pattern analysis.