### Introduction

In human genetic association studies with high-dimensional genomic data, a multiple group testing procedure is often required to identify genes or genetic regions that are associated with a disease or a trait since a gene or a genetic region usually contains multiple genetic sites or variants. For instance, single nucleotide polymorphism data, DNA methylation data and sequencing data consist of tens of thousands of genes where each gene has multiple genetic sites. In order to identify genes or genetic regions associated with a phenotype outcome, we need to conduct an individual group test for each gene. However, an individual test for high-dimensional genomic data suffers from multiple testing issues such as the control of family-wise error rate (FWER) or dependent tests. In this reason, Bonferroni adjustment or false discovery rate (FDR) control methods [1,2,3] should be performed after computing the p-value of a multiple group test for individual genes or genetic regions.

Alternatively, regularization procedures using a penalized likelihood can be applied for analysis of high-dimensional genomic data. Basically, regularization procedures perform variable selection based on a parametric regression with a penalty function, where a phenotype outcome regresses on all of genetic sites. As a tuning parameter for sparsity is decreasing, the most outcome-related genetic sites can be sequentially selected. One of the most popular regularization procedures for high-dimensional genomic data is lasso (least absolute shrinkage and selection operator) [4,5,6]. For group selection such as a gene or a genetic region, group lasso can be applied to high-dimensional genomic data that has a group or a cluster structure [7,8]. Since regularization procedures do not test but select a gene or a genetic region associated with a phenotype outcome, the control of FWER or FDR is not required. However, the selection of the optimal tuning parameter is crucial to determine the number of the outcome-related genes or genetic regions.

The main goal of both the individual group test and the group selection procedure is to identify disease/trait-related genes or genetic regions in analysis of high-dimensional genomic data. Although these two different statistical methods have the same goal, there have been rarely statistical literatures which compare either test performance or selection performance of two statistical methods since hypothesis testing and variable selection have been considered as completely different approaches in statistics. In genetic association studies, however, the testing procedure essentially determines whether each gene or each genetic region is significantly associated with a phenotype outcome or not, while variable selection makes a conclusion which genes or genetic regions are associated with a phenotype outcome. Therefore, we can directly compare the true positives of the group test procedure and the group selection procedure when they identify the same number of disease/trait-related genes or genetic regions.

In this article, we conduct extensive simulation studies in order to compare the performance of both group testing procedures and group selection procedure in terms of true positives. That is, the total number of correctly identified genes or genetic regions is compared when the same number of genes or genetic regions is detected. The simulation studies focus on a case-control association study with high-dimensional genomic data which has a group or a cluster structure. For group testing procedures, we consider commonly used three methods such as principal component analysis (PCA), Hotelling's T

^{2}test, and permutation test. For group selection procedure, we employ group lasso. These four statistical methods are also applied to real high-dimensional DNA methylation data where DNA methylation beta values of CpG sites from approximately 12,000 genes between ovarian cancer cases and healthy controls were generated from Illumina Infinium Human-Methylation27K Beadchip (Illumina Inc., San Diego, CA, USA).### Methods

### Principal component analysis

PCA is one of the most common statistical approaches to data dimension reduction [9]. It basically transforms multiple variables to have orthogonality so the first principal component can be expressed as a weighted linear combination of variables. The weights of the first component are computed such that the component can account for the greatest possible variance of multiple variables. In case-control association studies of genomic data with a group structure, we can apply PCA for each gene or genetic region, where data information of multiple genetic sites within the same gene or genetic region can be reduced to a single numerical vector corresponding to the first principal component. Then, a phenotypic association of the principal component can be tested based on the independent two sample T-test. Statistical approaches based on PCA have been widely applied to analysis of high-dimensional genomic data [10,11,12].

### Hotelling's T^{2} test

Hotelling's T

^{2}test is one of the representative multivariate tests used when significant differences between the mean vectors of two multivariate data sets are tested. It is known as a powerful multivariate test as long as data satisfies necessary assumptions such as random samples, multivariate normality and equivalent variance and covariance matrices between two groups. In genetic association studies with microarray data, the Hotelling's T^{2}test is often applied to find differentially expressed genes [13,14]. We also applied the Hotelling's T^{2}test for each gene or genetic region to test a significant difference between cases and controls. We employed an R package ‘Hotelling’ for simulation studies and real data analysis.### Permutation test

Permutation test is a nonparametric statistical test used when an underlying distribution of genetic data is not need to be assumed. If the derivation of a theoretical distribution of a test statistic is challenging, permutation test can be employed to compute an empirical p-value of the test statistic. For genomic data with a group structure, we first compute an individual p-value of a phenotypic association test for each genetic site. We then combine – ∑ i = 1 K log p i , where – ∑ i = 1 K log p i for each permutation set. After we obtain the empirical distribution of – ∑ i = 1 K log p i , we can easily calculate the empirical p-value from the original case and control set. This permutation based test for genomic data with a group structure has been demonstrated to be very efficient and powerful when we need to aggregate the information of multiple genetic sites [15].

*K*p-values such that*p*is the p-value of the the_{i}*i*-th genetic site, and*K*is the total number of genetic sites in the gene or genetic region. Next, we repeatedly permute case-control labels and compute### Group lasso

Statistical association tests above should be conducted to each gene or each genetic region one at a time, so the individual tests cannot consider genetic correlations among genes. But, genes linked with each other on genetic pathways or genes that have an interaction effects on a phenotype outcome can have a functional relationship with each other. Since their correlations could be important information for genetic association studies, a statistical model that can includes all of genetic information is often preferred. Regularization procedures can be conducted to entire genomes in a regression framework, where a phenotype outcome is regressed on all of genetic data. The solution of regression coefficients can be achieved, constraining the parameter space of regression coefficients. This constraint actually enables to obtain the coefficient solution even if the number of genes is much greater than a sample size. Depending on the parameter space constraint, a type of regularization procedure is determined. Group lasso regularizes regression coefficients such that the sum of L

_{2}norm of the coefficients for each group is less than an arbitrary value [7]. The arbitrary value is corresponding to a tuning parameter value for sparsity. For a fixed tuning parameter λ, the estimated regression coefficients β=(β_{1},β_{2},…,β_{m})^{T}of group lasso maximizes where l(β) is a logistic likelihood, and m_{k}is the total number of genetic sites of the k-th gene, i.e., β*=(β*_{k}_{k1}, β_{k2},…,β*)*_{kmk}^{T}. The*L*_{2}norm of β*is defined as*_{k}In genomic data analysis with a group structure, group lasso sequentially selects the most outcome-related gene or genetic region. The selection results rely on the tuning parameter value λ since we have different coefficient estimates for a value of λ. As λ decreases, the number of selected genes is gradually increased. In general, we select the k-th gene if the estimated regression coefficients β

*are nonzero. In group lasso, the estimated regression coefficients of all genetic sites belong to selected genes or genetic regions are nonzero while all genetic sites of unselected genes or genetic regions have the exactly zero regression coefficients. Therefore, group selection can be performed based on the solution of regression coefficients for a fixed value of λ. Note that the numerical values of estimated coefficients β*_{k}*are not of interest since group lasso performs selection but not prediction. That is, we see if β*_{k}*= 0 or not.*_{k}In our simulation studies and real data analysis, we first started with a relatively large λ value which is large enough that the solution to all regression coefficients can be exactly zero. In this case, no genes are selected. We then gradually decreased the value of λ until a single gene has nonzero regression coefficients of its genetic sites. This gene is considered as the top rank gene. In the same way, we can find the second ranked gene as λ continues to be decreasing. Eventually, we can obtain a list of top ranked genes. Since we compared a particular number of top selected genes with the same number of top significant genes computed by multiple testing procedures, we didn't need to find the optimal λ value in the simulation studies and real data analysis. Group lasso has been widely applied for analysis of high-dimensional genomic data [8,16]. We used an R package ‘gglasso’ for simulation studies and real data analysis.

### Results

### Simulation studies

In this simulation study, we compared true positive rates of three group testing procedures and one selection procedure such as PCA, Hotelling's T

^{2}test, permutation test, and group lasso selection procedure when the same number of genes is identified by four statistical methods. We conducted two different simulation studies, where the first simulation assumed that all genes are independent with each other and the second simulation assumed that genes are correlated with each other according to a given network information.In the first simulation studies, we assumed that a single gene consists of 5 genetic sites. Numerical data of the five genetic sites within the same gene were generated from a multivariate normal distribution of N(µ,Σ), where a mean vector µ has a different value between cases and controls if the gene is causal, but µ has the same value between cases and controls if the gene is noncausal. The covariance matrix Σ represents a correlation pattern of the five genetic sites within the same gene and we assumed an AR(1) correlation matrix, i.e., ∑ = { σ i j } 1 ≤ i , j ≤ 5 = ρ i − j , where ρ is a correlation coefficient fixed as ρ=0.3,0.5, or 0.7 in the simulation. We considered 1,000 genes so we have a total of 5,000 genetic sites in the simulation data, where 100 cases and 100 controls were generated. Only 20 genes out of 1,000 genes are assumed to be causal. Note that the simulated 1,000 genes are independent with each other.

Three group testing procedures were applied to individual genes and the p-values of testing the mean difference between cases and controls were computed for 1,000 genes. The p-values of each testing procedure were then listed from the smallest to the largest. Finally, top 20, 30, and 40 genes were selected for each testing procedure based on the 20, 30, and 40 smallest p-values, respectively. In contrast, group lasso procedure sequentially selects the most outcome-related genes as a tuning parameter for sparsity is decreasing. Since we can easily control the tuning parameter value, we were able to select the top 20, 30, and 40 genes. True positives rate of the selected genes from the four statistical methods were computed along with two different values of µ=0.3 and 0.4. That is, only 20 causal genes were assumed to have a mean difference by either 0.3 or 0.4 between cases and controls. True positive rates stand for the proportion of correctly identified genes among the 20 causal genes. The simulation was repeated 100 times and averaged true positive rates of four statistical methods over 100 simulation replications were summarized in Fig. 1.

In Fig. 1, it appears that all of true positive rates are overall increased as the mean difference is increasing and the correlation coefficient is decreasing. In high-dimensional data analysis, it is often observed that detection power is decreased due to highly correlated variables. We can see the similar result of decreased true positive rates for highly correlated genetic sites. When we compared four statistical methods, both PCA and permutation test seem to have the largest true positive rates in all simulation settings, while Hotelling's T

^{2}test show the lowest true positive rates in all settings. The true positive rates of group lasso procedure seem to be higher than those of Hotelling's T^{2}test, but slightly lower than those of both PCA and permutation test. Consequently, we can conclude that group lasso procedure shows similar selection performance as the group hypothesis testing procedures in the first simulation study.In the second simulation study, we generated 1,000 genes where each 100 genes are truly correlated with each other according to the simulated genetic network in Fig. 2. So, we have 10 network groups each of which consists of 100 genes. Similar to the first simulation study, genetic data was generated from N(µ,Σ), where the covariance matrix Σ is an inverse of a precision matrix Ω In a Gaussian graphical model [17], nonzero entries of the precision matrix correspond to links between two genes of a network graph. Therefore, we could obtain the precision matrix Ω according to the given network in Fig. 2 in the same way described [12,18]. Our simulation data consists of 100 cases and 100 controls over 1,000 genes, where only 45 genes within the same network have a different mean µ between cases and controls. Let us denote the

*j*-th gene by χ*. Since our simulation study should be conducted for genetic sites with a group structure, we additionally generated 10 genetic sites for each gene. If the first*_{j}*N*genetic sites among 10 sites are causal and the other*10*-*N*genetic sites are noncasual, the*N*sites of the the*j*-th gene were generated such that χ_{jk}= χ*+*_{j}*N*(0,σ^{2}),*k*=*1*,*2*, …,*N*and χ_{jk}=*N*(0,σ^{2}) for*k*=*N*+*1*, …,*10*. Finally, we have a total of 10,000 genetic sites with 200 samples. In the simulation, we discarded all simulated genes χ*for*_{j}*j*=*1*,*2*, …,*1,000*. Instead, we used only 10,000 genetic sites where each 10 sites consist of one gene. In this simulation setting, 10 genetic sites within the same gene are not only correlated with each other, but they are also correlated with the other 10 genetic sites within the different genes that are linked with each other according to the genetic network.We also applied four statistical methods used in the first simulation study. We fixed the number of causal genetic sites per gene as

*N*= 2, 5, or 10. Since only 45 genes are causal, the number of causal sites is 90, 225, or 450 among 10,000 sites, respectively. The standard deviation σ to control a noise level was set to be 1.5, 2, or 2.5. Higher standard deviation is likely to produce stronger noises, so true positive rates are expected to be decreased. Simulation was repeated 100 times and averaged true positive rates of top 10 genes to top 100 genes selected by four methods are summarized in Fig. 3.In Fig. 3, PCA overall shows the best selection performance except when the noise level is either moderate or strong, and the number of causal genetic sites is small, i.e.,

*N*= 2 and σ= 2 or 2.5. As the number of causal sites in a gene is decreasing, the true positive rates of three group testing procedures seem to be decreasing together. However, we can see that the true positive rates of group lasso are almost the same regardless of the number of causal sites in a gene. As we mentioned earlier, group lasso enforces the regression coefficients of genetic sites in the selected genes to be nonzero even if only a few genetic sites are causal and majority is noncausal. In this reason, the selection performance of group lasso does not affected by the number of causal and noncausal sites in the same gene. When all of genetic sites in the same gene are causal (*N*= 10), both PCA and permutation test overwhelm the other two methods. Since computation of the test statistics of two methods is based on individual genetic sites, they should be statistically powerful when all of sites in the same gene are causal. In the second simulation, the Hotelling's T^{2}test shows the worst selection of true positives in all simulation settings. This is due to relatively high correlation among genetic sites in the same genes. We have already seen that the true positive rates of Hotelling's T^{2}test were drastically decreased as the correlation was increasing in the first simulation study.### Analysis of ovarian cancer DNA methylation data

Next, we applied four statistical methods to real ovarian cancer DNA methylation data. Ovarian cancer DNA methylation data generated from Illumina Infinium Human-Methylation27K Beadchip has been already applied to identify CpG sites and genes associated with ovarian cancer from some different studies [19,20,21]. The methylation data set consists of 20,461 CpG sites from 12,770 genes with 152 controls and 119 cases. Many genes have either one or two CpG sites and some genes have up to 9 CpG sites. Since our four statistical methods can be applied to genomic data with a group structure, and our main goal of this study is to compare group testing procedures with group selection procedure, we excluded genes that have only one CpG site in the analysis. So, we ended up with 14,627 CpG sites from 6,936 genes which have at least two CpG sites.

In the exactly same way used in the simulation studies, we identified top 20 genes for each method. Table 1 shows the top 20 genes and their p-values computed by PCA and Hotelling's T

^{2}test. Also, 20 selected genes by group lasso procedure are included in the table. For the permutation test, we permuted the data over 1,000,000 times, but we found that the empirical p-values of 246 genes are still less than 10^{–7}. Due to time limit, we cannot reduce down the number of genes in the top list of permutation test anymore. Therefore, we had to finalize with top 246 genes detected by permutation test. For each statistical method, Fig. 4 summarizes the number of overlapped genes among top 20 genes by the other three methods.In Fig. 4, it appears that 15 genes in the top 20 list of PCA is also in the top 20 list of Hotelling's T

^{2}test, while only 2 genes in the top 20 list of PCA is in the top 20 list of group lasso. Similarly, we can see that only 3 genes in the top 20 list of Hotelling's T^{2}test is in the top 20 list of group lasso. Top 246 genes detected by permutation test include all of the 20 genes detected by both PCA and Hotelling's T^{2}test, but only 7 genes among the 246 genes are overlapped with top 20 genes selected by group lasso. This result indicates that genes selected by group lasso are quite different from genes detected by three group testing procedures.Next, we identified top 246 genes by each of four different statistical methods since the minimum number of genes detected by permutation test is 246. Fig. 5 summarizes the number of overlapped genes among 246 genes by the other three methods. It seems that three group testing procedures of PCA, Hotelling's T

^{2}test and permutation test have from 164 (66.67%) to 205 (83.73%) overlapped genes with each other. In contrast, group lasso selection procedure have only 24 (9.76%) to 31 (12.60%) overlapped genes with three group testing procedures. In this result, we can conclude that most of genes selected by group lasso are very different from genes detected by multiple group testing procedures in high-dimensional DNA methlyation data analysis.In the first simulation study three group testing procedures and group lasso selection procedure show very similar selection performance. However, selection performance of four methods was quite different from each other in the second simulation study. We demonstrated that true positive selection could vary on the number of causal sites and the noise level. In analysis of ovarian cancer DNA methylation data, we found that group lasso identified quite different genes, compared with three group testing procedures. In general, regularization procedures like group lasso are known as a good alternative method to testing procedures when we identify disease or trait associated genes with high-dimensional genomic data, where the number of genes far exceeds the number of samples. However, our investigation found that many genes selected by group lasso are rarely overlapped with genes detected by three group testing procedures, which detected many overlapped genes. The similar situation was observed when the number of causal sites is small and the noise level is relatively large in our second simulation study. In that case, true positive rates of group lasso are much higher than those of three group testing procedures, where three testing procedures shows the almost same selection performance. In real DNA methylation data multiple genes are usually highly correlated with each other. Moreover, the number of causal sites could be small and the noise level could be large. But, further biological investigation with genes selected by group lasso should be conducted to figure out the main reason of this big discrepancy between group lasso and group tests in analysis of DNA methylation data.

### Discussion

In this article, we compared group testing procedures with group lasso selection procedure when high-dimensional genomic data with a group structure are used for case-control genetic association studies. In statistics, hypothesis testing and variable selection are regarded as totally different statistical methods since their objectives are different from each other. Therefore, the comparison of these two statistical approaches has not been studied much. However, in genetic association studies with high-dimensional genomic data, both testing and selection procedures have the same goal which is to identify genes of genetic regions that are associated with either a disease or a trait. Particularly, many types of high-dimensional genomic data consist of multiple groups where each gene or each genetic region contains some genetic sites or variants. So, our research focused on the comparison of group testing procedures and group lasso selection procedure.

In simulation studies, we found that the selection performance of group lasso and group testing procedures could be similar or very different from each other. It depends on data structure such as correlation strength and patterns among genes, the number of causal sites in a gene and noise levels. In real data analysis, it was surprising that group lasso identified many different genes that are not detected by group testing procedures in ovarian cancer association similar or very different from each other. It depends on data structure such as correlation strength and patterns among genes, the number of causal sites in a gene and noise levels. In real data analysis, it was surprising that group lasso identified many different genes that are not detected by group testing procedures in ovarian cancer association studies of DNA methylation data. Although group lasso is known as one of the most commonly used selection methods in statistics when the number of variables is much greater than a sample size, it shows unexpected selection results in ovarian cancer data analysis. In contrast, multiple group testing procedures including PCA, Hotelling's T

^{2}test and permutation test identified almost the same genes associated with ovarian cancer. Since multiple group testing procedures are still the most popular method for medical doctors and geneticists to apply for high-dimensional genomic data with a group structure, we might need to further investigate the validity of group lasso selection procedure in genetic association studies. In future study, our investigation will focus on finding additional reasons that we had many genes detected by group test procedures but missed by group lasso in ovarian cancer data.