Multi-block dataset
We focus on analyzing a multi-block dataset while taking the characteristics of each block into account. A multi-block dataset is formed by horizontally concatenating more than two variable blocks. The blocks usually differ in their properties and forms and may be partitioned on the basis of prior knowledge, but all blocks share the same number of observations. Suppose there are $K$ blocks and the $k$-th block has $p_k$ variables ($k = 1, \dots, K$). We can express the $k$-th block as $X_k = [x_1, \dots, x_{p_k}]$ and the full dataset as $X = [X_1, \dots, X_k, \dots, X_K]$.
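As a minimal sketch of this layout, the following Python snippet builds a toy dataset from three blocks that share the same observations. The helper name `concatenate_blocks`, the block sizes, and the simulated values are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def concatenate_blocks(blocks):
    """Horizontally concatenate variable blocks that share the same rows."""
    n_obs = {block.shape[0] for block in blocks}
    if len(n_obs) != 1:
        raise ValueError("all blocks must have the same number of observations")
    return np.hstack(blocks)

rng = np.random.default_rng(0)
n = 6                                                 # toy number of observations
X1 = rng.integers(0, 3, size=(n, 4)).astype(float)    # block 1: p_1 = 4 discrete variables
X2 = rng.normal(size=(n, 3))                          # block 2: p_2 = 3 continuous variables
X3 = rng.integers(0, 2, size=(n, 2)).astype(float)    # block 3: p_3 = 2 binary variables

X = concatenate_blocks([X1, X2, X3])
print(X.shape)  # rows = observations, columns = p_1 + p_2 + p_3
```

Keeping the blocks in a list alongside the concatenated matrix makes it easy to treat each block according to its own characteristics later on.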
As technology advances, this type of data has become common in a variety of studies. For instance, in food science, the blocks of variables could be physico-chemical measurements, sensory analysis data, and instrumental measurements [6]. In the medical field, the typical multi-block dataset is multiomics, a new approach to biological analysis in which the datasets are multiple omics, such as the genome, proteome, transcriptome, epigenome, and microbiome [7–9]. Such analysis usually focuses on associations between SNPs and traits, considering varying phenotypes.
Generalized canonical correlation analysis
Among the various methods for handling a multi-block dataset, we use GCCA in this paper. Since GCCA is an extension of CCA, we begin with an explanation of CCA. CCA is a method of inferring information from cross-covariance matrices. Given two vectors of random variables, $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_m)$, whose variables are correlated, CCA finds the linear combinations of the $x_i$ and of the $y_j$ that are maximally correlated with each other; these combinations are termed canonical variables [10].
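To make the idea concrete, the sketch below computes the first pair of canonical variables for two small simulated data matrices. It uses one standard route (whitening each set and taking the singular value decomposition of the whitened cross-covariance matrix) and assumes both covariance matrices are invertible. The function names and the simulated data are illustrative assumptions.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs / np.sqrt(vals)) @ vecs.T

def first_canonical_pair(X, Y):
    """Weights (a, b) and correlation of the first pair of canonical variables."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1)          # within-set covariance of X
    Syy = Yc.T @ Yc / (n - 1)          # within-set covariance of Y
    Sxy = Xc.T @ Yc / (n - 1)          # cross-covariance between X and Y
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    a = Wx @ U[:, 0]                   # weights for the variables in X
    b = Wy @ Vt[0]                     # weights for the variables in Y
    return a, b, s[0]                  # s[0] is the first canonical correlation

rng = np.random.default_rng(0)
n = 500
shared = rng.normal(size=n)            # latent signal present in both sets
X = np.column_stack([shared + rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([rng.normal(size=n), shared + rng.normal(size=n), rng.normal(size=n)])

a, b, rho = first_canonical_pair(X, Y)
u = (X - X.mean(axis=0)) @ a           # canonical variable for X
v = (Y - Y.mean(axis=0)) @ b           # canonical variable for Y
print(rho, np.corrcoef(u, v)[0, 1])    # the two values should agree
```

The largest singular value of the whitened cross-covariance matrix equals the sample correlation between the two canonical variables, which the final line checks.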
GCCA extends CCA to more than two sets of random variables after removing the dependencies within each set. The basic structure of CCA is to derive, for each set, a new linear combination of its variables, called a canonical variable, and to estimate the correlation between the canonical variables. In other words, canonical variables summarize in abbreviated form the information inherent in each set of multivariate data [11]. GCCA methods can be divided into two types: those based on correlations and those based on covariances. In this paper, we use the covariance-based method, which uses the variance of the ‘block scores’ to compute the residual matrices. For the variable block $X_k$, we denote by $a_k = (a_{k1}, a_{k2}, \dots, a_{kp_k})'$ the coefficients of the variables in $X_k$. The canonical variables $y_k$ ($k = 1, \dots, K$) are then expressed as $y_k = X_k a_k$.
The optimization problem is as follows.
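The display below is a sketch of the objective written out from the description in this section. The sum over block pairs, the weights $c_{ij}$, and the function $g$ follow the text; the unit-norm constraint on each coefficient vector is an assumption taken from the usual covariance-based formulation, since the normalization is not stated here.

```latex
\max_{a_1, \dots, a_K} \; \sum_{i \neq j} c_{ij} \, g\bigl( \mathrm{cov}(X_i a_i, \, X_j a_j) \bigr)
\quad \text{subject to} \quad \lVert a_k \rVert_2 = 1, \quad k = 1, \dots, K
```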
The optimization problem seeks the block coefficients $a_1, a_2, \dots, a_K$ that maximize the weighted sum of the covariances between pairs of block components. The weight $c_{ij}$ encodes the relationship between variable blocks $X_i$ and $X_j$: the design matrix $C = (c_{ij})$ is pre-specified by the user, with $c_{ij} = 1$ if blocks $i$ and $j$ are connected and $c_{ij} = 0$ otherwise [12, 13]. The function $g(\cdot)$ can take various forms, such as the Horst ($g(x) = x$), centroid ($g(x) = |x|$), and factorial ($g(x) = x^2$) schemes. Among these, we applied the Horst scheme.
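The following Python sketch illustrates one way such a problem can be solved for the first component under the Horst scheme. It is a block-wise ascent written for illustration, not the implementation used in this paper, and it assumes unit-norm coefficient vectors, column-centered blocks, and a symmetric design matrix with a zero diagonal. The function name `gcca_horst` and the simulated blocks are hypothetical.

```python
import numpy as np

def gcca_horst(blocks, C, n_iter=200, tol=1e-10):
    """First-component weights for covariance-based GCCA with g(x) = x.

    blocks : list of (n, p_k) arrays sharing the same n
    C      : (K, K) symmetric 0/1 design matrix with a zero diagonal
    Returns the unit-norm weights and the objective value after each sweep.
    """
    n, K = blocks[0].shape[0], len(blocks)
    Xs = [X - X.mean(axis=0) for X in blocks]                      # center each block
    a = [np.ones(X.shape[1]) / np.sqrt(X.shape[1]) for X in Xs]    # unit-norm start
    y = [X @ w for X, w in zip(Xs, a)]                             # block components

    def objective():
        # weighted sum of covariances between connected block components
        return sum(C[i, j] * (y[i] @ y[j]) / (n - 1)
                   for i in range(K) for j in range(K) if i != j)

    history = [objective()]
    for _ in range(n_iter):
        for k in range(K):
            # Horst scheme: plain sum of the components of the connected blocks
            z = sum((C[k, j] * y[j] for j in range(K) if j != k), np.zeros(n))
            w = Xs[k].T @ z
            norm = np.linalg.norm(w)
            if norm > 0:
                a[k] = w / norm                                    # best unit-norm update
                y[k] = Xs[k] @ a[k]
        history.append(objective())
        if history[-1] - history[-2] < tol:
            break
    return a, history

rng = np.random.default_rng(0)
n = 100
t = rng.normal(size=n)                                             # shared latent signal
blocks = [np.outer(t, rng.normal(size=p)) + rng.normal(size=(n, p)) for p in (5, 3, 2)]
C = np.ones((3, 3)) - np.eye(3)                                    # every pair connected
weights, history = gcca_horst(blocks, C)
print(history[0], history[-1])
```

Each inner update maximizes the objective over one block's coefficients while the others are held fixed, so the objective should not decrease from one sweep to the next.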
Data description: KARE
In this paper, we use data from the KARE project, which was initiated in 2007 to undertake a large-scale genome-wide association analysis among 10,038 participants from two areas, Anseong and Ansan. It was established as part of the Korean Genome Epidemiology Study (KoGES) in 2001 and provides genomic and clinical variables for over 260 traits [14].
From the KARE data, we constructed three variable blocks containing SNP variables, phenotype variables, and disease variables.
Fig. 1 shows the multi-block form of the KARE data, which has three variable blocks, each with different characteristics. The first variable block, $X_1$, is a block of SNP variables; the second, $X_2$, is a phenotype block with five phenotype variables related to obesity; and the last, $X_3$, is a disease block containing the observed status of diabetes and hypertension.
The first variable block, $X_1$, contains 35 SNP variables, each coded as 0, 1, or 2 according to genotype. We extracted these 35 SNP variables following the steps described in Fig. 2. The original KARE dataset has 311,779 variables, of which we took as our main interest the 324 SNP variables identified in the literature on Multi-QMDR analysis; these SNPs showed strong marginal effects in univariate linear regression models [15]. From these 324 variables, we selected 47 that showed a significant relationship with the phenotype variables in the phenotype block. Lastly, to see the correlations between variables clearly, we removed extremely similar SNP variables whose correlation with another SNP exceeded 0.98, leaving 35 SNP variables.
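The last filtering step can be sketched as follows. The text does not specify which member of a highly correlated pair was dropped or whether the absolute correlation was used, so the greedy rule below (keep the first SNP, drop later ones whose absolute Pearson correlation with a kept SNP exceeds the threshold) is an assumption, as are the function name and the toy SNP labels.

```python
import numpy as np

def drop_highly_correlated(X, names, threshold=0.98):
    """Keep a column only if |r| <= threshold with every column kept so far."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= threshold for i in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(50, 4)).astype(float)    # toy genotypes coded 0/1/2
genotypes = np.column_stack([genotypes, genotypes[:, 0]])     # 5th SNP duplicates the 1st
names = ["snp_a", "snp_b", "snp_c", "snp_d", "snp_e"]         # hypothetical labels

filtered, kept_names = drop_highly_correlated(genotypes, names)
print(kept_names)
```

Because the rule is order-dependent, a different column ordering can retain a different representative of each near-duplicate group.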
The second variable block, $X_2$, is a block of phenotype variables that have been shown to be related to obesity. Five phenotype variables were selected: suprailiac skinfold, subscapular skinfold, body mass index (BMI), waist-hip ratio, and waist. The third variable block, $X_3$, is a block of diseases. Two disease variables were derived from the participants’ clinical traits. Participants whose fasting blood glucose was higher than 126 mg/dL, whose blood glucose 120 minutes into the oral glucose tolerance test was higher than 200 mg/dL, or who were taking medication for diabetes were considered as having diabetes. Participants whose systolic blood pressure was over 140 mmHg, whose diastolic blood pressure was over 90 mmHg, or who were taking medication for hypertension were considered as having hypertension.
Table 1 shows the composition of each group under our disease definitions. After excluding individuals with missing values in any of the variables used in this process, the final sample size of the study was 7,389.
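A disease indicator of this kind (any measurement above its cut-off, or medication reported) can be sketched as below. Only the diabetes cut-offs of 126 and 200 come from the text; the field names, the helper `has_disease`, and the toy records are hypothetical. The same helper would apply to the hypertension rule with its two cut-offs of 140 and 90.

```python
def has_disease(record, cutoffs, medication_key):
    """True if any measurement exceeds its cut-off or medication is reported."""
    exceeds = any(
        record.get(key) is not None and record[key] > cutoff
        for key, cutoff in cutoffs.items()
    )
    return exceeds or bool(record.get(medication_key))

# cut-offs taken from the diabetes definition; field names are assumptions
DIABETES_CUTOFFS = {"fasting_glucose": 126, "glucose_120min": 200}

participants = [
    {"fasting_glucose": 98, "glucose_120min": 130, "diabetes_medication": False},
    {"fasting_glucose": 131, "glucose_120min": 180, "diabetes_medication": False},
    {"fasting_glucose": 101, "glucose_120min": 150, "diabetes_medication": True},
]

diabetes = [int(has_disease(p, DIABETES_CUTOFFS, "diabetes_medication")) for p in participants]
print(diabetes)  # [0, 1, 1]
```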
Sex and age were considered as potential covariates that could affect our association analysis; for instance, coronary heart disease (CHD) is more common in men than in women. In addition, the risk of CHD increases with age in both sexes, but the increase is sharper in women [16]. Therefore, we divided the 7,389 observations into four groups based on sex and a median age of 50 years. Group 1 consists of males below the median age, and group 2 of males above it. Groups 3 and 4 consist of females below and above the median age, respectively. For each group, our data were composed of three different blocks ($K = 3$): a gene data block, a clinical data block, and a disease status block. As with many multi-block genomic datasets, the blocks of the KARE data have very different characteristics. The gene data block has 35 SNP variables that are discrete and can only take the values 0, 1, or 2. The five phenotype variables of the clinical data block are continuous. The disease status block consists of only two dummy variables, indicating the presence or absence of each disease.
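The grouping rule can be written as a small function. How participants aged exactly 50 were assigned is not stated, so treating the median itself as "above" is an assumption, as are the sex codes and the toy records.

```python
def assign_group(sex, age, median_age=50):
    """Groups 1-2: males below/above the median age; groups 3-4: females."""
    older = age >= median_age  # assumption: the median age itself counts as "above"
    if sex == "M":
        return 2 if older else 1
    return 4 if older else 3

toy_participants = [("M", 43), ("M", 61), ("F", 49), ("F", 50)]
groups = [assign_group(sex, age) for sex, age in toy_participants]
print(groups)  # [1, 2, 3, 4]
```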
To check the association between blocks, simple logistic regression analysis was performed for diabetes and for hypertension on each variable in the phenotype and genotype blocks. The coefficients and p-values are listed in Table 2.
In Table 2, every variable in the phenotype block was statistically significant for diabetes, each with a p-value less than 0.001. In the regression on hypertension, however, the subscapular skinfold variable did not reach significance, with a p-value of 0.578. All other variables in the phenotype block showed a significant relationship with hypertension. Therefore, with this one exception, the phenotype variables are associated with both diseases. In contrast to the association between phenotype variables and disease variables, the association between SNP variables and disease variables was observed only for certain SNPs. The SNPs related to diabetes were rs3856726, rs3782889, rs12229654, rs11066280, rs17092358, rs1378942, and rs11131794. The SNPs related to hypertension were rs221097, rs6965746, rs4472504, rs9939609, rs3782889, rs12229654, rs11066280, rs7136259, rs2254613, and rs1378942. A previous study of SNPs associated with diabetes found that the MYL2, C12orf51, and OAS1 genes were significantly associated with 1-hPG, which is understood to be an additional risk factor for type 2 diabetes. The relationship of rs3782889 (MYL2) and rs12229654 (MYL2) with diabetes is therefore supported by earlier findings [17].
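For reference, a simple (one-predictor) logistic regression with a Wald test on the slope can be sketched in plain Python as below. The software used for Table 2 is not stated in this section, so this is only an illustration of the technique; the Newton-Raphson fit, the function names, and the toy data are assumptions and do not reproduce any value in Table 2.

```python
import math

def score_and_information(b0, b1, x, y):
    """Gradient (g0, g1) and information-matrix entries (h00, h01, h11)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))   # fitted probability
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11

def simple_logistic(x, y, n_iter=50, tol=1e-10):
    """Fit logit P(y = 1) = b0 + b1 * x; return b0, b1, and the Wald p-value for b1."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0, g1, h00, h01, h11 = score_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det           # Newton-Raphson step
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    _, _, h00, h01, h11 = score_and_information(b0, b1, x, y)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))  # from the inverse information matrix
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))      # two-sided normal tail probability
    return b0, b1, p_value

# toy predictor and binary outcome, for illustration only
bmi = [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
status = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
b0, b1, p_value = simple_logistic(bmi, status)
print(b0, b1, p_value)
```

Running the same fit once per variable and per disease yields a table of coefficients and p-values with the same layout as Table 2.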