### Introduction

### Microbiome Compositional Data

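Taking the last component $x_k$ as the reference, the alr transformation is defined as:

$$\mathrm{alr}(x) = \left(\log\frac{x_1}{x_k}, \log\frac{x_2}{x_k}, \ldots, \log\frac{x_{k-1}}{x_k}\right)$$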
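As an illustration, the alr transformation can be sketched in a few lines of Python. This is a minimal sketch, assuming the standard definition in which each component is divided by the reference component before taking logarithms; the function name `alr` and its `ref` parameter are hypothetical and not taken from any package.

```python
import math

def alr(x, ref=-1):
    """Additive log-ratio transform of a strictly positive composition.

    `ref` is the index of the reference component (assumption: the last
    component, x_k, is used by default).
    """
    if any(v <= 0 for v in x):
        raise ValueError("alr requires strictly positive components")
    ref = ref % len(x)  # allow negative indices such as -1
    # One log-ratio per non-reference component: k - 1 values in total.
    return [math.log(v / x[ref]) for i, v in enumerate(x) if i != ref]

composition = [0.2, 0.3, 0.5]
print(alr(composition))

# Rescaling the composition leaves the result unchanged, because only
# ratios between components enter the transformation.
print(alr([20, 30, 50]))
```

Zero counts would have to be replaced before applying the transformation, since the logarithm of zero is undefined.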

### Exploratory Analysis of Microbiome Data

Each entry $x_{ij}$ of the abundance table provides the number of sequences (reads) corresponding to taxon $j$ in sample $i$. Sometimes abundance tables are transposed, so that rows are taxa and columns are samples. Apart from the abundance table, other elements that may be available for microbiome analysis are the sample data, the taxonomy table, and the phylogenetic tree. Several R and Bioconductor packages, such as phyloseq, are designed to facilitate the integration of all these elements in a microbiome analysis [13].
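The layout of an abundance table can be illustrated with a toy example. The sketch below assumes a samples-by-taxa orientation; the sample labels, taxon labels, and counts are hypothetical values made up for illustration, and the helper `transpose` is not part of any package.

```python
# Hypothetical toy abundance table: rows are samples, columns are taxa.
samples = ["S1", "S2", "S3"]
taxa = ["taxon_a", "taxon_b", "taxon_c"]
counts = [
    [120, 30, 50],
    [10, 200, 40],
    [75, 0, 25],
]

def transpose(table):
    """Swap rows and columns (samples x taxa <-> taxa x samples)."""
    return [list(col) for col in zip(*table)]

# Entry x_ij: reads assigned to taxon j in sample i (0-based indices here).
i, j = 1, 1
print(f"{samples[i]} has {counts[i][j]} reads assigned to {taxa[j]}")

# In the transposed layout the same value sits at position (j, i).
taxa_by_sample = transpose(counts)
print(taxa_by_sample[j][i])

# Total number of reads per sample (row sums in the samples x taxa layout).
reads_per_sample = [sum(row) for row in counts]
print(dict(zip(samples, reads_per_sample)))
```

Checking the orientation before any analysis is worthwhile, because row-wise operations give different results on a transposed table.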

### Normalization

### Diversity analysis

*Alpha diversity: within-sample diversity*

The observed richness, $R_{obs}$, is the number of different species observed in the sample. The observed richness tends to underestimate the real richness in the environment, because the less frequent species are likely to go undetected. Several indices adjust for this by estimating the hidden part that has not been detected. One of the most widely used richness measures is the Chao1 index, defined as

$$\mathrm{Chao1} = R_{obs} + \frac{f_1^2}{2 f_2}$$

where $f_1$ is the number of species observed only once and $f_2$ is the number of species observed twice.

$p_i$ represents the relative abundance of the $i$-th taxon.
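The quantities above can be computed directly from the counts of a single sample. The following is a minimal sketch, assuming the classical form of the Chao1 estimator, observed richness plus $f_1^2 / (2 f_2)$; the function names are hypothetical, the counts are made up, and the fallback used when no species is observed exactly twice is the commonly used bias-corrected form, which is an assumption rather than something stated in the text.

```python
def observed_richness(counts):
    """Number of taxa with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)

def chao1(counts):
    """Classical Chao1 richness estimate for one sample."""
    r_obs = observed_richness(counts)
    f1 = sum(1 for c in counts if c == 1)  # taxa observed exactly once
    f2 = sum(1 for c in counts if c == 2)  # taxa observed exactly twice
    if f2 == 0:
        # Assumption: fall back to the bias-corrected form, since the
        # classical formula divides by f2.
        return r_obs + f1 * (f1 - 1) / 2
    return r_obs + f1 * f1 / (2 * f2)

def relative_abundances(counts):
    """Proportions p_i of each taxon in the sample."""
    total = sum(counts)
    return [c / total for c in counts]

sample = [10, 5, 1, 1, 1, 2, 2, 0, 0]  # hypothetical read counts
print(observed_richness(sample))
print(chao1(sample))
print(relative_abundances(sample))
```

The estimate exceeds the observed richness whenever some taxa are seen only once, which reflects the idea that rare species are likely to have been missed.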

*Beta diversity: between-sample diversity*

Let $p_1 = (p_{11}, \ldots, p_{1k})$ and $p_2 = (p_{21}, \ldots, p_{2k})$ denote the microbiome relative abundances of two different samples.

Let $b = (b_1, \ldots, b_r)$ represent the lengths of the different branches in the phylogenetic tree, and $q_1 = (q_{11}, \ldots, q_{1r})$ and $q_2 = (q_{21}, \ldots, q_{2r})$ the relative abundances associated with each branch for the first and the second sample, respectively.

Given two compositions $x_1$ and $x_2$, the Aitchison distance is given by

$$d_A(x_1, x_2) = d_E(\mathrm{clr}(x_1), \mathrm{clr}(x_2))$$

where $d_E$ denotes Euclidean distance.
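The Aitchison distance can be sketched as the Euclidean distance between centered log-ratio (clr) transformed compositions. This is a minimal sketch under that standard definition; the function names are hypothetical, and the compositions are assumed to be strictly positive (zeros would need to be replaced beforehand).

```python
import math

def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]  # raises ValueError for zero parts
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def euclidean(u, v):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def aitchison_distance(x1, x2):
    """Euclidean distance between the clr-transformed compositions."""
    return euclidean(clr(x1), clr(x2))

x1 = [0.1, 0.3, 0.6]
x2 = [0.2, 0.2, 0.6]
print(aitchison_distance(x1, x2))

# Scale invariance: raw counts and proportions give the same distance.
print(aitchison_distance([10, 30, 60], x2))
```

Because the clr transformation depends only on ratios between parts, the distance is unaffected by the total number of reads in each sample.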

### Ordination

Given a distance matrix $D$, PCoA performs eigenvalue decomposition of $D_c' D_c$, where $D_c$ is the centered distance matrix. When $D$ is the Euclidean distance, PCoA gives exactly the same result as PCA. Care must be taken with PCoA if the selected distance is not a metric, because some eigenvalues may then be negative and the graphical representation may not be reliable.
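The link between PCoA and PCA for Euclidean distances can be checked numerically. The sketch below uses the classical (Gower) double-centering of squared distances, which is the usual textbook formulation of PCoA and may differ in detail from the centered matrix $D_c$ described above. It verifies that the double-centered matrix equals the Gram matrix of the mean-centered coordinates, so both methods decompose the same matrix. The eigenvalue decomposition itself is omitted, and all function names and data points are hypothetical.

```python
def squared_distances(points):
    """Matrix of squared Euclidean distances between all pairs of samples."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]

def double_center(d2):
    """Gower centering: B = -1/2 * J * D2 * J, with J = I - (1/n) * ones."""
    n = len(d2)
    row_means = [sum(row) / n for row in d2]
    col_means = [sum(row[j] for row in d2) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [[-0.5 * (d2[i][j] - row_means[i] - col_means[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def centered_gram(points):
    """Inner products between mean-centered samples (the matrix behind PCA scores)."""
    n = len(points)
    means = [sum(col) / n for col in zip(*points)]
    centered = [[v - m for v, m in zip(p, means)] for p in points]
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in centered]
            for ci in centered]

points = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]]
b = double_center(squared_distances(points))
g = centered_gram(points)

# The two matrices agree up to floating-point error.
max_diff = max(abs(b[i][j] - g[i][j]) for i in range(4) for j in range(4))
print(max_diff)
```

With a non-Euclidean distance the double-centered matrix is no longer guaranteed to be a Gram matrix, which is where negative eigenvalues can come from.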

Given a distance matrix $D$, NMDS maximizes the rank-based correlation between the original distances and the distances between samples in the new reduced ordination space. The procedure starts from a random configuration, and the optimal representation is obtained through an iterative procedure that improves the rank correlation at each step.

### Microbiome Statistical Inference

### Multivariate differential abundance testing

$h(X)$ measures the relationship between microbiome composition and the outcome. This association can be tested according to the following hypothesis:

$$H_0: h(X) = 0$$

$D$ of pairwise distances between individuals. KMR is implemented for microbiome analysis in the R package MiRKAT [28]. KMR can be adapted to CoDA by using a subcompositionally dominant distance, such as the Aitchison distance. Rivera-Pinto [29] has implemented this adaptation in the R package MiRKAT-CoDA. The algorithm also includes a weighted version that allows the identification of the taxa that are most relevant for the joint association.

### Univariate differential abundance testing

### Microbial signatures

Let $x = (x_1, x_2, \ldots, x_k)$ be the microbial composition of $k$ taxa and, among these $k$ taxa, consider two disjoint subgroups of taxa, group $A$ and group $B$, with abundances denoted by $x_A$ and $x_B$, containing $k_A$ and $k_B$ different taxa and indexed by $I_A$ and $I_B$, respectively. The abundance balance between $A$ and $B$, denoted by $B(A, B)$, is defined as the log-ratio between the geometric mean abundances of the two groups of taxa as follows:

$$B(A, B) \propto \log\frac{\left(\prod_{i \in I_A} x_i\right)^{1/k_A}}{\left(\prod_{j \in I_B} x_j\right)^{1/k_B}} = \frac{1}{k_A}\sum_{i \in I_A}\log x_i - \frac{1}{k_B}\sum_{j \in I_B}\log x_j$$

The larger the value of $B(A, B)$, the more abundant group $A$ is with respect to group $B$. Positive values of $B(A, B)$ arise when group $A$ is more abundant than group $B$, while negative values of $B(A, B)$ correspond to a larger abundance of group $B$ relative to group $A$. A value of $B(A, B) = 0$ corresponds to a perfect balance between the abundances of both groups of taxa.
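The sign behaviour of the balance can be illustrated with a small numerical sketch. It assumes the balance is computed as the difference between the mean log-abundances of the two groups, which equals the log-ratio of their geometric means, and it omits any normalization constant; the function name `balance`, the index lists, and the compositions are hypothetical.

```python
import math

def balance(x, idx_a, idx_b):
    """Log-ratio between the geometric means of groups A and B.

    Assumption: no normalization constant is applied, so the value is
    only proportional to a normalized balance.
    """
    if set(idx_a) & set(idx_b):
        raise ValueError("groups A and B must be disjoint")
    mean_log_a = sum(math.log(x[i]) for i in idx_a) / len(idx_a)
    mean_log_b = sum(math.log(x[j]) for j in idx_b) / len(idx_b)
    return mean_log_a - mean_log_b

group_a, group_b = [0, 1], [2, 3]

# Group A more abundant than group B: positive balance.
print(balance([0.4, 0.4, 0.1, 0.1], group_a, group_b))

# Group B more abundant than group A: negative balance.
print(balance([0.1, 0.1, 0.4, 0.4], group_a, group_b))

# Equal geometric means: the balance is zero.
print(balance([0.25, 0.25, 0.25, 0.25], group_a, group_b))
```

Swapping the two groups changes only the sign of the balance, so the choice of which group is called $A$ determines the direction of the interpretation.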

The goal is to identify two groups of taxa, group $A$ and group $B$, whose abundance balance $B(A, B)$ is most associated with an outcome of interest $Y$. For instance, for a binary outcome $Y$ corresponding to disease status ($Y = 1$ for diseased and $Y = 0$ for not diseased), if we are able to identify the two groups of taxa $A$ and $B$ whose balance is associated with $Y$, we may use $B(A, B)$, the relative abundance between groups $A$ and $B$, as a microbial signature of disease risk. If large values of $B(A, B)$ are associated with $Y = 1$, we will infer that a person with a larger relative abundance of group $A$ with respect to group $B$ has a higher risk of disease than people with a lower relative abundance of $A$ with respect to $B$.