Understanding the role of the microbiome in human health and how it can be modulated is becoming increasingly relevant for preventive medicine and for the medical management of chronic diseases. The development of high-throughput sequencing technologies has boosted microbiome research through the study of microbial genomes and allowing a more precise quantification of microbiome abundances and function. Microbiome data analysis is challenging because it involves high-dimensional structured multivariate sparse data and because of its compositional nature. In this review we outline some of the procedures that are most commonly used for microbiome analysis and that are implemented in R packages. We place particular emphasis on the compositional structure of microbiome data. We describe the principles of compositional data analysis and distinguish between standard methods and those that fit into compositional data analysis.

The study of the human microbiome and its role in human health is an active area of research. The human microbiome is involved in a large number of essential functions, like food digestion and modulation of the immune system, and alterations in microbiome composition may have important effects on human health. Many diseases have already been found to be associated with changes in the human microbiome. Different studies have shown that obesity is indeed partly determined by the composition of our gut microbiome. Chronic inflammatory skin conditions such as psoriasis, atopic dermatitis, acne and chronic skin ulcers have been associated to cutaneous microbiome changes. The colonic microbiota is suspected to be involved in the development of colorectal cancers. Inflammatory bowel diseases have long been associated to interactions between microbes and the host since the microbiome is essential for the activation of host immune responses. Microbial diversity is significantly diminished in Crohn disease. Early childhood antibiotic exposure has been associated with significantly increased risk for Crohn disease [

The terms microbiome and microbiota are used indistinctly to describe the community of microorganisms that live in a given environment. High-throughput DNA sequencing technologies have powered microbiome research by enabling the study of the genomes of all microorganisms of a given environment and a more precise quantification of microbiome abundances and function.

Amplicon sequencing relies on sequencing a phylogenetic marker gene after polymerase chain reaction (PCR) amplification. For bacteria and archaea, the marker gene is the 16S ribosomal RNA gene that encodes the RNA component of the small ribosomal subunit. The 16S rRNA gene contains both highly conserved areas and hypervariable sites, denoted as V1–V9. The conserved regions can be targeted with PCR primers while the hypervariable regions are specific to each microbial species and make possible to distinguish the different microbes. The V1–V3 and V4 regions are most commonly targeted. PCR amplification creates thousands to millions of copies (amplicons) of the DNA target region. PCR amplicons are then sequenced using high-throughput sequencing platforms and multiple nucleotide sequences, also known as reads, are obtained [

There are a number of bioinformatic pipelines available for processing microbiome 16S sequence data, the two most popular for amplicon sequencing are mothur [

Preprocessing and quality control filtering consists on first assign the sequences to samples (demultiplexing) and then sequences are quality filtered to remove too short sequences, too many ambiguous base pairs and chimeras. OTU binning is the process of clustering similar DNA sequences into OTUs, that is, groups of DNA sequences with at least 97% similarity. The different sequences assigned to an OTU are represented by a consensus sequence determined by the most common nucleotide at each position. Taxonomy assignment is then obtained by comparing OTU consensus sequences to microbial 16S rRNA reference databases such as GreenGenes (

Shotgun metagenomics sequencing involves sequencing the total microbial DNA of a sample, instead of just a particular marker gene. With this technique, we can infer the relative abundance of every microbial gene and quantify specific metabolic pathways to predict the potential functionality of the entire community. This is achieved by mapping the obtained sequences against a database such as Kyoto Encyclopedia of Genes and Genomes (KEGG;

From a statistical point of view, the output of both microbiome approaches, amplicon and shotgun sequencing, is similar: an abundance table of counts representing the number of sequences per sample for a specific taxon or the number of sequences matching a specific gene function. In this paper we illustrate the methodologies with data from 16S rRNA amplicon sequencing but most approaches also apply for microbiome shotgun metagenomics.

There are many reasons why the analysis of microbiome data is so challenging. On one hand, we face the usual challenges of count data analysis, i.e., skewed distribution, zero inflation and over-dispersion. Because of the experimental process and quality control filtering, microbiome data is very noisy and the total number of counts per sample is highly variable, which requires some normalization prior to the analysis so that the microbiome abundances among the different samples are comparable. Abundance tables are usually sparse since many species are infrequent. There is much redundant information because of co-abundance of many species. Moreover, the total number of counts per sample is constrained by the maximum number of sequence reads that the DNA sequencer can provide. This total count constraint induces strong dependencies among the abundances of the different taxa characterizing the compositional structure of microbiome data. Ignoring the compositionality of microbiome data may yield spurious results. In section 2, we describe the main principles of compositional data analysis.

The statistical analysis of microbiome abundance data usually starts with the normalization of the data followed by an exploratory study of the microbiome composition for the identification of possible data structures. The exploratory part consists of the analysis of diversity measures and their visualization through ordination plots, a term used in ecology to refer to several multivariate techniques for visualization of species abundance in a low-dimensional space. Subsequently, an inference analysis is performed where microbiome composition is tested for association with a variable of interest; this is known as differential abundance testing when the outcome of interest is dichotomous (i.e., disease status). These association tests can be multivariate, when the interest is to assess for global differences in microbial composition between sample groups, or univariate, with the aim of identifying which taxa are differentially abundant between sample groups. However, as we discuss later, univariate approaches for microbiome analysis are questionable and their results should be regarded with caution.

In sections 3 and 4, we describe the procedures that are commonly performed in a microbiome statistical analysis: normalization, diversity analysis, ordination and differential abundance testing, both, multivariate and univariate. This is not intended to be an exhaustive or systematic review of all the available methods. We outline some of the most widely used techniques for microbiome analysis, especially those that are implemented in R packages. We distinguish between standard methods and those that fit into compositional data analysis.

Microbiome data is compositional because the information that abundance tables contain is relative. In a microbiome abundance table, the total number of counts per sample is highly variable and constrained by the maximum number of DNA reads that the sequencer can provide. This total count constraint induces strong dependencies among the abundances of the different taxa; an increase in the abundance of one taxon implies the decrease of the observed number of counts for some of the other taxa so that the total number of counts does not exceed the specified sequencing depth. Moreover, observed raw abundances and the total number of reads per sample are non-informative since they represent only a fraction or random sample of the original DNA content in the environment. These characteristics of microbiome abundance data clearly fall into the notion of compositional data.

Compositional data are defined as a vector of strictly positive real numbers

with a constraint or non-informative total sum. The elements of a composition are called components or parts. In a composition the value of each component is not informative by itself and the relevant information is contained in the ratios between the components or parts [

Mathematically, the assertion that the relevant information is contained in the ratios between the components implies that two proportional compositions are equally informative and this induces equivalence classes of vectors carrying the same information. Two vectors are compositionally equivalent if they are proportional. Each equivalence class has a representative in the unit simplex defined as:

The simplex is thus the sample space of compositional data. In microbiome analysis, for example, both the raw counts and their transformation into relative abundances or proportions belong to the same equivalence class and they carry the same relative information.

Three important conditions should be fulfilled for a proper analysis of compositions: permutation invariance, scale invariance and sub-compositional coherence [

Aitchison [

The generalization of a log-ratio is a log-contrast function defined as a linear combination of logarithms of the components with the restriction that the sum of the coefficients is equal to 0:

Log-contrast functions are suitable for CoDA because they are scale invariant.

As an alternative to working in the simplex, several data transformations have been proposed that transform compositional data to the real space where classical statistical analysis can be applied. All of them are based on log-ratios between components.

The additive log-ratio transformation (alr) is the first proposal introduced by Aitchison [_{k}, the alr transformation is defined as:

Aitchison also defined the centered log-ratio transformation (clr) to treat the parts symmetrically. The clr transformation is given by:

where g(x)=

The third alternative is the isometric log-ratio transformation (ilr) and consists in the representation of a composition given a particular orthonormal basis in the simplex. It overcomes the problem of the singular covariance matrix present in the clr-transformation. For a detailed description see Egozcue et al. [

The main element of a microbiome study is the microbiome abundance table, a matrix of counts, X, with n rows (samples) and k columns (taxa) where each entry _{ij}

The large variability of the total counts per sample prevents meaningful comparisons of raw abundances between individuals. This is usually addressed through normalization of raw counts before the analysis. The most simple and frequently used normalization is the computation of relative abundances by dividing the raw abundances by the total number of counts per sample. Another popular normalization approach is rarefaction, which consists on subsampling the same number of reads for each sample so that all samples have the same number of total counts. Rarefaction is not recommended because it entails the loss of important information [

CoDA techniques do not require the normalization step because the log-ratio approach involves working with ratios between components and this cancels the effect of the total counts per sample. Instead, CoDA methods entail the imputation of zeros. Microbiome abundance tables are sparse, they contain many zeros, and this should be properly addressed before compositional data methods can be applied. The simplest approach is to replace zeros by a small pseudo-count or to add a small constant to all the elements of the abundance matrix. As an alternative, Martín-Fernández et al. [

The diversity of the microbiome is an important indicator of the good or bad conditions of the ecosystem, with larger microbiome diversity being usually associated to better health status. Microbiome diversity can be assessed through multiple ecological indices that can be divided into two kind of measures, alpha and beta diversity. Alpha diversity measures the variability of species within a sample while beta diversity accounts for the differences in composition between samples. The R package vegan provides a large set of diversity measures [

The most important measure of alpha diversity is richness, defined as the number of different species present in an environment. Richness is estimated by the observed richness, _{obs}

where _{1}_{2}

Another important indicator of alpha diversity is evenness, which measures the homogeneity in abundance of the different species in a sample. A commonly used measure of evenness is the Shannon index defined as

where _{i}

Beta diversity measures the differences in microbiome composition between samples. There is a wide range of ecological distances or dissimilarities for measuring how close are two microbial compositions. The most commonly used are Bray-Curtis, UniFrac and weighted UniFrac distances. We also define the Aitchison distance which is a proper distance for compositional data.

Let _{1}= (_{11},…,p_{1k}_{2}= (_{21},…,p_{2k}

Bray-Curtis is defined as follows:

UniFrac family of distances [_{1}= (q_{11},… ,q_{1r}_{21},… ,q_{2r}

The unweighted UniFrac distance measures the relative length of those branches that lead exclusively to species present in only one of the two samples with respect to the total length of all branches in the tree:

The unweighted UniFrac distance only takes into account the presence or absence of the taxa but Lozupone et al. [

For a proper CoDA analysis, a distance must be subcompositionally dominant, which means that the distance between two points in a multi-dimensional space should always be larger than their distance when projected in a lower dimensional space (sub-composition). Most commonly used distances in microbiome analysis, such as, the Bray-Curtis and the weighted and unweighted UniFrac distances are not sub-compositionally dominant, and this may induce sub-compositionally incoherencies that question the reliability of the results of any distance-based analysis [

The Aitchison distance is a sub-compositionally coherent distance defined as the Euclidean distance after the clr-transformation of the compositions. Given two compositions _{1}_{2}

where _{E}

The goal of ordination plots is the visualization of beta diversity for identification of possible data structures. The multidimensional data is represented into a reduced number of orthogonal axes while keeping the main trends of the data and preserving the distances among samples as much as possible. Most commonly used ordination methods for microbiome data are principal coordinates analysis (PCoA), also known as multidimensional scaling, and non-metric multidimensional scaling (NMDS) [

PCoA an extension of Principal Components Analysis (PCA). Given a distance or dissimilarity matrix, _{c}_{c}_{c}

In order to avoid this problem NMDS is more commonly used. Also based on a distance matrix

Ordination plots can be obtained with the R and Bioconductor packages vegan and phyloseq, among others [

Multivariate differential abundance testing refers to a global test of differences in microbial composition between two or more groups of samples. We can distinguish between distance-based or model-based approaches.

Permutational Multivariate Analysis of Variance Using Distance Matrices, PERMANOVA [

A related and popular distance-based approach is the analysis of similarities [

An interesting model-based approach for multivariate microbiome analysis is Kernel machine regression (KMR), that extends PERMANOVA to a regression framework [

or as a semiparametric logistic regression model for a dichotomous response variable

In the context of microbiome analysis, X is the microbiome abundance matrix and the non-parametric component

The nonparametric component is related to a Kernel matrix that is a transformation of the distance matrix

Among the different model-based methods for microbiome differential abundance testing we highlight the work by La Rosa et al. [

Le Cao et al. [

When significant global differences in microbiome composition are detected between groups of samples, a natural question arises: which particular taxa are responsible of that global difference? A common strategy to answer this question is to test every taxa separately for association with the response variable. When the response variable is dichotomous this is known as univariate differential abundance testing.

Below we describe both, classical and CoDA approaches for univariate differential abundance testing. However, we advise that classical univariate approaches are notably affected by the compositional structure of microbiome data and their results, with large false discovery rates, might be questioned [

Nonparametric tests, like the Wilcoxon rank-sum test or the Kruskal-Wallis test, can be applied. However, more powerful parametric approaches are available, such as the Bioconductor packages edgeR [

Two CoDA methods that explicitly accounts for the compositional nature of microbiome data are ANCOM [

Recently, Rivera-Pinto et al. [

Let _{1},x_{2},...,x_{k}_{A}_{B}_{A}_{B}_{A}_{B}

where C is a normalization constant. The larger the values of balance

The goal of the proposed algorithm is the identification of the two groups of microbial taxa, group

The algorithm for the selection of microbial balances is implemented in the R package selbal. It starts with a first thorough search of the two taxa whose balance, or log-ratio, is most associated with the response variable. Once the first two-taxon balance is selected, the algorithm performs a forward selection process where, at each step, a new taxon is added to the existing balance such that the specified optimization criterion is improved (area under the receiver operating characteristic or mean squared error). The algorithm stops when there is no additional variable that improves the current optimization parameter or when the maximum number of components to be included in the balance is achieved. This number is established with a cross-validation procedure, which is also used to explore the robustness of the identified balance.

In this work we present some of the techniques that are most commonly used for microbiome analysis. We place a particular emphasis on those methods that preserve the principles of compositional data analysis.

Classical methods that ignore the compositional nature of microbiome data can result in spurious correlations and sub-compositional incoherencies. This is especially relevant for classical univariate test where the strong dependencies between microbial abundances results in an important increase of type I error. Simulation studies show that the false discovery rate increases as the true-positive fold change increases and that it can achieve unacceptable extremely large values [

There is an increasing awareness of the need of using proper CoDA methods for microbiome analysis [

Normalization is not required and only zero imputation is needed. Diversity analysis and ordination can be performed after clr or ilr transformations, for instance, Aitchison distance and PCA. CoDA adapted Kernel machine regression can be used for multivariate differential abundance testing. Univariate approaches are not recommended. Penalized multivariate regression, such as sPLS-DA, is an alternative for the identification of the taxa that are most associated with the outcome. The algorithm selbal for the selection of microbial signatures is also an alternative to univariate selection of taxa when the main interest is prediction.

Even so, more research is still needed to fully understand the performance and limitations of the current available CoDA methods for microbiome analysis that will probably lead to their improvement or to the proposal of new approaches.

No potential conflict of interest relevant to this article was reported.

This work was partially supported by the Ministerio de Economía y Competitividad, Spain, reference MTM2015-64465-C2-1-R.

Main steps of a microbiome study: (1) microbial DNA extraction and sequencing, (2) bioinformatics sequence processing, and (3) statistical analysis.