Compositionality in 16S Data Explained

In response to

"I keep reading that I have to remember that my 16S data is compositional when I analyze it, but what does that mean? What do I need to do?"

The Fixed-Sum Constraint

16S sequencing doesn’t count bacteria; it counts reads. The sequencer produces a library of fixed size (e.g. 100,000 reads per sample), and each taxon’s “abundance” is its share of that library. The sum of all relative abundances is always exactly 1. This is the compositional constraint.

The data lives on a mathematical object called the simplex: for D taxa, a (D−1)-dimensional space where all values are positive and sum to 1. On a simplex, you cannot freely vary one component without the others reacting. If Taxon A doubles its share, the combined share of the other taxa must decrease, even if those taxa haven’t changed at all in absolute terms.

Look at the demo below. Taxon A’s absolute cell count is controlled by the slider. Taxa B, C, and D have fixed absolute abundances throughout, so they are not changing. The top bar shows what is actually in the sample; the bottom bar shows what 16S sequencing reports.


This is not a bug in the sequencing protocol or data processing. It is an intrinsic property of count data that has been normalized to a constant sum: every amplicon sequencing dataset is subject to this constraint.
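
The same effect can be reproduced with plain arithmetic. The sketch below is only an illustration, not the code behind the demo: the cell counts and the helper name `to_proportions` are made up for the example.

```python
def to_proportions(counts):
    """Close a dict of absolute counts to relative abundances that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical absolute cell counts; only Taxon A differs between the two samples.
before = {"A": 1000, "B": 500, "C": 300, "D": 200}
after = dict(before, A=4000)  # A quadruples; B, C, and D are untouched

rel_before = to_proportions(before)
rel_after = to_proportions(after)

for taxon in before:
    print(f"{taxon}: absolute {before[taxon]:>5} -> {after[taxon]:>5}   "
          f"relative {rel_before[taxon]:.3f} -> {rel_after[taxon]:.3f}")
```

Taxon B’s share drops from 0.25 to 0.10 even though its cell count is 500 in both samples.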

Why Standard Statistics Fail

The fixed-sum constraint creates two systematic failure modes for standard statistical methods.

Spurious correlations. Say we have 20 hypothetical samples containing three bacterial taxa. Taxon A varies dramatically in absolute abundance across samples. Taxa B and C are approximately constant at 200 and 300 cells respectively: they have no biological relationship to Taxon A or to each other.

Each point is one sample. Left: absolute abundances (what is biologically true). Right: relative abundances (what 16S reports). Taxa B and C are genuinely independent, but appear strongly correlated in relative space.

In relative space, Taxon B appears negatively correlated with Taxon A (r ≈ −1), and Taxa B and C appear positively correlated (r ≈ +1). Neither correlation reflects any biological interaction. What we’re seeing here is the “compositional correlation bias” described by Aitchison (1982): an inevitable mathematical consequence of the fixed-sum constraint, not a result of experimental noise.

When a high-abundance taxon increases, it suppresses the apparent abundance of everything else simultaneously, causing these spurious negative correlations between the driver and all other taxa (and spurious positive correlations among the displaced taxa).
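
A small simulation along the lines of the 20-sample example makes this concrete. It is a sketch under assumed numbers, not the data behind the figure: the range for Taxon A, the noise level on B and C, and the random seed are all arbitrary choices, and `pearson` is a hand-written helper.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 20

# Absolute abundances: A swings widely, B and C only jitter around 200 and 300.
abs_a = [rng.uniform(50, 5000) for _ in range(n_samples)]
abs_b = [200 + rng.gauss(0, 5) for _ in range(n_samples)]
abs_c = [300 + rng.gauss(0, 5) for _ in range(n_samples)]

# Relative abundances: each sample is divided by its own total.
totals = [a + b + c for a, b, c in zip(abs_a, abs_b, abs_c)]
rel_a = [a / t for a, t in zip(abs_a, totals)]
rel_b = [b / t for b, t in zip(abs_b, totals)]
rel_c = [c / t for c, t in zip(abs_c, totals)]

print(f"absolute  r(B, C) = {pearson(abs_b, abs_c):+.2f}")
print(f"relative  r(B, C) = {pearson(rel_b, rel_c):+.2f}")
print(f"relative  r(A, B) = {pearson(rel_a, rel_b):+.2f}")
```

The two relative-space correlations should come out close to +1 and −1, while the absolute-space correlation between B and C reflects nothing more than sampling noise.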

Differential abundance artifacts. A t-test or Wilcoxon test on proportions asks “is this taxon’s relative abundance different between groups?” That question can return a significant result when the taxon itself hasn’t changed at all, because a different taxon increased in one group and pushed this one’s relative abundance down. Conversely, a taxon that genuinely increased in abundance can appear unchanged or even decreased if it grew less than the community’s total abundance did. Standard tests have no way to distinguish these cases, which is why tools designed specifically for compositional data are needed.
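
Both failure cases can be seen in a two-sample toy comparison. This is a deliberately simplified sketch with one hypothetical sample per group and invented counts. It compares fold changes rather than running a statistical test, and `proportion` is a helper defined only for this example.

```python
def proportion(counts, taxon):
    """Share of one taxon in a sample given as a dict of absolute counts."""
    return counts[taxon] / sum(counts.values())

# Hypothetical absolute counts for one representative sample per group.
control = {"A": 1000, "B": 400, "C": 100}
treated = {"A": 5000, "B": 400, "C": 200}  # A blooms, B is unchanged, C doubles

for taxon in ("B", "C"):
    fold_absolute = treated[taxon] / control[taxon]
    fold_relative = proportion(treated, taxon) / proportion(control, taxon)
    print(f"Taxon {taxon}: absolute fold change {fold_absolute:.2f}, "
          f"relative fold change {fold_relative:.2f}")
```

Taxon B did not change, yet its relative abundance falls to roughly 27% of the control value. Taxon C doubled in absolute terms, yet its relative abundance roughly halves.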

The Log-Ratio Solution

Aitchison (1982) showed that the appropriate way to analyze compositional data is through log-ratios. Log-ratios remove the fixed-sum constraint by expressing each component relative to others in log space, lifting the data off the simplex into unconstrained real space where standard geometric and statistical operations apply.

Three common transforms are in use:

ALR (Additive Log-Ratio): log(xᵢ / xᵣ), where xᵣ is a chosen reference taxon. It’s simple, but results depend on the reference choice, which is ultimately arbitrary.
CLR (Centered Log-Ratio): log(xᵢ / g(x)), where g(x) is the geometric mean across all taxa. No arbitrary reference is needed, so this is the standard choice for ordination and correlation analysis.
ILR (Isometric Log-Ratio): projects the composition onto D−1 orthonormal coordinates built from log-ratios between groups of taxa. Because the coordinates are orthonormal, it is statistically ideal for hypothesis testing. The individual coordinates have no direct biological interpretation, but the space is the correct one for formal inference.
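
The three transforms can be written out in a few lines each. The sketch below uses only the standard library and is meant as an illustration, not a reference implementation. The ILR version uses pivot coordinates, which is one of many valid orthonormal bases and is an assumption of this example. The two compositions are invented.

```python
import math

def alr(x, ref=-1):
    """Additive log-ratio: log of each part over a reference part (default: last)."""
    log_ref = math.log(x[ref])
    ref_index = ref % len(x)
    return [math.log(v) - log_ref for i, v in enumerate(x) if i != ref_index]

def clr(x):
    """Centered log-ratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [lv - mean_log for lv in logs]

def ilr(x):
    """Isometric log-ratio in pivot coordinates (one possible orthonormal basis)."""
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]              # parts that come after position i
        k = len(rest)
        scale = math.sqrt(k / (k + 1))   # normalizes each coordinate to unit length
        coords.append(scale * (logs[i] - sum(rest) / k))
    return coords

def euclidean(u, v):
    """Ordinary Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two hypothetical compositions (proportions of taxa A, B, C, D).
x = [0.50, 0.25, 0.15, 0.10]
y = [0.10, 0.40, 0.30, 0.20]

print("ALR:", [round(v, 3) for v in alr(x)])
print("CLR:", [round(v, 3) for v in clr(x)])
print("ILR:", [round(v, 3) for v in ilr(x)])
print("distance in CLR space:", round(euclidean(clr(x), clr(y)), 6))
print("distance in ILR space:", round(euclidean(ilr(x), ilr(y)), 6))
```

ALR and ILR each return D−1 values, while CLR returns D values that sum to zero. The last two printed distances agree, which is what “isometric” means here: ILR coordinates preserve the distances of CLR space.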

The interactive below applies all five coordinate systems to the same 20 samples. Dots are colored by Taxon A’s proportion (blue = low, red = high), making it visible which taxon is structuring each pattern.

The same 20 samples plotted in five different coordinate systems. The dot color tracks Taxon A's proportion across samples. Toggle the tabs to see how each transform handles the B–C relationship.

The same 20 samples (sorted low-A to high-A). Left: as proportions, B and C are spuriously correlated — both are suppressed whenever A dominates. Right: the log-ratio log(B/C) is flat across all samples because B and C haven't changed in absolute terms. The dashed line marks the mean. Correlation with A is near zero.

The right panel shows the essential logic of log-ratio analysis: instead of asking “what fraction is taxon B of the total?” it asks “how does B compare to C?” — a ratio that is not distorted by changes in A. The Aitchison distance (Euclidean distance in CLR-transformed space) applies this principle across all taxa simultaneously, and Aitchison PCA (PCA on CLR-transformed data) is the compositionally appropriate alternative to PCA on proportions.
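
Here is a minimal sketch of that logic, assuming three made-up samples in which only Taxon A changes. The function name `aitchison_distance` is just a label for this example. It computes exactly what the paragraph above describes: Euclidean distance after a CLR transform.

```python
import math

def clr(x):
    """Centered log-ratio transform of a vector of positive values."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [lv - mean_log for lv in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the CLR-transformed vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

# Hypothetical absolute counts (A, B, C): A varies, B and C stay at 200 and 300.
samples = [(100, 200, 300), (1000, 200, 300), (5000, 200, 300)]

for a, b, c in samples:
    total = a + b + c
    rel_b, rel_c = b / total, c / total
    print(f"A={a:>4}  rel_B={rel_b:.3f}  rel_C={rel_c:.3f}  "
          f"log(B/C)={math.log(rel_b / rel_c):+.4f}")

# The distance depends only on ratios, so counts and proportions give the same answer.
s1, s2 = samples[0], samples[1]
p1 = [v / sum(s1) for v in s1]
p2 = [v / sum(s2) for v in s2]
print("Aitchison distance (counts):     ", round(aitchison_distance(s1, s2), 6))
print("Aitchison distance (proportions):", round(aitchison_distance(p1, p2), 6))
```

The proportions of B and C shrink as A grows, but log(B/C) stays at log(200/300) in every row.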

Note that CLR, like every log-ratio transform, requires no zeros in the data, since log(0) is undefined. In practice, zero replacement is applied before transformation: common approaches include adding a pseudocount of 0.5 to all counts, or using multiplicative replacement (Martín-Fernández et al.), which preserves the ratios between non-zero components. The choice of replacement strategy has a modest but real effect on results, particularly in sparse datasets. Honestly, it is worth reading up on this before you commit to one.
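
The difference between the two strategies is easiest to see on a single sample. The sketch below is a simplified illustration: the read counts are invented, and the imputed proportion `delta` is an arbitrary value picked for the example, not a recommended setting.

```python
def add_pseudocount(counts, pseudocount=0.5):
    """Add a constant to every count, then close to proportions."""
    shifted = [c + pseudocount for c in counts]
    total = sum(shifted)
    return [s / total for s in shifted]

def multiplicative_replacement(counts, delta=1e-3):
    """Set zero parts to delta and shrink non-zero parts so the total stays 1.

    delta is a hypothetical imputed proportion chosen for illustration.
    """
    total = sum(counts)
    props = [c / total for c in counts]
    n_zeros = sum(1 for p in props if p == 0)
    shrink = 1 - n_zeros * delta  # mass left for the non-zero parts
    return [delta if p == 0 else p * shrink for p in props]

counts = [120, 0, 45, 0, 835]  # hypothetical read counts with two zeros

pseudo = add_pseudocount(counts)
mult = multiplicative_replacement(counts)

# Ratio between two non-zero taxa, before and after replacement.
print("original ratio 120/45:      ", round(120 / 45, 4))
print("after pseudocount:          ", round(pseudo[0] / pseudo[2], 4))
print("after multiplicative repl.: ", round(mult[0] / mult[2], 4))
```

The pseudocount shifts the ratio between the two non-zero taxa slightly, while multiplicative replacement leaves it intact. Both outputs still sum to 1 and contain no zeros, so a log-ratio transform can be applied afterwards.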

Bray-Curtis dissimilarity, widely used for ordination in community ecology, is not a log-ratio method, but it is empirically robust for beta diversity in most microbiome applications. It is a reasonable default for UniFrac-style and ordination analyses even though it doesn't fully address compositionality.

Choosing the Right Tools

Choosing the Right Tests

References

Aitchison J. The statistical analysis of compositional data. Journal of the Royal Statistical Society, Series B. 1982;44(2):139–177.
Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Frontiers in Microbiology. 2017;8:2224. doi:10.3389/fmicb.2017.02224
Mandal S, Van Treuren W, White RA, et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease. 2015;26:27663. doi:10.3402/mehd.v26.27663
Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2014;2:15. doi:10.1186/2049-2618-2-15
Lin H, Peddada SD. Analysis of compositions of microbiomes with bias correction. Nature Communications. 2020;11:3514. doi:10.1038/s41467-020-17041-7