Central Compositional Subspace Analysis
- The central compositional subspace is a mathematically defined, minimal subspace that captures all information about a response variable in compositional data while respecting the simplex constraints.
- It enables a direct and interpretable dimension reduction by using column-stochastic reduction matrices, avoiding data distortions from traditional SDR methods.
- Estimation via CKDR produces a sparse and consistent subspace, facilitating dual visualization with ternary plots to reveal underlying group patterns in fields like microbiomics.
A central compositional subspace is a mathematically defined, identifiable subspace that captures all the information about a response variable encoded in high-dimensional compositional data while respecting the simplex geometry imposed by compositionality. Within the framework of interpretable dimension reduction for compositional data (where each high-dimensional data point is a vector lying in the unit simplex), the central compositional subspace is defined as the intersection of all compositional sufficient dimension reduction (CSDR) subspaces. Each CSDR subspace is the row space of a column-stochastic reduction matrix that renders the response conditionally independent of the original composition given the low-dimensional aggregated representation. This subspace underpins a methodology for direct, interpretable, and statistically principled dimension reduction designed to accommodate the zero boundaries and dependency structure of compositional data.
1. Motivation and Background
Dimension reduction of high-dimensional compositional data is fundamentally complicated by the simplex constraint: the components of each data point are nonnegative and sum to one. In standard (Euclidean) sufficient dimension reduction (SDR), with predictor $X \in \mathbb{R}^p$ and response $Y$, one seeks a matrix $B \in \mathbb{R}^{p \times d}$ such that $Y \perp\!\!\!\perp X \mid B^\top X$, where typically $d \ll p$ and $B$ is unconstrained. However, direct application of traditional SDR is ill-posed for compositional data: because of the closure property of the simplex, the intersection of all SDR subspaces is trivial. That is, when $Y$ is required to be conditionally independent of $X$ given a linear reduction that respects the compositional constraint, the intersection of the corresponding subspaces reduces to the zero subspace.
To address this, a compositional SDR framework imposes compositionality directly on the reduction mechanism by requiring reduction matrices to be column-stochastic, aligning all reductions and their resulting subspaces within the simplex structure. This paradigm shift enables a meaningful, interpretable, and nontrivial definition of the "central compositional subspace."
2. Mathematical Definition and Properties
Compositional Sufficient Dimension Reduction (CSDR)
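Let $X = (X_1, \dots, X_p)^\top$ be a $p$-dimensional composition ($X_j \ge 0$, $\sum_{j=1}^{p} X_j = 1$), and let $Y$ denote a response variable. A CSDR reduction is defined via a column-stochastic matrix $A \in \mathbb{R}^{m \times p}$ (nonnegative entries, each column summing to one), so that
$$Y \perp\!\!\!\perp X \mid AX,$$
where $Z = AX$ is a low-dimensional composition resulting from a "soft amalgamation" of the original variables.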
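The reduction can be illustrated with a minimal NumPy sketch. The sizes, the Dirichlet-generated data, and the names `A_soft`, `A_hard`, and `groups` are all hypothetical and chosen only for illustration. The sketch shows that multiplying a composition by a column-stochastic matrix yields another composition, and that a classical (hard) amalgamation is the special case in which every column has a single entry equal to one.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 6, 3, 5  # hypothetical sizes: p original parts, m reduced parts, n samples

# n compositions stored as rows; each row is nonnegative and sums to one.
X = rng.dirichlet(np.ones(p), size=n)            # shape (n, p)

# Soft amalgamation: an m x p matrix whose columns are points in the simplex.
A_soft = rng.dirichlet(np.ones(m), size=p).T     # shape (m, p), columns sum to one

# Hard amalgamation: each original part is assigned entirely to one group.
groups = np.array([0, 0, 1, 1, 2, 2])            # hypothetical grouping of the p parts
A_hard = np.zeros((m, p))
A_hard[groups, np.arange(p)] = 1.0

# Reduced compositions Z = A X, computed row-wise for all samples.
Z_soft = X @ A_soft.T                            # shape (n, m)
Z_hard = X @ A_hard.T                            # shape (n, m)
```

In the hard case the first reduced part is simply the sum of the first two original parts; in the soft case each original part distributes its mass across the reduced parts according to its column of the matrix.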
Central Compositional Subspace
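The central compositional subspace is defined as the intersection of the row spaces of all column-stochastic reduction matrices $A$ that are sufficient for the response $Y$ given the composition $X$:
$$\mathcal{S}^{C}_{Y \mid X} = \bigcap \left\{ \operatorname{row}(A) : A \text{ column-stochastic},\ Y \perp\!\!\!\perp X \mid AX \right\}.$$
This subspace is minimal in the sense that every compositional reduction sufficient for $Y$ must have a row space containing $\mathcal{S}^{C}_{Y \mid X}$. It is the unique target for interpretable reduction of the simplex-valued $X$ with respect to predicting $Y$.
A pivotal property of the definition is that the mapping $X \mapsto AX$ preserves compositionality, guarantees interpretability (each new variable is itself a composition of originals), and enables post-reduction graphical analysis using simplex geometry (e.g., ternary plots when the reduced composition has three parts).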
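The compositionality-preserving property can be checked in two lines. The notation is an assumption chosen for illustration: $A \in \mathbb{R}^{m \times p}$ denotes a column-stochastic reduction matrix ($A_{ij} \ge 0$, $\sum_{i=1}^{m} A_{ij} = 1$ for every $j$) and $x$ a composition with $p$ parts.

$$(Ax)_i = \sum_{j=1}^{p} A_{ij}\, x_j \ \ge\ 0, \qquad \sum_{i=1}^{m} (Ax)_i = \sum_{j=1}^{p} x_j \sum_{i=1}^{m} A_{ij} = \sum_{j=1}^{p} x_j = 1 .$$

Hence $Ax$ again lies in a simplex, and each reduced part is a nonnegative weighted sum of the original parts, with the weights of any one original part across the reduced parts summing to one.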
3. Estimation via Compositional Kernel Dimension Reduction (CKDR)
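To estimate the central compositional subspace from data, the compositional kernel dimension reduction (CKDR) method is introduced. This method operates by optimizing a loss that measures the conditional independence between the response $Y$ and the composition $X$ given the reduced composition $AX$ (with $A$ a column-stochastic reduction matrix), using reproducing kernel Hilbert space (RKHS) conditional covariance operators.
The objective function is
$$\min_{A \text{ column-stochastic}} \operatorname{Tr}\left[ \Sigma_{YY \mid AX} \right],$$
where $\Sigma_{YY \mid AX}$ is the conditional covariance operator of $Y$ given the reduced composition $AX$.
In practice, empirical estimation uses centered Gram matrices $G_Y$ and $G_{AX}$, and a ridge-regularized objective:
$$\hat{A} = \arg\min_{A \text{ column-stochastic}} \operatorname{Tr}\left[ G_Y \left( G_{AX} + n \varepsilon_n I_n \right)^{-1} \right],$$
with regularization parameter $\varepsilon_n > 0$, solved by projected gradient descent with simplex projections for each column of $A$.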
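A ridge-regularized trace objective of this kind can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the Gaussian kernel, the bandwidths `bw_z` and `bw_y`, the regularization value `eps`, the function names, and the simulated data are all hypothetical choices, and the regularizer is written as `n * eps` following the usual kernel dimension reduction convention.

```python
import numpy as np

def centered_gram(U, bandwidth):
    """Centered Gaussian Gram matrix H K H for the rows of U (kernel choice is an assumption)."""
    U = np.asarray(U, dtype=float)
    if U.ndim == 1:
        U = U[:, None]
    sq = np.sum(U ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * U @ U.T, 0.0)  # squared distances
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                              # centering matrix
    return H @ K @ H

def ckdr_objective_sketch(A, X, Y, eps=1e-3, bw_z=0.5, bw_y=1.0):
    """Tr[G_Y (G_Z + n*eps*I)^{-1}] with Z = A X computed for each row of X."""
    n = X.shape[0]
    Z = X @ A.T                                   # reduced compositions, shape (n, m)
    G_z = centered_gram(Z, bw_z)
    G_y = centered_gram(Y, bw_y)
    # trace(G_y @ inv(M)) equals trace(solve(M, G_y)) by cyclicity of the trace.
    return np.trace(np.linalg.solve(G_z + n * eps * np.eye(n), G_y))

# Hypothetical data: the response depends on the sum of the first two parts.
rng = np.random.default_rng(0)
n, p = 40, 6
X = rng.dirichlet(np.ones(p), size=n)
Y = X[:, 0] + X[:, 1] + 0.01 * rng.standard_normal(n)

A_trivial = np.ones((1, p))                       # collapses every sample to the constant 1
A_group = np.array([[1, 1, 0, 0, 0, 0],
                    [0, 0, 1, 1, 0, 0],
                    [0, 0, 0, 0, 1, 1]], dtype=float)

val_trivial = ckdr_objective_sketch(A_trivial, X, Y)
val_group = ckdr_objective_sketch(A_group, X, Y)
```

The trivial reduction carries no information, so its centered Gram matrix vanishes and the objective attains its largest possible value, `trace(G_Y) / (n * eps)`; any other column-stochastic matrix gives a value no larger than that, and smaller values indicate a more informative reduction.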
CKDR thus yields an estimator $\hat{A}$ whose row space approximates the central compositional subspace $\mathcal{S}^{C}_{Y \mid X}$. The framework is explicitly designed to accommodate zeros and avoids distortions from log-ratio transforms or ad hoc zero handling.
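The projection step of projected gradient descent can be sketched as below. The source states only that each column of the reduction matrix is projected onto the simplex; the sort-based Euclidean projection used here is a standard algorithm chosen as an assumption, and the function names, step size, and stand-in gradient are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex (sort-based)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                          # entries in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]      # last index that stays positive
    theta = css[rho] / (rho + 1)                  # shift that makes the result sum to one
    return np.maximum(v - theta, 0.0)

def projected_gradient_step(A, grad, lr):
    """One step: move against the gradient, then project every column onto the simplex."""
    A_new = A - lr * grad
    return np.apply_along_axis(project_to_simplex, 0, A_new)

# Hypothetical usage with a random matrix standing in for the objective's gradient.
rng = np.random.default_rng(1)
m, p = 3, 6
A0 = rng.dirichlet(np.ones(m), size=p).T          # column-stochastic start, shape (m, p)
fake_grad = rng.standard_normal((m, p))           # placeholder, not a real gradient
A1 = projected_gradient_step(A0, fake_grad, lr=0.5)
```

Because the projection sets small entries exactly to zero, iterates of such a scheme can land on faces of the simplex, which is in line with the sparsity of the estimated reduction.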
4. Theoretical Guarantees: Consistency and Sparsity
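The estimator $\hat{A}$ is shown to be consistent for the central compositional subspace $\mathcal{S}^{C}_{Y \mid X}$ under standard regularity conditions and suitable decay of the regularization parameter $\varepsilon_n$ (e.g., $\varepsilon_n \to 0$ at an appropriate rate as $n \to \infty$). The convergence of the estimated subspace can be quantified via the chordal distance metric:
$$d\left(\operatorname{row}(\hat{A}),\, \mathcal{S}^{C}_{Y \mid X}\right) = \frac{1}{\sqrt{2}} \left\| P_{\hat{A}} - P \right\|_F,$$
where $P_{\hat{A}}$ and $P$ are projection matrices onto $\operatorname{row}(\hat{A})$ and $\mathcal{S}^{C}_{Y \mid X}$, respectively.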
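A chordal distance between the row spaces of two reduction matrices can be computed as in the following sketch. The function names and the rank tolerance are hypothetical, and the $1/\sqrt{2}$ normalization of the Frobenius norm is one common convention, assumed here rather than taken from the source.

```python
import numpy as np

def row_space_projection(A, tol=1e-10):
    """Orthogonal projection matrix onto the row space of A, via the SVD."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))
    V = Vt[:rank].T                               # orthonormal basis of the row space
    return V @ V.T

def chordal_distance(A_hat, A_ref):
    """Frobenius distance between row-space projections, scaled by 1/sqrt(2) (assumed convention)."""
    P_hat = row_space_projection(A_hat)
    P_ref = row_space_projection(A_ref)
    return np.linalg.norm(P_hat - P_ref, "fro") / np.sqrt(2.0)

# Hypothetical checks: the distance depends only on the row space, not on the matrix itself.
A_ref = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
A_same = A_ref[::-1]                              # same rows in a different order
d_same = chordal_distance(A_same, A_ref)
d_orth = chordal_distance(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
```

Two matrices with the same row space are at distance zero, and two orthogonal one-dimensional subspaces are at distance one under this normalization.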
Due to the geometry of the constraint set (each column of the reduction matrix lies in a simplex), the estimated reduction is typically sparse: most columns have nearly all mass on a single or small number of rows, revealing latent groupings or patterns among the original variables. This inherent sparsity often exposes meaningful amalgamations and does not require imposing explicit sparsity penalties.
5. Visual and Interpretative Implications
A salient advantage of compositional SDR via central compositional subspaces is dual interpretability. For reductions to dimension 3, the reduced composition lives in a two-dimensional simplex and is naturally visualized via ternary plots, facilitating clear geometric discrimination among groups (e.g., case vs. control in biomedical data).
Simultaneously, each column of the reduction matrix is itself a composition and can be displayed on a ternary plot (“variable allocation plot”). This plot reveals which original variables most contribute to each low-dimensional amalgamation, allowing direct substantive interpretation.
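Both plots rest on the same coordinate change from three-part compositions to points in a triangle. A minimal sketch of that mapping is given below; the function name, the choice of triangle vertices, and the example matrices are hypothetical, and the resulting coordinates could be passed to any scatter-plot routine.

```python
import numpy as np

def ternary_to_cartesian(C):
    """Map rows of three-part compositions to 2-D points inside an equilateral triangle."""
    C = np.asarray(C, dtype=float)
    C = C / C.sum(axis=1, keepdims=True)          # re-close in case of rounding
    vertices = np.array([[0.0, 0.0],              # corner of part 1
                         [1.0, 0.0],              # corner of part 2
                         [0.5, np.sqrt(3) / 2]])  # corner of part 3
    return C @ vertices                           # barycentric combination of the corners

# Hypothetical reduced data (n x 3) and a hypothetical 3 x p reduction matrix.
rng = np.random.default_rng(2)
Z = rng.dirichlet(np.ones(3), size=10)            # reduced samples
A = rng.dirichlet(np.ones(3), size=8).T           # columns are compositions of length 3

sample_points = ternary_to_cartesian(Z)           # coordinates for the data plot
allocation_points = ternary_to_cartesian(A.T)     # one point per original variable
```

A variable whose column puts all its mass on one reduced part lands exactly on a corner of the triangle, so sparse reductions show up as points clustered near the corners of the allocation plot.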
This dual visualization approach eases the understanding of both the reduced data structure and the meaning of the compression itself—enabling direct graphical exploration of complex, high-dimensional compositional patterns without relying on axis-rotated or transformed data.
6. Applications and Practical Relevance
The central compositional subspace framework, with estimation via CKDR, is particularly apt for high-dimensional compositional data domains such as human microbiome, geochemistry, ecology, and genomics, where interpretability and adherence to the simplex constraint are paramount.
For example, in analyses of pediatric Crohn’s disease ileum microbiome data, CKDR-based ternary plots of projected samples distinguished disease from healthy groups, and variable allocation plots linked certain clusters of microbial taxa to disease. In vaginal microbiome studies predicting Nugent score, similar approaches revealed which taxa are overrepresented in different diagnostic groups.
The methodology yields interpretable, sparse compressions that directly identify meaningful compositions underlying biological phenomena, providing both graphical and statistical clarity.
7. Comparison with Classical and Contemporary Approaches
Classical dimension reduction techniques—including PCA applied to log-ratio transformed data—typically violate compositional constraints and can both distort data and create ill-posedness at the boundary (zero) points, often requiring ad hoc zero imputation procedures that compromise interpretability.
The central compositional subspace approach circumvents these problems by:
- Avoiding extra transformations (operating directly in the simplex).
- Utilizing reduction mappings that are column-stochastic, maintaining the compositional geometry.
- Delivering an identifiable, minimal, and interpretable subspace.
- Yielding estimators with sparsity that directly reveal underlying amalgamations, without external penalization.
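The contrast in zero handling can be seen in a small sketch. The composition `x` and the reduction matrix `A` are hypothetical, and the centered log-ratio (clr) transform is used here as a representative log-ratio transform.

```python
import numpy as np

# A hypothetical composition with an exact zero in its last part.
x = np.array([0.5, 0.3, 0.2, 0.0])

# Centered log-ratio transform: log(0) = -inf contaminates every coordinate.
with np.errstate(divide="ignore", invalid="ignore"):
    log_x = np.log(x)
    clr = log_x - log_x.mean()

# A column-stochastic reduction applied directly to x needs no zero replacement.
A = np.array([[1.0, 0.0, 0.0, 0.5],
              [0.0, 1.0, 1.0, 0.5]])
z = A @ x                                         # a valid two-part composition
```

Without imputation, none of the clr coordinates is finite, whereas the amalgamated vector `z` remains a well-defined composition.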
A plausible implication is that future work in compositional data analysis and multi-view learning may benefit by adopting compositional subspace methodologies, both for interpretability and for robust handling of zeros and sparsity.
Summary Table: Central Compositional Subspace Specification
Aspect | Classical SDR | Compositional SDR with Central Subspace |
---|---|---|
Reduction Matrix | Unconstrained linear | Column-stochastic (simplex-respecting) |
Central Subspace for Compositional Data | Degenerate (intersection reduces to the zero subspace) | Nontrivial and identifiable |
Interpretability | Difficult | Direct (compositional, sparse) |
Zero Handling | Problematic | Naturally accommodated |
Visualization | Indirect | Dual ternary plots (data & allocation) |
The central compositional subspace thus formalizes a theoretically robust, geometrically sound, and interpretability-driven paradigm for dimension reduction in compositional data, enabling both rigorous statistical inference and intuitive analysis of high-dimensional problems where compositionality is fundamental (Park et al., 6 Sep 2025).