Central Compositional Subspace Analysis

Updated 9 September 2025

Central compositional subspace is a mathematically defined, minimal subspace that captures all information about a response variable in compositional data while respecting the simplex constraints.
It enables direct and interpretable dimension reduction by using column-stochastic reduction matrices, avoiding the distortions introduced by log-ratio transforms and ad hoc zero handling.
Estimation via CKDR produces a sparse and consistent estimate of this subspace, and dual visualization with ternary plots reveals underlying group patterns in fields like microbiomics.

A central compositional subspace is an identifiable subspace that captures all the information about a response variable contained in high-dimensional compositional data while respecting the geometry of the simplex. Within the framework of interpretable dimension reduction for compositional data, where each data point is a vector in the unit simplex, it is defined as the intersection of all compositional sufficient dimension reduction (CSDR) subspaces. Each CSDR subspace is the row space of a column-stochastic reduction matrix that renders the response conditionally independent of the original composition given the low-dimensional aggregated representation. The resulting subspace underpins a direct, interpretable, and statistically principled approach to dimension reduction that accommodates both the zeros on the simplex boundary and the dependence structure of compositional data.

1. Motivation and Background

Dimension reduction of high-dimensional compositional data is complicated by the simplex constraint: the components of each data point are nonnegative and sum to one. In standard (Euclidean) sufficient dimension reduction (SDR), one seeks a matrix $B$ such that $Y \perp\!\!\!\perp X \mid BX$, where typically $X \in \mathbb{R}^d$ and $B$ is unconstrained. Applying traditional SDR directly to compositional data $X \in \Delta^{d-1}$ is ill-posed, however: because of the closure (unit-sum) property of the simplex, the intersection of all SDR subspaces reduces to the zero subspace, so the classical central subspace is trivial.

To address this, a compositional SDR framework imposes compositionality directly on the reduction mechanism by requiring reduction matrices to be column-stochastic, aligning all reductions and their resulting subspaces within the simplex structure. This paradigm shift enables a meaningful, interpretable, and nontrivial definition of the "central compositional subspace."

2. Mathematical Definition and Properties

Compositional Sufficient Dimension Reduction (CSDR)

Let $X$ be a $d$-dimensional composition ($X_j \geq 0$, $\sum_{j=1}^d X_j = 1$), and let $Y$ denote a response variable. A CSDR reduction is defined via a matrix $P$ in the set of $m \times d$ column-stochastic matrices,

$\mathcal{M}_{m,d} = \left\{ P \in [0,1]^{m \times d} : \sum_{i=1}^m p_{ij} = 1 \;\; \forall j\in \{1,\ldots,d\} \right\},$

such that

$Y \perp\!\!\!\perp X \mid PX,$

where $PX \in \Delta^{m-1}$ is a low-dimensional composition resulting from a "soft amalgamation" of the original variables.
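As a small numerical illustration of the definition above, the following sketch builds a column-stochastic matrix and checks that the reduced vector is again a composition. The dimensions, the random matrix, the example composition, and the `groups` assignment are all made up for illustration; none of them come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 3  # illustrative sizes: 6 original parts reduced to 3

# Random column-stochastic P: each column is a point in the (m-1)-simplex.
P = rng.dirichlet(np.ones(m), size=d).T      # shape (m, d), columns sum to 1

# A composition with exact zeros; no zero replacement is needed.
X = np.array([0.5, 0.0, 0.2, 0.0, 0.3, 0.0])

Z = P @ X                                    # soft amalgamation, lies in the simplex

# Hard amalgamation is the special case where every column of P is one-hot.
groups = np.array([0, 0, 1, 1, 2, 2])        # hypothetical grouping of the parts
P_hard = np.eye(m)[:, groups]                # shape (m, d)
Z_hard = P_hard @ X                          # sums the parts within each group

print(Z, Z.sum())
print(Z_hard)
```

Because every column of `P` sums to one and `X` sums to one, the entries of `P @ X` are nonnegative and sum to one, which is why the reduction stays inside the simplex.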

Central Compositional Subspace

The central compositional subspace $\mathcal{C}_{Y|X}$ is defined as

$\mathcal{C}_{Y|X} = \bigcap \left\{ \operatorname{row}(P) : P \in \mathcal{M}_{m,d} \text{ such that } Y \perp\!\!\!\perp X \mid PX \right\}$

This subspace is minimal in the sense that every compositional reduction sufficient for $Y$ must have a row space containing $\mathcal{C}_{Y|X}$. It is the unique target for interpretable reduction of the simplex-valued $X$ with respect to predicting $Y$.

By construction, the mapping $X \mapsto PX$ preserves compositionality, keeps the reduction interpretable (each new variable is a weighted amalgamation of the original parts), and enables post-reduction graphical analysis using simplex geometry (e.g., ternary plots when $m=3$).

3. Estimation via Compositional Kernel Dimension Reduction (CKDR)

To estimate the central compositional subspace from data, the compositional kernel dimension reduction (CKDR) method is introduced. The method minimizes a loss that measures how far $Y$ is from being conditionally independent of $X$ given $PX$, using reproducing kernel Hilbert space (RKHS) conditional covariance operators.

The objective function is

$T(P) = \operatorname{Tr} \left( \Sigma_{YY|PX} \right)$

where $\Sigma_{YY|PX}$ is the conditional covariance operator of $Y$ given the reduced composition $PX$.

In practice, empirical estimation uses centered Gram matrices $G_{PX}$ and $G_Y$ and a ridge-regularized objective:

$\underset{P \in \mathcal{M}_{m,d}}{\mathrm{minimize}} \; \operatorname{Tr}\left[ (G_{PX} + n\epsilon_n I)^{-1} G_Y \right]$

with regularization parameter $\epsilon_n > 0$. The problem is solved by projected gradient descent, with each column of $P$ projected onto the simplex.
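The two ingredients of this optimization, evaluating the empirical objective and projecting the columns of $P$ onto the simplex, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian kernel, the bandwidths `sigma_x` and `sigma_y`, the synthetic data, and all function names are assumptions, and the gradient computation is left out. The projection uses the standard sort-based Euclidean projection onto the probability simplex.

```python
import numpy as np

def centered_gram(Z, sigma):
    """Centered Gaussian-kernel Gram matrix for the rows of Z (n x p).
    The Gaussian kernel is an assumption; the source does not fix a kernel."""
    sq = np.sum(Z**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    K = np.exp(-D2 / (2.0 * sigma**2))
    n = Z.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return H @ K @ H

def ckdr_objective(P, X, Y, eps, sigma_x=1.0, sigma_y=1.0):
    """Tr[(G_PX + n*eps*I)^{-1} G_Y] for samples X (n x d) and Y (n x q)."""
    n = X.shape[0]
    G_px = centered_gram(X @ P.T, sigma_x)       # rows of X @ P.T are the reduced samples
    G_y = centered_gram(Y, sigma_y)
    return np.trace(np.linalg.solve(G_px + n * eps * np.eye(n), G_y))

def project_columns_to_simplex(P):
    """Euclidean projection of each column of P onto the probability simplex."""
    m, d = P.shape
    U = -np.sort(-P, axis=0)                     # each column sorted in descending order
    css = np.cumsum(U, axis=0) - 1.0
    k = np.arange(1, m + 1)[:, None]
    cond = U - css / k > 0
    rho = m - 1 - np.argmax(cond[::-1], axis=0)  # last row index where cond holds
    tau = css[rho, np.arange(d)] / (rho + 1)
    return np.maximum(P - tau, 0.0)

# Synthetic example: Y depends on the amalgamation of the first two parts.
rng = np.random.default_rng(0)
n, d, m = 40, 5, 3
X = rng.dirichlet(np.ones(d), size=n)
Y = (X[:, 0] + X[:, 1])[:, None] + 0.01 * rng.standard_normal((n, 1))
P0 = project_columns_to_simplex(rng.standard_normal((m, d)))  # feasible starting point
value = ckdr_objective(P0, X, Y, eps=1e-2)
print(value)
```

A full solver would alternate a gradient step on `ckdr_objective` with a call to `project_columns_to_simplex`, so that every iterate remains column-stochastic.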

CKDR thus yields an estimator $\hat{P}_n$ whose row space approximates $\mathcal{C}_{Y|X}$ . The framework is explicitly designed to accommodate zeros and avoids distortions from log-ratio transforms or ad hoc zero handling.

4. Theoretical Guarantees: Consistency and Sparsity

The estimator $\hat{P}_n$ is shown to be consistent for $\mathcal{C}_{Y|X}$ under standard regularity conditions and suitable decay of $\epsilon_n$ (e.g., $\epsilon_n \to 0$ with $n^{1/2}\epsilon_n \to \infty$ as $n\to\infty$). The convergence of the estimated subspace can be quantified via the chordal distance metric:

$\rho^2(V, W) = \frac{\|\Pi_V - \Pi_W\|_F^2 - |\dim(V) - \dim(W)|}{2 \min \{\dim(V), \dim(W)\}}$

where $\Pi_V$ and $\Pi_W$ are projection matrices onto $V$ and $W$, respectively.
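The distance formula can be evaluated directly from two matrices whose row spaces define the subspaces. The sketch below is one way to do this with NumPy; the function names and the rank tolerance `tol` are illustrative choices, and both inputs are assumed to be nonzero.

```python
import numpy as np

def projector(A, tol=1e-10):
    """Orthogonal projection onto the row space of A, plus its dimension."""
    _, s, Vt = np.linalg.svd(np.atleast_2d(A), full_matrices=False)
    V = Vt[s > tol * s.max()]                # orthonormal basis of the row space
    return V.T @ V, V.shape[0]

def rho_squared(A, B):
    """Squared chordal distance between row(A) and row(B), as in the formula above."""
    Pi_A, k_A = projector(A)
    Pi_B, k_B = projector(B)
    num = np.linalg.norm(Pi_A - Pi_B, "fro") ** 2 - abs(k_A - k_B)
    return num / (2 * min(k_A, k_B))

A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # span{e1, e2}
B = np.array([[0.0, 0.0, 1.0]])                    # span{e3}
C = np.array([[1.0, 0.0, 0.0]])                    # span{e1}

print(rho_squared(A, A))   # identical subspaces
print(rho_squared(B, C))   # orthogonal lines
print(rho_squared(A, C))   # nested subspaces of different dimension
```

Identical subspaces give 0, orthogonal subspaces of equal dimension give 1, and the dimension-correction term makes nested subspaces such as `A` and `C` evaluate to 0 as well.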

Due to the geometry of $\mathcal{M}_{m,d}$, the estimated reduction $\hat{P}_n$ is typically sparse: most columns place nearly all of their mass on a single row or a small number of rows, revealing latent groupings or patterns among the original variables. This inherent sparsity often exposes meaningful amalgamations without requiring explicit sparsity penalties.
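To make the grouping interpretation concrete, the sketch below reads variable groups off a sparse column-stochastic matrix. The matrix `P_hat` and the cutoff `threshold` are invented for illustration and are not results or settings from the source.

```python
import numpy as np

# Hypothetical estimate with m = 3 rows and d = 5 columns (made-up values).
P_hat = np.array([
    [0.97, 0.02, 0.00, 0.55, 0.01],
    [0.03, 0.98, 0.01, 0.45, 0.00],
    [0.00, 0.00, 0.99, 0.00, 0.99],
])

dominant = P_hat.argmax(axis=0)   # row receiving the most mass, per variable
purity = P_hat.max(axis=0)        # how concentrated each column is
threshold = 0.9                   # illustrative cutoff, not from the source

for row in range(P_hat.shape[0]):
    members = np.where((dominant == row) & (purity >= threshold))[0]
    print(f"amalgamation {row}: variables {members.tolist()}")

mixed = np.where(purity < threshold)[0]
print("split across rows:", mixed.tolist())
```

Columns that are nearly one-hot behave like hard group memberships, while a column such as the fourth one here is shared between two amalgamations.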

5. Visual and Interpretative Implications

A salient advantage of compositional SDR via central compositional subspaces is dual interpretability. For reductions to dimension $m=3$, the projection $PX$ lives in a two-dimensional simplex and is naturally visualized via ternary plots, facilitating clear geometric discrimination among groups (e.g., case vs. control in biomedical data).

Simultaneously, each column of the reduction matrix $P$ is itself a composition and, when $m=3$, can be displayed on a ternary plot (“variable allocation plot”). This plot reveals which original variables contribute most to each low-dimensional amalgamation, allowing direct substantive interpretation.
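Both plots rely on the same step: mapping a three-part composition to a point inside a triangle. The sketch below computes those coordinates for the reduced samples and for the columns of $P$. The data are random placeholders, `ternary_xy` and `VERTICES` are illustrative names, and the drawing itself (for example, a scatter plot of the resulting coordinates) is omitted.

```python
import numpy as np

# Corners of an equilateral triangle, one per part of a 3-part composition.
VERTICES = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

def ternary_xy(C):
    """Map rows of C (n x 3, each summing to 1) to 2-D plotting coordinates."""
    C = np.asarray(C, dtype=float)
    return C @ VERTICES                      # barycentric combination of the corners

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(3), size=8).T      # placeholder (3, 8) column-stochastic matrix
X = rng.dirichlet(np.ones(8), size=20)       # placeholder data: 20 samples, 8 parts

sample_xy = ternary_xy(X @ P.T)              # one point per sample (data plot)
variable_xy = ternary_xy(P.T)                # one point per variable (allocation plot)
print(sample_xy.shape, variable_xy.shape)
```

A variable plotted near a corner is allocated almost entirely to that amalgamation, and a sample plotted near the same corner is dominated by the variables allocated there.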

This dual visualization approach eases the understanding of both the reduced data structure and the meaning of the compression itself—enabling direct graphical exploration of complex, high-dimensional compositional patterns without relying on axis-rotated or transformed data.

6. Applications and Practical Relevance

The central compositional subspace framework, with estimation via CKDR, is particularly well suited to high-dimensional compositional data in domains such as human microbiome research, geochemistry, ecology, and genomics, where interpretability and adherence to the simplex constraint are paramount.

For example, in analyses of pediatric Crohn’s disease ileum microbiome data, CKDR-based ternary plots of projected samples distinguished disease from healthy groups, and variable allocation plots linked certain clusters of microbial taxa to disease. In vaginal microbiome studies predicting Nugent score, similar approaches revealed which taxa are overrepresented in different diagnostic groups.

The methodology yields interpretable, sparse compressions that directly identify meaningful compositions underlying biological phenomena, providing both graphical and statistical clarity.

7. Comparison with Classical and Contemporary Approaches

Classical dimension reduction techniques, including PCA applied to log-ratio transformed data, typically do not preserve compositional constraints. Log-ratio transforms are undefined when a component is zero, so these techniques can distort the data and become ill-posed at the boundary of the simplex, often requiring ad hoc zero imputation procedures that compromise interpretability.

The central compositional subspace approach circumvents these problems by:

Avoiding extra transformations (operating directly in the simplex).
Utilizing reduction mappings that are column-stochastic, maintaining the compositional geometry.
Delivering an identifiable, minimal, and interpretable subspace.
Yielding estimators with sparsity that directly reveal underlying amalgamations, without external penalization.

A plausible implication is that future work in compositional data analysis and multi-view learning may benefit by adopting compositional subspace methodologies, both for interpretability and for robust handling of zeros and sparsity.

Summary Table: Central Compositional Subspace Specification

Aspect	Classical SDR	Compositional SDR with Central Subspace
Reduction Matrix	Unconstrained linear	Column-stochastic (simplex-respecting)
Existence of Central Subspace	Possible	Nontrivial only under compositional SDR
Interpretability	Difficult	Direct (compositional, sparse)
Zero Handling	Problematic	Naturally accommodated
Visualization	Indirect	Dual ternary plots (data & allocation)

The central compositional subspace thus formalizes a theoretically robust, geometrically sound, and interpretability-driven paradigm for dimension reduction in compositional data, enabling both rigorous statistical inference and intuitive analysis of high-dimensional problems where compositionality is fundamental (Park et al., 6 Sep 2025).

References (1)

Interpretable dimension reduction for compositional data (2025)