Compositional Sparsity in ML

Updated 18 May 2026

Compositional sparsity is defined by decomposing complex, high-dimensional functions into low-dimensional, simple components that interact over a sparse subset of inputs.
It leverages bounded connectivity, sparse activation mechanisms, and regularization techniques to achieve robust generalization and efficient approximation in neural networks and statistical models.
This principle improves parameter efficiency, interpretability, and statistical recovery in diverse applications such as deep learning, metric learning, compositional data analysis, and generative modeling.

Compositional sparsity refers to a suite of inductive biases, modeling principles, and algorithmic mechanisms that enforce or exploit the principle that high-dimensional functions, statistical models, or neural representations can often be decomposed into relatively simple, reusable components that interact over only a small subset of all available inputs or features. This structure facilitates robust generalization, interpretability, and efficient approximation in domains as varied as deep learning, metric learning, compositional data analysis, generative modeling, and statistical testing.

1. Mathematical Formalizations and Core Principles

Compositional sparsity is grounded in the hypothesis that real-world target functions, model parameters, or interactions are not uniformly dense in all variables, but instead organize as compositions of low-dimensional constituents. Formally, a function $f:\mathbb{R}^d \to \mathbb{R}$ is compositionally sparse if it admits a representation such as

$f(x_1,\dots,x_d) = h\big(g_1(x_{U_1}),\ldots,g_m(x_{U_m})\big)$

where each $U_i \subseteq \{1,\dots,d\}$ with $|U_i| \ll d$, and $g_i$ and $h$ are typically simple (e.g., smooth or low-degree polynomial) functions (Lin et al., 14 May 2026, Danhofer et al., 3 Jul 2025, Dahmen, 2022). In neural architectures, this is reflected in bounded-degree DAGs where each neuron (or computational unit) receives input from $O(1)$ predecessors, independently of the total width or depth (Galanti et al., 2023).
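As a concrete instance of the representation above, the following minimal sketch evaluates $f(x) = h(g_1(x_{U_1}),\ldots,g_m(x_{U_m}))$ for $d = 8$ inputs. The helper name `compositional_f`, the choice of pairwise products for the $g_i$, and the sum for $h$ are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def compositional_f(x, subsets, constituents, outer):
    """Evaluate f(x) = h(g_1(x_{U_1}), ..., g_m(x_{U_m}))."""
    # Each constituent g_i only sees the coordinates indexed by U_i.
    inner = [g(x[list(U)]) for g, U in zip(constituents, subsets)]
    return outer(np.array(inner))

d = 8
subsets = [(0, 1), (2, 3), (4, 5), (6, 7)]   # index sets U_i, each of size 2 << d
constituents = [lambda z: z[0] * z[1]] * 4   # hypothetical low-degree polynomial g_i
outer = lambda y: float(np.sum(y))           # hypothetical simple aggregator h

x = np.arange(1.0, d + 1.0)                  # x = [1, 2, ..., 8]
value = compositional_f(x, subsets, constituents, outer)
max_arity = max(len(U) for U in subsets)     # largest |U_i|
```

Here every constituent depends on two coordinates regardless of how large $d$ becomes, which is the bounded-arity property the definition asks for.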

In statistical settings, compositional sparsity may refer to covariance or precision matrices with sparse, low-rank, or block structures, or to regression/loadings matrices with many zeros, supporting the idea that only a small subset of covariates or directions drive the outcome (Zhang et al., 2023, Zhang et al., 2021, Mishra et al., 2019). In generative modeling and hierarchical architectures, it manifests as AND-OR grammars with sparse activations at each level (Xing et al., 2019, Tubiana et al., 2016).

2. Architectures and Algorithmic Mechanisms

Neural networks with compositional sparsity are typically constructed by:

Designing connectivity graphs (DAGs) with bounded in-degree or fixed local receptive fields, as in CNNs, sparse MLPs, or architectures derived from estimated dependency structures (e.g., Information Filtering Networks feeding Homological Neural Networks) (Lin et al., 14 May 2026, Galanti et al., 2023).
Imposing attention mechanisms or gating structures that select only a few objects, slots, or features at each module invocation (Spies et al., 2022, Spilsbury et al., 2022, Chen et al., 25 Nov 2025).
Implementing sparsity-inducing operations such as top-$k$ selection, sparsemax, Gumbel-softmax, or (non-)convex regularization penalties ($\ell_1$, entropy, group-lasso) at the level of inputs, parameters, activations, or weight matrices (Spies et al., 2022, Galanti et al., 2023, Shi et al., 2014, Xing et al., 2019).
In the context of attention or Transformer-based models, recasting attention as a layer of sparse coding with explicit coefficient sparsity and context transfer mechanisms (Chen et al., 25 Nov 2025).
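Two of the sparsity-inducing operations listed above, top-$k$ selection and sparsemax, can be sketched in a few lines of NumPy. This is a minimal illustration of the generic operators, assuming a single 1-D score vector; it is not the implementation used in any of the cited architectures, and the function names are chosen here for illustration.

```python
import numpy as np

def top_k(a, k):
    """Keep the k largest entries of a 1-D array and zero out the rest."""
    a = np.asarray(a, dtype=float)
    out = np.zeros_like(a)
    idx = np.argpartition(a, -k)[-k:]   # indices of the k largest values
    out[idx] = a[idx]
    return out

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.

    Unlike softmax, the result can contain exact zeros.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum         # True on a prefix of the sorted scores
    k_max = k[support][-1]                      # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max       # threshold
    return np.maximum(z - tau, 0.0)

kept = top_k([0.1, 0.9, 0.3, 0.7], k=2)
probs = sparsemax([2.0, 1.0, 0.1])
```

Both operators return outputs with exact zeros, so a downstream module only receives the few selected features or slots.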

Table: Mechanisms used to promote compositional sparsity.

Domain	Mechanism	Canonical Example
Neural Networks	Bounded in-degree, local receptive fields	CNN, HNN, DAG-Nets
Attention	Sparsemax, Gumbel-softmax, sparse coding	Slot/PrediNet, SCoT
Regression/Loss	$\ell_1$ norm, group-lasso, entropy	SCML, RobRegCC, PCA
Generative Models	Top-$k$ activation, AND-OR grammar	Hierarchical Gen. Net
Statistics	Covariance/precision sparsity, block structure	CARE, composable sparsity algebra
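The bounded in-degree mechanism from the first table row can be sketched as a masked linear layer in which every output unit reads from a fixed, small number of inputs. The wiring below is drawn at random purely for illustration; architectures such as HNNs derive it from an estimated dependency structure instead. The function names are hypothetical.

```python
import numpy as np

def bounded_in_degree_mask(n_in, n_out, in_degree, rng):
    """Binary mask in which each output unit is wired to exactly `in_degree` inputs."""
    mask = np.zeros((n_out, n_in))
    for j in range(n_out):
        # Assumption: parents are sampled at random, not inferred from data.
        parents = rng.choice(n_in, size=in_degree, replace=False)
        mask[j, parents] = 1.0
    return mask

def sparse_layer(x, W, mask):
    """ReLU layer whose effective weights are W restricted to the mask."""
    return np.maximum((W * mask) @ x, 0.0)

rng = np.random.default_rng(0)
n_in, n_out, in_degree = 64, 32, 3
mask = bounded_in_degree_mask(n_in, n_out, in_degree, rng)
W = rng.standard_normal((n_out, n_in))
y = sparse_layer(rng.standard_normal(n_in), W, mask)

active_params = int(mask.sum())   # n_out * in_degree weights are actually used
dense_params = n_in * n_out       # weights a fully connected layer would use
```

The number of active weights grows with `n_out * in_degree` and does not depend on the input width `n_in`.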

3. Generalization, Approximation, and Statistical Guarantees

Compositional sparsity provides both approximation-theoretic and generalization gains:

Approximation: For compositionally sparse functions (i.e., compositions of low-arity subfunctions), deep architectures matching the compositional structure can achieve a given error $\varepsilon$ with a parameter count whose growth is governed by the arity of the constituents rather than by the ambient dimension $d$, whereas shallow or dense networks require a count exponential in $d$, which is the curse of dimensionality (Danhofer et al., 3 Jul 2025, Dahmen, 2022, Lin et al., 14 May 2026). This improvement depends critically on the sparsity of interactions per layer or module, not just on overall width or depth.
Generalization: Rademacher complexity and covering number bounds for compositionally sparse neural networks scale with the maximal degree or number of nonzero connections per layer, which improves exponentially over dense parameter-count or global norm-based bounds (Galanti et al., 2023). In metric learning, restricting to sparse combinations of local metrics or basis elements reduces overfitting and yields generalization bounds that depend only on the number of active bases, not the entire candidate pool (Shi et al., 2014).
Statistical Recovery: In compositional data analysis, estimation error in large covariance/precision matrices or principal subspaces can be made minimax optimal if the true underlying objects are row/column-sparse, even when only aggregate/relative data are observed (Zhang et al., 2023, Zhang et al., 2021, Jiang et al., 2024). This underpins the “blessing of dimensionality” phenomenon: identification error vanishes as the dimension grows under suitable sparsity scaling.
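To make the parameter-efficiency side of the approximation argument tangible, the sketch below counts weights in a binary-tree network whose nodes each read two inputs, and compares this with a single dense hidden layer. It only counts parameters and says nothing about the approximation error actually achieved. The node design (a tiny two-input MLP with `hidden` ReLU units) is an assumption made for illustration.

```python
def tree_param_count(d, hidden=4):
    """Parameters of a binary-tree network over d inputs (d a power of two).

    Assumption: each internal node is a tiny MLP mapping
    2 inputs -> `hidden` ReLU units -> 1 output.
    """
    assert d >= 2 and (d & (d - 1)) == 0, "d must be a power of two"
    per_node = 2 * hidden + hidden + hidden + 1   # weights + biases of one node
    n_nodes = d - 1                               # internal nodes of a full binary tree
    return n_nodes * per_node

def dense_layer_param_count(d, width):
    """Parameters of a one-hidden-layer dense network with scalar output."""
    return d * width + width + width + 1

tree_counts = {d: tree_param_count(d) for d in (8, 64, 1024)}
dense_count = dense_layer_param_count(1024, 1024)
```

With the node size held fixed, the tree's parameter count grows linearly in $d$, because each module's cost depends on its arity and not on the total number of inputs.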

4. Empirical Findings and Interpretability

Imposing compositional sparsity leads to empirical benefits beyond sample efficiency:

Generalization: Feature sparsity regularizers on relational modules in object-centric models improve held-out relational reasoning performance and yield simpler, more interpretable soft rules, as measured by the simplicity of post-hoc decision trees fitted to model outputs (Spies et al., 2022).
Representational Disentanglement: In RBMs and hierarchical generative models, sparsity induces a compositional phase where distributed, interpretable “parts” emerge at hidden layers. In natural language/gen RL agents, attention and embedding sparsity enable compositional generalization to novel attribute combinations (Tubiana et al., 2016, Xing et al., 2019, Spilsbury et al., 2022).
Parameter Efficiency and Robustness: Homological Neural Networks, with fixed wiring reflecting inferred sparse dependency graphs, match or outperform far larger dense baselines on both synthetic and real-world tabular regression with drastically fewer parameters and lower variance (Lin et al., 14 May 2026). In compositional data dimension reduction, the reduction matrix often becomes sparse without any explicit penalty, revealing interpretable groupings and amalgamations (Park et al., 6 Sep 2025).
Failure Cases: Overly strong sparsity may hurt performance (via excessive loss of information or missed object slots), and imperfect upstream object representations can dominate the model’s error profile (Spies et al., 2022).

5. Algorithms and Optimization Procedures

Several algorithmic paradigms are prevalent for learning with compositional sparsity:

Proximal Stochastic Optimization: For convex losses with $\ell_1$, group-lasso, or mixed norm penalties (as in SCML, RobRegCC, sparse PCA), regularized dual averaging, ISTA/FISTA, and linearized ADMM are efficient and provably convergent (Shi et al., 2014, Mishra et al., 2019, Zhang et al., 2021).
Sparse Attention/Coding: In SCoT, each attention block combines a soft-thresholding (proximal) operator to enforce coefficient sparsity and linear coefficient transfer to enable in-context compositional generalization (Chen et al., 25 Nov 2025). In slot-based architectures, sparsemax/Gumbel-softmax provides exact or approximate hard attention over objects.
Composable Algebraic Design: The composable sparsity algebra of Ghriss expresses all structural sparsity patterns as compositions of tensor views, block layouts, local scopes, and cross-tensor coupling, enabling unified second-order saliency and pruning across fine-grained and block-level granularity (Ghriss, 13 Apr 2026).
Hierarchical Top-$k$ Selection: In generative AND-OR models, sparsity is enforced by hard top-$k$ selection post-activation at each layer, yielding interpretable compositional hierarchies without additional regularization (Xing et al., 2019).
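The proximal approach mentioned in the list above can be illustrated with plain ISTA on an $\ell_1$-penalized least-squares problem. The sketch below is the textbook algorithm on synthetic data, assuming a fixed step size of $1/L$ with $L$ the squared spectral norm of the design matrix; it is not the specific solver used in SCML, RobRegCC, or SCoT. The same soft-thresholding operator is the proximal map of the $\ell_1$ norm.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter=500):
    """Approximately minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 with ISTA."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)           # gradient of the least-squares term
        x = soft_threshold(x - grad / L, lam / L)
    return x

def objective(A, b, lam, x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Synthetic problem with a 3-sparse ground truth (illustrative values).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
b = A @ x_true

x_hat = ista(A, b, lam=0.5)
```

With step size $1/L$ the objective is non-increasing across iterations, and entries whose gradient step falls below the threshold are set exactly to zero, which is how the penalty produces sparse solutions.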

6. Practical Applications and Theoretical Implications

Compositional sparsity has found utility across machine learning, statistics, and scientific computation:

Relational reasoning: Sparse object-centric models learn reusable rules, albeit sensitive to the quality of object discovery (Spies et al., 2022).
Metric learning: Global, multi-task, and local compositional sparsity yields scalable, generalizable metric spaces with state-of-the-art empirical performance (Shi et al., 2014).
Robust regression and dimension reduction: Penalties enforcing loadings and mean-shift sparsity enable outlier-robust inference and interpretability in compositional settings (Mishra et al., 2019, Park et al., 6 Sep 2025).
Compositional data analysis: Sparse precision estimation and subspace recovery in high dimensions underpin modern compositional network inference and PCA (Zhang et al., 2023, Zhang et al., 2021).
Function approximation and PDEs: Intrinsic compositional sparsity in parameter-to-solution maps for complex PDEs enables DNNs to avoid the curse of dimensionality under only moderate smoothness (Dahmen, 2022).
Grounded language/vision RL: Factored, sparse-attention architectures with disentangled word-to-attribute mappings yield robust compositional generalization in emergent-language agents (Spilsbury et al., 2022).
Transformer-based models: Explicit sparse coding layers in attention, with compositional transfer of coefficients, solve in-context compositional problems that defeat standard dense Transformer baselines (Chen et al., 25 Nov 2025).

7. Limitations, Trade-offs, and Open Directions

While compositional sparsity is a pervasive beneficial bias, open challenges persist:

Model selection and hyperparameter tuning: Overly strong sparsity can reduce accuracy or leave critical interactions unmodeled; the optimal degree of sparsity is often data- and task-dependent (Spies et al., 2022).
Failure with imperfect representations: Hierarchical or slot-based modules are sensitive to upstream failures in object/part discovery. Bypass paths, auxiliary losses for slot coverage, or fallback to dense features are suggested mitigations (Spies et al., 2022).
Theory gaps: Sharp theoretical understanding of which compositional DAG structures are learnable under realistic SGD and limited supervision, and what characterizes the recoverability of sparse support in end-to-end settings, remains incomplete (Danhofer et al., 3 Jul 2025).
Scalability and computation: Some data-driven structure learning procedures (e.g., IFN/MFCF in HNNs, the full joint Hessian in the composable sparsity algebra of Ghriss) scale polynomially with ambient dimension but may become costly at very large dimension without further approximation (Lin et al., 14 May 2026, Ghriss, 13 Apr 2026).
Interactions with modern architectures: Extensions to residual, attention-based, and superposition-rich settings (Transformers, multi-task DNNs) are ongoing research directions (Lin et al., 14 May 2026, Danhofer et al., 3 Jul 2025, Chen et al., 25 Nov 2025).

A comprehensive theory of modern machine learning (and, by extension, intelligence) must treat compositional sparsity as a foundational principle: it explains why deep, locally structured networks can generalize on high-dimensional functions that would otherwise be intractable, guides the construction of interpretable, robust, and efficient models, and sets the stage for the next generation of architecture-aware, structure-exploiting algorithms.