Position: A Theory of Deep Learning Must Include Compositional Sparsity (2507.02550v1)

Published 3 Jul 2025 in cs.LG and cs.AI

Abstract: Overparametrized Deep Neural Networks (DNNs) have demonstrated remarkable success in a wide variety of domains too high-dimensional for classical shallow networks subject to the curse of dimensionality. However, open questions about fundamental principles, that govern the learning dynamics of DNNs, remain. In this position paper we argue that it is the ability of DNNs to exploit the compositionally sparse structure of the target function driving their success. As such, DNNs can leverage the property that most practically relevant functions can be composed from a small set of constituent functions, each of which relies only on a low-dimensional subset of all inputs. We show that this property is shared by all efficiently Turing-computable functions and is therefore highly likely present in all current learning problems. While some promising theoretical insights on questions concerned with approximation and generalization exist in the setting of compositionally sparse functions, several important questions on the learnability and optimization of DNNs remain. Completing the picture of the role of compositional sparsity in deep learning is essential to a comprehensive theory of artificial, and even general, intelligence.

Summary

The paper posits that deep networks leverage compositional sparsity to represent complex functions with sparse input dependencies.
It demonstrates that exploiting sparse representations mitigates the curse of dimensionality, enabling efficient approximation and optimization.
Architectures like CNNs and transformers implicitly capture these compositional structures to boost performance and generalization.

Introduction

The paper discusses the critical role of compositional sparsity in explaining how Deep Neural Networks (DNNs) overcome the curse of dimensionality. Compositional sparsity is a property of the target function: it can be written as a composition of a small number of simpler constituent functions, each relying on only a small subset of the inputs. The paper contends that this property is shared by all efficiently Turing-computable functions and is fundamental to the approximation, optimization, and generalization capabilities of DNNs.

Classical Learning and the Curse of Dimensionality

Traditional learning methods, including shallow neural networks, are susceptible to the curse of dimensionality: approximating and generalizing a generic high-dimensional function accurately can require a number of parameters and samples that grows exponentially with the input dimension. Within the Empirical Risk Minimization (ERM) framework, generalization bounds tighten as the complexity of the hypothesis class is constrained. Shallow networks, however, often struggle because they cannot exploit the compositional sparsity inherent in many practical problems.
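To make the exponential dependence concrete, the sketch below compares two order-of-magnitude unit counts. Both scalings are assumptions brought in from the approximation literature on compositional functions and are not stated in this summary: roughly eps^(-d/m) units for a shallow network approximating a generic m-smooth function of d variables, and roughly (d - 1) * eps^(-2/m) units for a deep network that mirrors a binary tree of two-variable constituents. Constants are ignored, and the function names are hypothetical.

```python
def shallow_units(d: int, eps: float, m: int = 2) -> float:
    """Assumed scaling for a generic m-smooth function of d inputs: eps^(-d/m)."""
    return eps ** (-d / m)


def deep_tree_units(d: int, eps: float, m: int = 2) -> float:
    """Assumed scaling for a binary tree of two-input constituents.

    The tree has (d - 1) internal nodes, and each node is only a
    two-dimensional approximation problem costing about eps^(-2/m).
    """
    return (d - 1) * eps ** (-2 / m)


if __name__ == "__main__":
    eps = 0.1
    for d in (2, 8, 32):
        # Print dimension, shallow estimate, deep estimate.
        print(d, f"{shallow_units(d, eps):.3g}", f"{deep_tree_units(d, eps):.3g}")
```

Under these assumed scalings the shallow estimate grows exponentially in d, while the deep estimate grows only linearly, which is the gap the paper attributes to compositional sparsity.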

Compositional Sparsity in Deep Learning

The notion of compositional sparsity is built on the premise that efficiently Turing-computable functions can be decomposed into sparse constituent functions. This decomposition yields a Directed Acyclic Graph (DAG) representation in which each constituent function depends on only a small number of variables, facilitating efficient approximation by DNNs.

The paper formalizes the concept within the framework of efficiently Turing-computable functions and argues that efficient computability implies compositional sparsity. This supports the claim that DNNs, unlike shallow networks, can approximate functions in this class without the exponential blowup in approximation complexity.
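The DAG view described in this section can be illustrated with a small, hypothetical example: a function of eight inputs built as a binary tree of two-variable constituents. The constituent `h`, the node names, and the helper functions are all assumptions chosen for illustration, not the paper's construction.

```python
from typing import Callable, Dict, List, Tuple

# A node maps a name to (constituent function, names of its parents).
Node = Tuple[Callable[..., float], List[str]]


def evaluate_dag(nodes: Dict[str, Node], inputs: Dict[str, float], output: str) -> float:
    """Evaluate the DAG by memoized recursion from the output node."""
    cache: Dict[str, float] = dict(inputs)

    def value(name: str) -> float:
        if name not in cache:
            fn, parents = nodes[name]
            cache[name] = fn(*(value(p) for p in parents))
        return cache[name]

    return value(output)


def max_fan_in(nodes: Dict[str, Node]) -> int:
    """Largest number of arguments any single constituent function takes."""
    return max(len(parents) for _, parents in nodes.values())


def h(a: float, b: float) -> float:
    """Arbitrary two-variable constituent, chosen only for illustration."""
    return a * b + 1.0


# Binary tree over inputs x0..x7: 7 constituents, each with 2 arguments.
tree: Dict[str, Node] = {
    "h11": (h, ["x0", "x1"]),
    "h12": (h, ["x2", "x3"]),
    "h13": (h, ["x4", "x5"]),
    "h14": (h, ["x6", "x7"]),
    "h21": (h, ["h11", "h12"]),
    "h22": (h, ["h13", "h14"]),
    "out": (h, ["h21", "h22"]),
}

if __name__ == "__main__":
    xs = {f"x{i}": 1.0 for i in range(8)}
    print(evaluate_dag(tree, xs, "out"), max_fan_in(tree))
```

The overall function has eight inputs, yet no constituent ever sees more than two values; that bounded fan-in is what "sparse" refers to here.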

Learnability and Optimization Challenges

While compositional sparsity clarifies the potential for efficient representation in DNNs, learning these decompositions from data remains challenging. The paper reviews theoretical complexity barriers that limit the efficient learnability of arbitrary compositionally sparse functions, noting that specific subsets of these functions can be learned efficiently with suitable structural assumptions or supervisory hints.

Architectures such as Convolutional Neural Networks (CNNs) exploit compositional sparsity by restricting operations to local patches, which improves the optimization landscape and generalization bounds. Transformers show empirical success and appear to learn compositional structure implicitly through attention mechanisms and autoregressive training.
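The locality constraint can be seen in a minimal pure-Python sketch of a one-dimensional convolution. It is an illustration under simplifying assumptions (single channel, no padding, stride 1), and the function names are hypothetical. Each output depends on only `len(kernel)` neighbouring inputs, and the layer has `len(kernel)` shared weights instead of one weight per input-output pair.

```python
from typing import List


def conv1d_valid(x: List[float], kernel: List[float]) -> List[float]:
    """Sliding dot product: output i depends only on x[i : i + len(kernel)]."""
    k = len(kernel)
    return [
        sum(kernel[j] * x[i + j] for j in range(k))
        for i in range(len(x) - k + 1)
    ]


def dense_weight_count(n_in: int, n_out: int) -> int:
    """Weights in a fully connected layer mapping n_in inputs to n_out outputs."""
    return n_in * n_out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    kernel = [1.0, 0.0, -1.0]
    y = conv1d_valid(x, kernel)

    # Perturb only the last input; only outputs whose window covers it can change.
    x_perturbed = x[:-1] + [10.0]
    y_perturbed = conv1d_valid(x_perturbed, kernel)

    print(y, y_perturbed)
    print(len(kernel), dense_weight_count(len(x), len(y)))
```

Changing the last input alters only the last output, whereas in a dense layer every output would generally change.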

Universality of Autoregressive Predictors

The paper also explores the universality of autoregressive next-token predictors, suggesting that training on datasets that contain intermediate steps allows efficient learning of compositionally sparse functions. Chain-of-Thought (CoT) prompting exemplifies the same idea: it decomposes a complex reasoning task into manageable subproblems, each a sparse computation, which enhances learnability.
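A small illustration of why intermediate steps help, using the parity of a bit string as an assumed example (the summary itself does not name a task). Predicting the parity directly requires looking at every bit, but if the sequence also contains the running parities, each next token depends on only two earlier tokens. The function names are hypothetical.

```python
from typing import List


def next_step(context: List[int], n: int) -> int:
    """Next intermediate token, given n input bits followed by earlier steps.

    Only two tokens of the context are read: the previous running parity
    (or 0 if no step has been emitted yet) and the next unread input bit.
    """
    t = len(context) - n                 # steps emitted so far
    prev = context[-1] if t > 0 else 0
    return prev ^ context[t]


def generate(bits: List[int]) -> List[int]:
    """Autoregressively append the n running parities to the input bits."""
    n = len(bits)
    context = list(bits)
    for _ in range(n):
        context.append(next_step(context, n))
    return context


if __name__ == "__main__":
    bits = [1, 0, 1, 1]
    sequence = generate(bits)
    # The final token is the parity of the whole input.
    print(sequence, sequence[-1] == sum(bits) % 2)
```

Every prediction target in the extended sequence is a two-input function, so the hard global task has been replaced by a chain of sparse local ones.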

Open Questions and Future Work

The paper identifies several unresolved issues, including how deep networks discover compositional structures and what role supervision plays in efficient learning. These questions call for further work, which could inform network designs or training paradigms that incorporate compositional assumptions directly.

Alternative Theories

Alternative perspectives such as manifold learning and the multi-index model offer different explanations for the success of DNNs. Manifold learning postulates that data lies on low-dimensional manifolds, which sidesteps the curse of dimensionality by reducing the effective dimension of the learning task. Similarly, the multi-index model assumes that the target depends on the input only through a small number of linear projections, so a network succeeds by identifying these relevant features.

Conclusion

In sum, compositional sparsity emerges as a unifying principle that accounts for the approximation, optimization, and generalization prowess of DNNs. By exploiting sparse function representations, deep networks can handle complex, high-dimensional tasks without succumbing to the curse of dimensionality. Future research will likely deepen the theoretical understanding and practical application of these concepts, driving advancements in AI efficiency and interpretability.