
Deep Networks Learn Deep Hierarchical Models

Published 1 Jan 2026 in cs.LG and cs.AI | (2601.00455v1)

Abstract: We consider supervised learning with $n$ labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class assumes the existence of an (unknown) label hierarchy $L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]$, where labels in $L_1$ are simple functions of the input, while for $i > 1$, labels in $L_i$ are simple functions of simpler labels. Our class surpasses models that were previously shown to be learnable by deep learning algorithms, in the sense that it reaches the depth limit of efficient learnability. That is, there are models in this class that require polynomial depth to express, whereas previous models can be computed by log-depth circuits. Furthermore, we suggest that learnability of such hierarchical models might eventually form a basis for understanding deep learning. Beyond their natural fit for domains where deep learning excels, we argue that the mere existence of human "teachers" supports the hypothesis that hierarchical structures are inherently available. By providing granular labels, teachers effectively reveal "hints" or "snippets" of the internal algorithms used by the brain. We formalize this intuition, showing that in a simplified model where a teacher is partially aware of their internal logic, a hierarchical structure emerges that facilitates efficient learnability.

Summary

  • The paper demonstrates that deep residual networks trained with layerwise SGD can provably learn complex, deep hierarchical models.
  • Explicit theoretical bounds reveal that polynomial-depth networks reduce sample and time complexity compared to shallow models.
  • The analysis underscores the significance of hierarchical label decompositions, aligning with practical insights in deep learning.

Deep Networks and Their Provable Capacity to Learn Deep Hierarchical Models

Introduction and Context

The paper "Deep Networks Learn Deep Hierarchical Models" (2601.00455) presents a structural and algorithmic explanation for why deep neural networks—specifically, residual networks trained by layerwise stochastic gradient descent (SGD)—can efficiently learn a class of hierarchical models. The core motivation is to provide a theoretical foundation matching the representational and computational successes of deep learning in high-dimensional tasks, going beyond previously established learnability results (e.g., for linear models or shallow non-linear models).

Unlike most prior theoretical analyses, which establish learnability for function classes expressible by shallow or logarithmic-depth circuits, this paper constructs an explicit class of hierarchical models that can require polynomial (and thus arbitrarily large) depth for efficient representation and learning. The analysis yields bounds for sample and time complexity and establishes that the effective learnability arises from an implicit hierarchical decomposition present in the labels and learned functions.

Formal Model of Hierarchical Functions

The model considers the supervised, multilabel learning scenario. Given $n$ possible labels, the ground truth function $f^*: \mathcal{X} \to \{\pm 1\}^n$ is characterized by a hierarchy $L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]$ over the labels, where:

  • Labels in $L_1$ are simple functions (polynomial threshold functions, PTFs) of the input.
  • For $i > 1$, every label in $L_i$ is itself a simple (PTF) function of some subset of the labels in $L_{i-1}$.
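
As a concrete illustration, this hierarchical label model can be sketched in a few lines of NumPy: level-1 labels are threshold functions of the input, and each deeper level applies threshold functions to the previous level's labels. The dimensions and the use of linear (degree-1) PTFs here are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def ptf(weights, bias):
    """A degree-1 polynomial threshold function: sign(<w, z> + b)."""
    return lambda z: np.sign(z @ weights + bias)

# Illustrative sizes (not from the paper): d-dim inputs, r = 3 levels,
# two new labels introduced at each level.
d, per_level, r = 10, 2, 3

level_fns = []
for i in range(r):
    in_dim = d if i == 0 else per_level          # level 1 reads the input,
    level_fns.append([ptf(rng.normal(size=in_dim), 0.0)  # deeper levels read
                      for _ in range(per_level)])        # the previous labels

def f_star(x):
    """Evaluate all labels bottom-up; the label sets nest as L_1 ⊆ ... ⊆ L_r."""
    labels, z = [], x
    for fns in level_fns:
        z = np.array([f(z) for f in fns])  # this level's new labels
        labels.extend(z)
    return np.array(labels)                # ±1 vector with n = r * per_level entries

y = f_star(rng.normal(size=d))
```

Note that each label at level $i$ is a function of labels one level down, so evaluating the full label vector requires a forward pass of depth $r$, mirroring why a deep network is the natural hypothesis class.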

The significance of this model is that, by controlling $r$, $K$, and the label dependency structure, one can generate functions with polynomial hierarchical depth, vastly surpassing the shallower models previously shown to be efficiently learnable by SGD or kernel-based approaches.

The model also naturally captures the kind of multilevel abstraction intuitively believed to exist in domains such as images, speech, and text, aligning with both deep learning practice and cognitive science perspectives on human concept acquisition.

Main Algorithmic and Theoretical Results

The paper provides a constructive proof that layerwise SGD on a specifically initialized and structured residual network (ResNet) with an appropriate depth can efficiently learn any hierarchical model of this type—without explicit knowledge of the underlying label hierarchy. The main theorem states that, for any target function $f^*$ admitting a hierarchical decomposition of complexity $(r, K, M, B, \xi)$, there exists a layerwise SGD procedure over a sufficiently deep ResNet architecture that can learn it to arbitrary accuracy with polynomial sample and computational complexity. Specifically, for depth $D \ge r \cdot O(B + 1/\xi)$ and hidden width $q = \tilde{O}((M+1)^4 (wn)^{2K} / \gamma^{4+2K})$, the sample and parameter count for $O(1/m)$-consistency are both polynomial in the relevant parameters.
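
For a rough sense of the scaling, the quoted depth and width bounds can be evaluated numerically if one (heavy-handedly) treats the constants hidden in $O(\cdot)$ and the log factors in $\tilde{O}(\cdot)$ as 1; the parameter values below are arbitrary placeholders, not values from the paper.

```python
def depth_bound(r, B, xi):
    # D >= r * O(B + 1/xi), with the hidden constant taken to be 1.
    return r * (B + 1.0 / xi)

def width_bound(M, w, n, K, gamma):
    # q = Õ((M+1)^4 (w n)^{2K} / gamma^{4+2K}), log factors dropped.
    return (M + 1) ** 4 * (w * n) ** (2 * K) / gamma ** (4 + 2 * K)

# Placeholder parameters: depth grows linearly in the hierarchy depth r,
# while width is polynomial in the remaining parameters for fixed fan-in K.
D = depth_bound(r=5, B=3, xi=0.5)
q = width_bound(M=2, w=4, n=10, K=2, gamma=0.25)
```

The key qualitative point is that for a fixed fan-in $K$, both quantities are polynomial in $r$, $n$, and the remaining parameters, which is what "efficient learnability" refers to in the theorem.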

Key structural and quantitative claims include:

  • The learned function class strictly surpasses circuits of log-depth: there are hierarchical functions with polynomial depth (and thus exponential size when written as shallow circuits) that are efficiently learnable by this algorithm.
  • Previous models learnable by SGD, including those corresponding to kernels/random features and shallow or bounded-depth networks, are strict subclasses of this hierarchical family.
  • The learnability is robust: even under scenarios where each label depends on only a bounded number of simpler labels (examples akin to realistic compositional features), the class is efficiently learnable with substantially improved sample and network complexity bounds.

Algorithmic Insights

The core algorithm alternates between representation learning and prediction at each layer:

  1. Each network layer learns to approximate all labels at the current hierarchy level, using random features and PTFs built over the representations from previous layers.
  2. The proof of learnability uses explicit random feature embeddings and polynomial approximation theory, showing that any robust PTF can be realized to high accuracy by random features-based predictors with polynomial width.
  3. The layerwise training principle ensures that as features and labels at earlier levels become accurate, subsequent layers iteratively enable exact learning for deeper (semantically higher) labels.
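
The steps above can be sketched in NumPy under simplifying assumptions: linear-threshold labels, a linear head over ReLU random features trained by full-batch gradient steps (standing in for the paper's layerwise SGD phase), and residual-style concatenation rather than the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_features(z, W, b):
    """ReLU random-feature embedding of the current representation."""
    return np.maximum(z @ W + b, 0.0)

def train_layer(z, y, width=256, lr=1e-3, steps=100):
    """Fit a linear head over random features by gradient steps on the
    squared loss -- a simplified stand-in for one layerwise SGD phase."""
    W = rng.normal(size=(z.shape[1], width)) / np.sqrt(z.shape[1])
    b = rng.normal(size=width)
    phi = random_features(z, W, b)
    a = np.zeros((width, y.shape[1]))
    for _ in range(steps):
        a -= lr * phi.T @ (phi @ a - y) / len(z)  # gradient of mean squared loss (up to a factor 2)
    return W, b, a

# Synthetic two-level hierarchy: level-1 labels are threshold functions of
# the input, level-2 labels are threshold functions of the level-1 labels.
X = rng.normal(size=(512, 10))
y1 = np.sign(X @ rng.normal(size=(10, 3)))
y2 = np.sign(y1 @ rng.normal(size=(3, 3)))

# Layerwise loop: each layer is trained on its level's labels, then its
# thresholded predictions are appended residually as inputs to the next layer.
z = X
for y in (y1, y2):
    W, b, a = train_layer(z, y)
    preds = np.sign(random_features(z, W, b) @ a)
    z = np.concatenate([z, preds], axis=1)
```

Concatenating each layer's predictions onto the running representation is a crude proxy for residual connections: later layers can read both the raw input and all previously recovered labels, which is what lets the deeper labels be learned once the shallower ones are accurate.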

The analysis is supported by precise generalization results leveraging parameter counting and Lipschitz continuity, as well as robust margin-based error controls that guarantee arbitrarily low classification error with polynomial resources.

Human Supervision, Implicit Hierarchies, and Learning Theory

This work advances a strong claim that the existence of human "teachers" and fine-grained labeling likely implies hierarchical structure in real-world data—a point reinforced by the observation that modern large-scale learning datasets provide labels at various levels of abstraction (e.g., objects, parts, local features).

The "brain dump" model introduced in the paper formalizes this intuition: if an annotator can provide labels corresponding to arbitrary PTFs or ensemble features/functions over intermediate neural activations, then the resulting target function admits a low-complexity hierarchical decomposition. The extension to randomized ensembles and spatial/sequence models covers practical scenarios such as image segmentation or sequence labeling in NLP.

An important meta-theoretical implication is drawn: efficient learnability of highly rich function classes is enabled not despite, but because of, the hierarchical label structure imposed by the nature of teaching and supervision, side-stepping worst-case computational hardness results that plague general PAC and agnostic learning for deep circuits.

This theoretically motivates why deep learning succeeds in practice: the world and human-generated data inherently possess additional structure (available through hierarchies and "hints"), which is efficiently exploitable by deep architectures.

Connections, Extensions, and Future Research

The hierarchical learnability theory builds on and extends several previous lines of work, including early results on kernel/NTK learnability, theoretical analyses of deep circuit classes (e.g., [mossel2016deep], [koehler2022reconstruction], [huang2025optimal], [li2025noise]), and works interpreting deep learning through the lens of compositional and hierarchical function spaces (e.g., [mhaskar2017when], [bruna2013invariant], [wang2025learning]).

The paper identifies critical directions for future work, such as:

  • Extending hierarchical model analysis to capture single-function hierarchies and the role of curriculum or self-supervised learning.
  • Theoretical analysis of attention and non-local mechanisms within this framework.
  • Identifying and exploiting explicit hierarchical decompositions in known algorithms and empirically learned models, potentially enabling direct interpretability.
  • Empirical tests to validate the hypothesis that real-world datasets exhibit recoverable hierarchical structure, and development of algorithms that actively seek or use such decompositions.

There are also limitations and technical assumptions to be addressed: strong supervision (all positive labels per instance), layerwise rather than joint training, output layer orthogonality, and non-standard loss constructions. The extension to partial/hard supervision, alternative optimization protocols, and more natural data regimes remains an open field.

Conclusion

This paper establishes that deep residual architectures with layerwise SGD can provably and efficiently learn highly expressive hierarchical function classes, matching the practical and representational capacity of deep learning. By formalizing hierarchical compositionality as the key enabler of tractable learning in deep networks, the work provides a rigorous framework for explaining the empirical success of deep models, clarifies the role of supervision, and motivates both theoretical and practical advances directed at leveraging, recovering, and exploiting hierarchical structure in data and network design (2601.00455).
