End-to-End Learning Capacity

Updated 29 May 2026

End-to-end learning capacity is the measure of task complexity, network depth, and parameter count under which joint optimization via gradient descent remains effective.
The paper demonstrates that exceeding this capacity causes exponential increases in epochs needed for convergence and introduces gradient interference issues.
Practical strategies like hybrid training, capacity-aware dataset design, and architectural reparameterization are recommended to extend effective capacity in real-world applications.

End-to-end learning capacity characterizes the practical limits of how much functional complexity, network depth, parameter count, or modular compositionality a differentiable system can absorb through joint gradient-based optimization before learning degrades, stagnates, or fails. This capacity is not abstract model-class expressiveness in the sense of VC-dimension or universal approximation; it is the largest scale at which a specific end-to-end (E2E) architecture, trained by standard stochastic gradient descent (SGD) or adaptive variants, will reliably converge in feasible time on real data. Across domains (vision, reinforcement learning, quantum metrology, and communications), studies have sought to delineate the boundaries and signatures of E2E learning capacity both empirically and theoretically.

1. Definitions and Core Notions

End-to-end learning refers to a paradigm in which all components of a machine learning system are jointly optimized via backpropagation, treating the entire architecture as a single, differentiable computation graph. "End-to-end learning capacity" is thus informally the tuple of task complexity, network depth $D$ (or modular depth $N$ ), and parameter count $P$ for which E2E learning remains tractable and effective. A system exceeds its E2E capacity when the compounded computational and statistical burdens impede gradient-based optimization: this may manifest as exponentially slowed convergence, loss spikes, or total collapse into trivial solutions (Glasmachers, 2017).

The operational capacity metric varies by context:

In classical domains, it is the scale $(N, D, P)$ for which SGD reaches target accuracy in acceptable epochs.
In quantum sensing, capacity is formalized as the maximum fraction of function space that can be reliably inferred per fixed measurement resources, normalized as $C[f^*]$ or $\mathrm{REC}(S)$ (Ilias et al., 23 Dec 2025).
In time-dependent prediction, effective E2E capacity is shaped by not only model size but also dataset properties, such as temporal sampling frequency, which modulate the information and redundancy burden presented to the network (Liu et al., 11 May 2026).

2. Theoretical Characterization and Failure Modes

Glasmachers (2017) provides two central theoretical arguments for capacity limitations in E2E gradient descent:

Gradient Interference: When the system is architected from specialized modules, E2E loss backpropagates gradients that are not aligned with module roles, blurring distinct module objectives. Early/central modules receive gradients shaped by the (randomly initialized, ill-trained) downstream stack, resulting in random walk trajectories in parameter space and an increased likelihood of convergence to poor local minima or saddle points.
Ill-Conditioning with Depth: For a composed system $f(x; W) = M_N \circ \ldots \circ M_1(x)$, the Hessian $\nabla^2 L$ of the total loss typically shows an exponential increase in its condition number $\kappa$ as $N$ grows, and gradient-descent convergence slows as $\kappa$ increases. Empirically, even for simple tasks (e.g., a stacked identity mapping with $N = 5$ modules and only 470 trainable parameters), the number of epochs to zero error grows roughly exponentially in $N$, with practical failure by $N = 5$ (Glasmachers, 2017).

Table: E2E Capacity Breakdown (Identity Mapping Stack; Glasmachers, 2017)

| N (modules) | Params | Epochs to convergence (median) | Success rate |
|-------------|--------|--------------------------------|--------------|
| 1 | 94 | dozens | 100% |
| 2 | 188 | hundreds | 100% |
| 3 | 282 | 1,000s–10,000s | 100% |
| 4 | 376 | $\sim$100,000 | 100% |
| 5 | 470 | fail (epoch limit reached) | 20% |

Similar exponential degradation arises for MNIST classifiers artificially stacked with bottleneck modules and multi-module planning agents in grid environments.
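The link between conditioning and convergence time in the ill-conditioning argument can be illustrated on a toy problem. The sketch below is not the stacked-module experiment from the cited work: it assumes a two-dimensional quadratic loss whose curvatures are 1 and `kappa`, so that `kappa` is exactly the Hessian condition number, and counts plain gradient-descent steps to a fixed tolerance. The function name, step-size rule, and tolerance are illustrative choices.

```python
def gd_iterations(kappa, tol=1e-6, max_iter=1_000_000):
    """Count gradient-descent steps needed to minimize
    L(x1, x2) = 0.5 * (x1**2 + kappa * x2**2) from the start point (1, 1)."""
    x1, x2 = 1.0, 1.0
    lr = 1.0 / kappa  # step size 1/L, where L = kappa is the largest curvature
    for step in range(1, max_iter + 1):
        # The gradient of the quadratic is (x1, kappa * x2).
        x1 -= lr * x1
        x2 -= lr * kappa * x2
        if 0.5 * (x1 * x1 + kappa * x2 * x2) < tol:
            return step
    return None  # budget exhausted without reaching the tolerance


if __name__ == "__main__":
    for kappa in (1, 10, 100, 1000):
        print(kappa, gd_iterations(kappa))
```

The step count grows roughly in proportion to `kappa` here, so if the condition number itself grows exponentially with the number of modules, the step count would inherit that growth. The mapping from $N$ to $\kappa$ is not modeled in this sketch.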

Principal empirical failure mechanisms are:

Catastrophic gradient “jamming” where early modules cannot escape poor initializations due to uninformative, noisy, or misaligned upstream gradients.
Exponential sample complexity required to jointly explore the parameter manifold of multiple interacting modules.

Hybrid training schemes (pretraining, module-wise freezing, and staged integration) can extend the effective E2E capacity by better leveraging modularity (Glasmachers, 2017).

3. Information-Theoretic Perspectives

E2E training induces layerwise information bottlenecks and role differentiation inaccessible to layerwise (greedy) or local training. Analysis using the normalized Hilbert-Schmidt Independence Criterion (nHSIC) reveals that:

E2E-trained networks show dynamic transitions across the information plane: early layers preserve input information (nHSIC with the input rises), middle layers compress to retain only task-relevant details (nHSIC with the input falls while nHSIC with the label plateaus), and final layers focus on discriminative feature extraction (Sakamoto et al., 2024).
Layer-role differentiation allows for cooperative specialization: early layers act as encoders, middle layers as compressors/entanglers, final layers as discriminators.
Layerwise training, in contrast, yields uniform compression with loss of label information and lower test accuracy. For example, with ResNet18 on CIFAR10 after 50 epochs, the input/label nHSIC values of middle layers differ between E2E and layerwise training, and E2E converges to a higher test accuracy than layerwise training.

The information bottleneck principle [Tishby et al.] is empirically reflected in E2E trajectories, with late-layer nHSIC showing a compression plateau for input information but sustained label relevance (Sakamoto et al., 2024).
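To make the information-plane quantities concrete, the following is a minimal sketch of a normalized HSIC estimate between two sets of samples, such as inputs and a layer's activations. It assumes a Gaussian kernel and an alignment-style normalization (centered Gram matrices, Frobenius inner product divided by the product of norms). The estimator used in the cited analysis may differ, and all function names here are illustrative.

```python
import math


def gram(xs, sigma=1.0):
    """Gaussian-kernel Gram matrix for a list of feature vectors."""
    def k(a, b):
        d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-d2 / (2.0 * sigma ** 2))
    return [[k(a, b) for b in xs] for a in xs]


def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def frob_inner(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def nhsic(xs, zs, sigma=1.0):
    """Alignment-style normalized HSIC; 0.0 if either side has no variation."""
    Kx, Kz = center(gram(xs, sigma)), center(gram(zs, sigma))
    denom = math.sqrt(frob_inner(Kx, Kx) * frob_inner(Kz, Kz))
    return frob_inner(Kx, Kz) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    inputs = [[0.0], [1.0], [2.0], [3.0]]
    print(nhsic(inputs, inputs))       # representation that copies the input
    print(nhsic(inputs, [[0.0]] * 4))  # representation that discards the input
```

Tracking this value per layer against both the inputs and the labels over training epochs gives the kind of trajectory discussed above. A representation that copies the input scores near 1, and one that is constant scores 0.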

4. Domain-Specific Manifestations and Scaling

Optical Communications

End-to-end optimization, as applied to optical intensity-modulation/direct-detection (IM/DD) links, leverages coupled transmitter, channel, and receiver parameterizations as a single neural network. E2E-trained IM/DD links have achieved sub-HD-FEC bit error rates at up to 42 Gb/s over 40 km, substantially outperforming classic pulse amplitude modulation (PAM2/4) with feedforward equalization in both simulation and experiment (Karanov et al., 2018). While explicit Shannon capacity bounds are not given, end-to-end deep learning enables the system to approach the nonlinear channel capacity better than modular approaches with separate optimizations.

Temporal Perception and Trajectory Prediction

In end-to-end trajectory prediction for autonomous driving, capacity is shaped not just by parameter count but by its interaction with temporal sampling frequency. Increasing the frame rate can burden models with redundant or off-manifold visual content, exceeding the capacity of small networks (on the order of millions of parameters), which often reach minimal average displacement error (ADE) at intermediate frequencies (6–10 Hz). Larger models (AutoVLA, at the billion-parameter scale), with substantially higher capacity, benefit monotonically from increased sampling and attain their best accuracy at the highest available frequency. This demonstrates a capacity-frequency trade-off: dense sampling is only advantageous if model capacity can absorb the redundant or noisy visual content without degradation (Liu et al., 11 May 2026).
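When experimenting with this trade-off, the temporal sampling frequency can be varied by subsampling a recorded frame sequence. The helper below is a sketch under stated assumptions: the function name and the 20 Hz source rate in the usage lines are hypothetical and are not taken from the cited study.

```python
def subsample_frames(frames, source_hz, target_hz):
    """Return a subsequence of `frames` that approximates `target_hz`,
    given that `frames` were recorded at `source_hz`."""
    if target_hz <= 0 or target_hz > source_hz:
        raise ValueError("target_hz must be in (0, source_hz]")
    stride = source_hz / target_hz  # may be fractional, e.g. 20 Hz -> 6 Hz
    picked, next_pick = [], 0.0
    for i, frame in enumerate(frames):
        # Keep the first frame at or after each ideal sampling position.
        if i >= next_pick - 1e-9:
            picked.append(frame)
            next_pick += stride
    return picked


if __name__ == "__main__":
    one_second = list(range(20))  # hypothetical 20 Hz clip, frame indices only
    for hz in (20, 10, 6, 5):
        print(hz, subsample_frames(one_second, source_hz=20, target_hz=hz))
```

Sweeping `target_hz` for a fixed model and recording ADE at each setting is one way to locate the intermediate optimum described above for smaller networks.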

Action Recognition and Fine-Grained Perception

In video-based fitness activity recognition, end-to-end architectures pre-trained on large, fine-grained datasets match or surpass state-of-the-art pose-based pipelines for both coarse (40-way classification) and fine (real-time repetition counting) tasks. The capacity bottleneck is primarily the scale and granularity of pre-training: with identical architectures, BigFitness pre-training delivers a gain of 20–30 percentage points over ImageNet or Kinetics pre-training. Furthermore, end-to-end models can produce framewise temporal phase predictions in real time at much lower compute cost than cascaded pose extraction pipelines. The capacity is thus determined both by model size and by specialized pre-training (Mercier et al., 2023).

Reinforcement Learning and Emergent Functions

Empirical studies in end-to-end RL from Shibata and collaborators demonstrate that sufficiently wide/deep static (feed-forward) NNs and recurrent NNs (RNNs) can support a broad spectrum of emergent functions—from image recognition to temporal abstraction, attention, memory formation, exploration, and elementary communication—if and only if architectural capacity matches the task’s sensorimotor, memory, and temporal requirements. When network or state dimensions are undersized, or when RNNs are not initialized close to identity (preserving temporal gradients), emergent functions and long-range dependencies do not materialize (Shibata, 2017). No formal scaling laws are provided; operational capacity is inferred from successful emergence as the task’s degrees of freedom scale.
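The role of near-identity initialization can be seen in the simplest possible case. The sketch below assumes a scalar linear recurrence $h_t = w\,h_{t-1} + x_t$, for which the gradient of the final state with respect to the initial state is $w^T$ after $T$ steps. It is an illustration of why temporal gradients survive when the recurrent Jacobian is close to identity, not a reproduction of the cited experiments.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1} + x_t."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # each step multiplies the backpropagated signal by w
    return grad


if __name__ == "__main__":
    for w in (1.0, 0.9, 0.5):
        print(w, gradient_through_time(w, steps=50))
```

With `w = 1.0` the gradient is preserved exactly over 50 steps, while smaller values shrink it geometrically, which is the scalar analogue of long-range dependencies failing to materialize.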

5. Parametric and Functional Factors Affecting E2E Capacity

Architectural Decomposition and Parameter Efficiency

Empirically, hard parameterization constraints—such as separable 1D convolutions in DecomposeMe (Alvarez et al., 2016)—can increase effective E2E learning capacity by allowing deeper, more expressive networks within fixed parameter budgets. DecomposeMe achieves equivalent or improved accuracy with up to 92% fewer parameters compared to standard VGG-B on Places2, leveraging the increased "effective depth" from inserted nonlinearities. This demonstrates that architectural tricks can extend the practical boundary of E2E capacity by enabling more expressive computations per learned weight.
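The parameter saving from replacing a 2D kernel with a pair of 1D kernels follows from simple counting. The sketch below compares weight counts for one layer, ignoring biases. The channel widths and the intermediate width `c_mid` are illustrative assumptions, and the single-layer saving shown is not the network-level figure reported for DecomposeMe.

```python
def conv2d_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


def separable_1d_params(c_in, c_mid, c_out, k):
    """Weights in a k x 1 convolution followed by a 1 x k convolution,
    with c_mid channels between the two 1D stages."""
    return c_in * c_mid * k + c_mid * c_out * k


if __name__ == "__main__":
    full = conv2d_params(256, 256, 3)
    decomposed = separable_1d_params(256, 256, 256, 3)
    print(full, decomposed, 1 - decomposed / full)
```

For equal channel widths the ratio of decomposed to full weights is $2/k$, so the saving grows with kernel size, and the nonlinearity inserted between the two 1D stages adds depth without adding weights.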

Information and Noise Scaling in Quantum Inference

In quantum machine learning and metrology, capacity is quantified via the Resolvable Expressive Capacity (REC), defined as the maximum normalized Bayesian risk improvement over the prior. For a linear estimator with $S$ shots, the REC is given by

$\mathrm{REC}(S) = \mathrm{Tr}\left[\left(G + \tfrac{1}{S} V\right)^{-1} G\right] = \sum_{k} \frac{1}{1 + \beta_k^2 / S}$

where $G$ is the Gram matrix of features and $V$ is the average shot-noise covariance; $\beta_k^2$ are the generalized eigenvalues of the pair $(V, G)$, capturing the noise-robust feature spectrum. The practical implication is that, for multi-qubit quantum sensors, only a limited subset of eigentasks remains noise-robust at finite $S$, tightly bounding the practically achievable E2E capacity (Ilias et al., 23 Dec 2025).
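The dependence of capacity on the shot budget can be sketched numerically. The code below assumes the capacity takes the eigentask form $\sum_k 1/(1 + \beta_k^2/S)$, with $\beta_k^2$ the generalized eigenvalues of the noise covariance against the Gram matrix. This form is an assumption borrowed from the eigentask literature and should be checked against the cited paper. The eigenvalues are supplied directly as hypothetical inputs rather than computed from a physical model.

```python
def resolvable_capacity(beta_sq, shots):
    """Sum of per-eigentask contributions 1 / (1 + beta_k^2 / S)."""
    return sum(1.0 / (1.0 + b / shots) for b in beta_sq)


def resolvable_eigentasks(beta_sq, shots):
    """Count eigentasks whose noise-to-signal ratio beta_k^2 / S is below 1."""
    return sum(1 for b in beta_sq if b < shots)


if __name__ == "__main__":
    spectrum = [1.0, 10.0, 100.0, 1e4, 1e6]  # hypothetical beta_k^2 values
    for shots in (10, 1000, 100000):
        print(shots,
              resolvable_capacity(spectrum, shots),
              resolvable_eigentasks(spectrum, shots))
```

Each eigentask contributes close to 1 when $\beta_k^2 \ll S$ and close to 0 when $\beta_k^2 \gg S$, so at a finite shot budget only the low-noise part of the spectrum counts toward capacity.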

6. Capacity Optimization: Strategies and Practical Recommendations

To extend the effective E2E capacity of complex modular systems, several strategies are advised:

Hybrid/structured training: Pretrain modules (or small module clusters) individually with isolated objectives, freeze when stable, and then fine-tune jointly with the global loss—balancing modular tractability and E2E expressiveness (Glasmachers, 2017).
Capacity-aware dataset design: For perceptual sequence tasks, tune temporal sampling frequency relative to model size; for small models, intermediate sampling frequencies may optimize prediction accuracy and stability (Liu et al., 11 May 2026).
Information bottleneck regularization: Encourage middle and late layers to compress input information while sustaining task-relevant information (as revealed by information plane analysis) (Sakamoto et al., 2024).
Architectural reparameterization: Employ low-rank/structured layers (e.g., DecomposeMe) to maximize per-weight expressiveness and conserve memory/compute (Alvarez et al., 2016).
Functional eigenspace reduction: In quantum or hybrid statistical sensing, restrict inference to the principal eigentasks robust to shot noise, thus matching network width to the effective information bandwidth supported by system resources (Ilias et al., 23 Dec 2025).
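The hybrid training strategy listed above (pretrain, freeze, then fine-tune jointly) can be sketched on a deliberately tiny system. The example assumes two scalar linear modules, $h = w_1 x$ and $y = w_2 h$, a global target $y = 6x$, and a hypothetical isolated objective $h = 2x$ for the first module. The learning rates, step counts, and targets are illustrative and not taken from the cited work.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def descend(params, grad_fn, lr, steps):
    """Plain gradient descent on a list of scalar parameters."""
    for _ in range(steps):
        params = [p - lr * g for p, g in zip(params, grad_fn(params))]
    return params


xs = [0.5, 1.0, 1.5, 2.0]
h_star = [2.0 * x for x in xs]  # hypothetical isolated target for module 1
y_star = [6.0 * x for x in xs]  # global target for the composed system


def global_loss(w1, w2):
    return mean((w2 * w1 * x - y) ** 2 for x, y in zip(xs, y_star))


# Stage 1: pretrain module 1 alone on its isolated objective.
(w1,) = descend(
    [0.1],
    lambda p: [mean(2 * (p[0] * x - h) * x for x, h in zip(xs, h_star))],
    lr=0.05, steps=300)

# Stage 2: freeze w1 and train module 2 on the global loss.
(w2,) = descend(
    [0.1],
    lambda p: [mean(2 * (p[0] * w1 * x - y) * w1 * x
                    for x, y in zip(xs, y_star))],
    lr=0.02, steps=300)


# Stage 3: unfreeze both modules and fine-tune jointly with a small step size.
def joint_grad(p):
    a, b = p
    errs = [b * a * x - y for x, y in zip(xs, y_star)]
    return [mean(2 * e * b * x for e, x in zip(errs, xs)),
            mean(2 * e * a * x for e, x in zip(errs, xs))]


w1, w2 = descend([w1, w2], joint_grad, lr=0.005, steps=100)
print(w1, w2, global_loss(w1, w2))
```

In this toy case the staged phases already reach the global optimum, so the joint phase makes almost no change. In a realistic system the isolated objectives are only approximations, and the joint phase is where the remaining mismatch between modules would be corrected.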

7. Limitations and Future Perspectives

E2E learning capacity cannot be fully characterized by classical model expressiveness (e.g., approximation theory, VC-dimension). Scalability is instead dictated by optimization dynamics, gradient alignment, and the compounded statistical complexity of modular composition (Glasmachers, 2017).
No universal scaling laws have yet emerged. Empirical limits vary by domain, architecture, dataset, and optimization protocol.
In reinforcement learning and dynamic tasks, capacity is particularly sensitive to architecture-specific initialization (e.g., RNN Jacobians near identity), the size of recurrent states, and the richness of the reward structure and training distribution (Shibata, 2017).
Open questions remain regarding the precise mapping between network parameterization, module count, statistical sample budget, and emergent E2E functional capacity, especially in large-scale, safety-critical, or nonstationary settings.

End-to-end learning capacity remains a critical and nuanced property, reflecting the intersection of system architecture, task structure, statistical learning theory, and optimization landscape. Exploiting this capacity requires both architectural and methodological commitments to jointly optimize expressiveness, efficiency, and stability in the face of modular complexity.