Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks

Published 18 Feb 2026 in stat.ML, cs.AI, and cs.LG | (2602.16177v2)

Abstract: In this work, we propose a notion of practical learnability grounded in finite sample settings, and develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. Building on this foundation, we demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and we establish a corresponding convergence theorem. We further elucidate the impact of batch size and model architecture (including depth, parameter count, sparsity, skip connections, and other characteristics) on non-convex optimization. Additionally, we derive a model-agnostic lower bound for the achievable empirical risk, theoretically demonstrating that data determines the fundamental limit of trainability. On the generalization front, we derive deterministic and probabilistic bounds on generalization error based on generalized conditional entropy measures. The former explicitly delineates the range of generalization error, while the latter characterizes the distribution of generalization error relative to the deterministic bounds under independent and identically distributed (i.i.d.) sampling conditions. Furthermore, these bounds explicitly quantify the influence of three key factors: (i) information loss induced by irreversibility in the model, (ii) the maximum attainable loss value, and (iii) the generalized conditional entropy of features with respect to labels. Moreover, they offer a unified theoretical lens for understanding the roles of regularization, irreversible transformations, and network depth in shaping the generalization behavior of deep neural networks. Extensive experiments validate all theoretical predictions, confirming the framework's correctness and consistency.


Summary

The paper introduces a novel conjugate learning theory that explains trainability and generalization through convex duality and gradient energy bounds.
It rigorously establishes Fenchel-Young losses as the unique, well-behaved loss functions while detailing the effects of architecture, SGD, and overparameterization.
Empirical results validate the theory across multiple architectures, linking design choices like skip connections to improved optimization and generalization.

Conjugate Learning Theory: Unified Principles of Deep Neural Network Trainability and Generalization

Introduction and Motivation

"Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks" (2602.16177) provides a comprehensive theoretical framework uniting the study of trainability and generalization in deep neural networks (DNNs). The major challenge addressed is the lack of a theory that explains why overparameterized, highly non-convex DNNs are efficiently trainable with simple SGD and, at the same time, generalize well even in regimes where classical theory would predict overfitting. The proposed framework, rooted in convex conjugate duality, subsumes much of existing loss function theory and explains how model architecture (depth, width, parameter sharing) and data properties jointly produce the observed learning dynamics.

Theoretical Foundations: Practical Learnability and the Conjugate Framework

The paper builds on the statistical insight that conditional distribution estimation from finite data is fundamentally constrained by the Pitman-Koopman-Darmois theorem, which restricts "practically learnable" conditional distributions to the exponential family unless extra structure is imposed. Consequently, the paper treats virtually all machine learning problems as conditional exponential family estimation, typically solved by minimizing the negative log-likelihood.

This insight yields a formalization: all practical learning objectives can be cast as the minimization of a Fenchel-Young loss, induced by a strictly convex generating function, between the target and the dual-transformed model output, subject to convex constraints encoding domain knowledge. This construction is general and directly encompasses softmax cross entropy, MSE, and many other standard objectives.
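
As a minimal sketch of how one construction can cover both objectives, the snippet below evaluates the standard Fenchel-Young loss $\Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle$ for two generating functions. The function names (`fy_loss`, `neg_entropy`, `half_sq`) and the example vectors are illustrative assumptions, not taken from the paper. With negative Shannon entropy on the simplex (conjugate: log-sum-exp) the loss coincides with softmax cross entropy; with half the squared norm (self-conjugate) it coincides with half the squared error.

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log-sum-exp; the convex conjugate of negative entropy on the simplex."""
    m = np.max(z)
    return float(m + np.log(np.sum(np.exp(z - m))))

def neg_entropy(p):
    """Generating function Omega(p) = sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz])))

def half_sq(v):
    """Generating function Omega(v) = 0.5 * ||v||^2, which is its own conjugate."""
    return 0.5 * float(np.dot(v, v))

def fy_loss(omega, omega_conj, theta, y):
    """Fenchel-Young loss: Omega*(theta) + Omega(y) - <theta, y>."""
    return omega_conj(theta) + omega(y) - float(np.dot(theta, y))

theta = np.array([2.0, -1.0, 0.5])   # raw model output (hypothetical values)
y = np.array([0.0, 1.0, 0.0])        # one-hot target

# Negative entropy -> softmax cross entropy.
fy_ce = fy_loss(neg_entropy, logsumexp, theta, y)
softmax = np.exp(theta - logsumexp(theta))   # gradient of log-sum-exp, i.e. the dual map
cross_entropy = -float(np.sum(y * np.log(softmax)))

# Half squared norm -> half squared error.
fy_mse = fy_loss(half_sq, half_sq, theta, y)
half_squared_error = 0.5 * float(np.sum((theta - y) ** 2))

print(fy_ce, cross_entropy)
print(fy_mse, half_squared_error)
```

In both cases the dual map is the gradient of the conjugate: softmax for log-sum-exp, the identity for the squared norm. This is one way to read "dual-transformed model output".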

Figure 1: Schematic illustration of the conjugate learning framework: input transformation, model output, conjugate mapping, and target approximation are tightly integrated via duality.

Key Advancements:

Rigorously proves that Fenchel-Young losses are the unique class of well-behaved (differentiable, strictly convex, proper) loss functions for finite data estimation.
Explicitly quantifies and leverages convex constraints (e.g., probability simplex for classification), showing architectural mechanisms such as softmax emerge naturally in the conjugate duality view.
Offers a universal protocol for integrating prior knowledge and ensuring target space consistency.

Mechanisms of Trainability: Gradient Energy and Structure Matrix

A central theoretical contribution is the identification of the "structure matrix" $A_x = \nabla_\theta f_\theta(x)\nabla_\theta f_\theta(x)^\top$, which characterizes the sensitivity of model output to parameters and encodes architectural properties such as depth, width, skip connections, and parameter sharing.

The gradient energy (the expected squared norm of the loss gradient with respect to the parameters) is shown to tightly bound the empirical (training) risk, up to scalar factors determined by the extremal eigenvalues of the structure matrix.
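
For a single sample under squared error, a sandwich of this kind follows from a Rayleigh-quotient argument: with residual $r = f_\theta(x) - y$ and loss $\tfrac12\|r\|^2$, the squared gradient norm equals $r^\top A_x r$, so it lies between $\lambda_{\min}(A_x)\|r\|^2$ and $\lambda_{\max}(A_x)\|r\|^2$. The sketch below checks this numerically on a toy two-layer network. The network, its sizes, and the variable names are assumptions for illustration; the paper's exact constants and its expectation over samples are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 5, 8, 3                      # input, hidden, output sizes (arbitrary)
W1 = rng.normal(size=(m, d)) / np.sqrt(d)
W2 = rng.normal(size=(k, m)) / np.sqrt(m)
x = rng.normal(size=d)
y = rng.normal(size=k)

# Forward pass of a toy network f(x) = W2 tanh(W1 x).
h = np.tanh(W1 @ x)
f = W2 @ h

# Jacobian of f w.r.t. all parameters, flattened row-major as [W1, W2]:
#   d f_i / d W1[j, l] = W2[i, j] * (1 - h_j^2) * x_l
#   d f_i / d W2[i, j] = h_j
J_W1 = np.kron(W2 * (1.0 - h**2), x[None, :])   # shape (k, m*d)
J_W2 = np.kron(np.eye(k), h[None, :])           # shape (k, k*m)
J = np.hstack([J_W1, J_W2])

A = J @ J.T                                     # structure matrix A_x, shape (k, k)
lam = np.linalg.eigvalsh(A)                     # eigenvalues in ascending order
lam_min, lam_max = lam[0], lam[-1]

residual = f - y
loss = 0.5 * residual @ residual                # squared-error loss for this sample
grad = J.T @ residual                           # loss gradient w.r.t. the parameters
energy = grad @ grad                            # squared gradient norm

lower = energy / (2.0 * lam_max)
upper = energy / (2.0 * lam_min)
print(f"{lower:.4f} <= {loss:.4f} <= {upper:.4f}")
```

The closer `lam_min` and `lam_max` are, the tighter the two bounds pinch the loss, which is why the conditioning of the structure matrix matters.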

Figure 2: Mapping of gradient energy minima to empirical risk minima, with critical dependence on the conditioning of the structure matrix.

Critical Theoretical Results:

Empirical risk is sandwiched between multiples of gradient energy determined by the structure matrix's minimal and maximal eigenvalues. If $A_x$ is well-conditioned (positive definite with a low condition number), minimizing the gradient energy suffices to guarantee low empirical risk, regardless of non-convexity.
For softmax cross entropy and MSE, explicit, tight bounds between empirical risk and gradient energy are derived.
Architectural modifications (e.g., skip connections) can mitigate the exponential eigenvalue decay induced by increased depth, explaining the practical trainability improvements observed for ResNets and similar structures.
Figure 3: Architectural schematic: skip connections (Model B) mitigate depth-induced eigenvalue decay of the structure matrix as compared to plain networks (Model A).
Overparameterization (increased width or more independent parameters) provably reduces the spectral gap of $A_x$ (here, its condition number), placing the network near the "lazy regime" of NTK theory where training is near-linear and the risk bounds become tight.
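
The depth effect in the skip-connection item above can be illustrated with a deliberately simplified toy: by the chain rule, the sensitivity of the output to early-layer parameters passes through the product of the downstream layer Jacobians, so the spectrum of that product feeds into the structure matrix. The sketch below assumes purely linear layers with small random weights (an assumption for illustration, not the paper's setting) and compares the singular values of a plain product of weight matrices with those of the residual form, where each factor is identity plus the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, scale = 16, 10, 0.1          # width, number of layers, weight scale (arbitrary)
layers = [scale * rng.normal(size=(n, n)) / np.sqrt(n) for _ in range(depth)]

def singular_values_of_product(mats):
    """Singular values (descending) of mats[-1] @ ... @ mats[0]."""
    prod = np.eye(mats[0].shape[0])
    for mat in mats:
        prod = mat @ prod
    return np.linalg.svd(prod, compute_uv=False)

# Plain stack: end-to-end Jacobian is the product of the weight matrices.
plain_sv = singular_values_of_product(layers)

# Residual stack: each layer computes x + W x, so each factor is (I + W).
resid_sv = singular_values_of_product([np.eye(n) + W for W in layers])

print("plain    max/min singular value:", plain_sv.max(), plain_sv.min())
print("residual max/min singular value:", resid_sv.max(), resid_sv.min())
```

In this toy, every factor of the plain product shrinks the signal, so its singular values collapse with depth, while the identity path keeps the residual product's singular values bounded away from zero.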

Optimization Dynamics: SGD, Batch Size, and Gradient Correlation

The analysis of SGD is generalized: convergence of gradient energy can be tightly bounded in terms of a "gradient correlation factor" $M$, which quantifies how mini-batch updates transfer to out-of-batch samples. Crucially, $M$ is reduced by (a) smaller batch sizes, (b) more parameter independence, and (c) sparser connectivity, all properties that both empirical practice and architecture search favor for robust optimization.

Hence, controlling the structure matrix (by architecture), the gradient energy (by SGD and batch-size tuning), and $M$ yields a unified recipe for efficient trainability in deep models.

Sharp Generalization Bounds: Deterministic and Probabilistic

Generalization analysis leverages generalized (and classical) conditional entropy, deriving:

Deterministic generalization bounds: For any sampling method, the generalization gap is upper bounded by model-specific factors (maximum loss $\gamma_\Phi$, relative/absolute information loss induced by the model, and generalized conditional entropy of the data). This bound is architecture-agnostic and does not require i.i.d. assumptions.
Probabilistic bounds: For i.i.d. samples, the generalization error's distribution is controlled by the effective support size of the feature space (collapsed by absolute information loss), sample size, model output stability ($\gamma_\Phi$), and data distribution smoothness.
The theory interprets regularization (e.g., via the $L_2$ or $L_p$ norm) as explicitly controlling $\gamma_\Phi$, thus linking norm-based parameter control directly to tight generalization bounds rather than treating it solely as implicit bias.
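
To make the link between norm control and maximum loss concrete, the sketch below uses a linear softmax classifier as a stand-in (an assumption; the paper's precise definition of $\gamma_\Phi$ is not given in this summary). For logits $z = Wx$ over $K$ classes, cross entropy satisfies $\log\sum_j e^{z_j} - z_y \le \log K + 2\|z\|_\infty \le \log K + 2\|W\|_F\|x\|_2$, so shrinking the Frobenius ($L_2$) norm of the weights lowers a ceiling on the attainable loss. The data and sizes are synthetic and arbitrary.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Per-sample softmax cross entropy, computed with a stable log-sum-exp."""
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return lse - logits[np.arange(len(labels)), labels]

rng = np.random.default_rng(0)
n, d, k = 500, 10, 5                         # samples, features, classes (arbitrary)
X = rng.normal(size=(n, d))
labels = rng.integers(0, k, size=n)
W0 = rng.normal(size=(k, d))
radius = np.linalg.norm(X, axis=1).max()     # largest input norm in the sample

results = []
for c in (4.0, 1.0, 0.25):                   # target Frobenius norms, largest first
    W = W0 * (c / np.linalg.norm(W0))        # rescale so that ||W||_F == c
    max_loss = cross_entropy(X @ W.T, labels).max()
    bound = np.log(k) + 2.0 * c * radius     # ceiling implied by the weight norm
    results.append((c, max_loss, bound))
    print(f"||W||_F = {c:4.2f}  max loss = {max_loss:.3f}  ceiling = {bound:.3f}")
```

The ceiling is loose, but it moves in lockstep with the weight norm, which is the qualitative behavior the regularization item describes.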

Empirical Validation

Comprehensive experiments are performed with classic architectures (LeNet, ResNet18, ViT) and custom ablation networks across multiple datasets to empirically substantiate:

Throughout training, the upper and lower theoretical bounds (functions of the structure matrix eigenvalues and the gradient energy) closely track the observed empirical risk.
Skip connections, depth scaling, and width scaling impact the structure matrix exactly as predicted by the theory. Overparameterization reliably reduces the spectral gap, and deeper residual architectures maintain manageable eigenvalues.
Dynamic correlation measurements (via sliding window Pearson coefficients) confirm that empirical risk is controlled by gradient energy only when structure matrix eigenvalues have stabilized, as the theory predicts.
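
A sliding-window Pearson coefficient of the kind mentioned above can be computed as in the following sketch. The window length, the helper name `sliding_pearson`, and the synthetic curves are assumptions for illustration; the summary does not state the paper's actual settings.

```python
import numpy as np

def sliding_pearson(a, b, window):
    """Pearson correlation of a and b over each length-`window` slice.

    Returns an array of length len(a) - window + 1. A slice in which either
    series is constant yields NaN, because the correlation is undefined there.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("a and b must be 1-D arrays of equal length")
    if not 2 <= window <= a.size:
        raise ValueError("window must satisfy 2 <= window <= len(a)")
    out = np.empty(a.size - window + 1)
    for i in range(out.size):
        out[i] = np.corrcoef(a[i:i + window], b[i:i + window])[0, 1]
    return out

# Synthetic stand-ins for logged training curves (not real measurements).
steps = np.arange(200)
gradient_energy = np.exp(-steps / 50.0)
empirical_risk = 3.0 * gradient_energy + 0.1   # affine in energy, so correlation is 1

r = sliding_pearson(gradient_energy, empirical_risk, window=20)
print(r.shape, r.min(), r.max())
```

On real logs, windows where the coefficient stays near 1 would indicate that the risk is tracking the gradient energy, which can then be compared against when the structure matrix eigenvalues settle.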

Figure 4: Training dynamics for LeNet under softmax cross entropy: empirical risk and theoretical bounds evolve in lockstep throughout training.

Figure 5: Training dynamics for ResNet18 with softmax cross entropy, showing similar empirical-theoretical matching.

Figure 6: Training dynamics for ViT under softmax cross entropy. Dynamics are consistent with convolutional and residual architectures, validating the theory’s generality.

Implications and Theoretical Synthesis

The conjugate learning theory offers a concrete answer to several open problems in deep learning theory:

Optimization Hardness: It clarifies why SGD can globally minimize highly non-convex, overparameterized DNNs, provided structure matrix conditioning is preserved through architectural design.
Generalization Paradox: The strong generalization observed in deep, overparameterized DNNs becomes explicable; the crucial factors are architectural control of the output space, information-theoretic regularization, and gradient decorrelation induced by batch size and structure, rather than traditional notions of VC-dimension or hypothesis class cardinality.
Loss Function Selection: The Fenchel-Young framework is both necessary and sufficient for loss function design; ad hoc choices outside this class are neither theoretically justified nor optimal for trainability or generalization.
Interpretability of Regularization: Classical explicit ($L_2$, $L_p$) and implicit (architecture, batch size) regularization gain a precise role, quantifiably connected to maximum loss control and, thus, generalization.
Practical Evaluation Beyond Test Sets: Information-theoretic metrics intrinsic to the model and data can, in principle, supplement or supersede conventional test-set-based comparison for robust evaluation in open-world settings.

Conclusion

Conjugate learning theory provides a rigorous, universal framework encapsulating DNN trainability and generalization, grounded in convex conjugate duality and conditional exponential family estimation. The synergy between theory and experiment across architectures and loss functions substantiates its claims. The framework not only demystifies why modern architectures and optimization strategies work but furnishes principled guidelines for the design of future models. Extensions to continuous feature discretization, further tightening of generalization bounds, and adaptive algorithmic strategies informed by the gradient correlation factor are promising directions for advancing both the mathematical and practical frontiers of deep learning.
