- The paper presents a unified framework that integrates information theory with statistical learning by optimizing divergence measures like KL divergence and f-divergences across different model classes.
- The paper rigorously analyzes supervised learning techniques—including regression, classification, and deep neural networks—emphasizing optimization properties and generalization trade-offs.
- The paper systematically examines diverse generative models, including VAEs, GANs, diffusion, and score-based models, highlighting practical implications for modern AI and statistical inference.
This chapter, slated for the third edition of Elements of Information Theory by Cover and Thomas, provides a rigorous unification of information theory and statistical learning, highlighting both foundational principles and contemporary model classes that underpin modern machine learning practice. The exposition organizes the landscape of statistical learning through the lens of divergence minimization, information-theoretic bounds, and algorithmic realizations, connecting classic approaches (such as regression and PCA) with advanced generative modeling techniques (VAEs, GANs, diffusion, and score-based models).
Statistical learning is framed as minimizing a divergence, most often the relative entropy (KL divergence), between the empirical data distribution P_data and a member of a chosen model family P, parameterized by θ. Importantly, the true data-generating distribution is not assumed to lie in P, which introduces approximation and generalization trade-offs that are absent from classical parameter estimation. Model selection is thus governed by both prior assumptions and the expressive power of P, be it a family of conditional Gaussians, mixtures, or deep neural networks.
The chapter instantiates this generality by systematically describing the divergence measures (cross-entropy, f-divergences, Fisher divergence) that provide the most natural training objectives for different model classes, establishing the theoretical connection between maximum likelihood estimation and cross-entropy minimization and motivating alternatives when MLE is computationally intractable or statistically problematic.
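The link between maximum likelihood and cross-entropy can be checked numerically. The following is a minimal sketch, assuming a three-symbol alphabet and a softmax-parameterized categorical model; the names `model_probs`, `theta`, and the sampling probabilities are illustrative and not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1000 draws from a 3-symbol source.
samples = rng.choice(3, size=1000, p=[0.5, 0.3, 0.2])
p_data = np.bincount(samples, minlength=3) / samples.size  # empirical distribution

def model_probs(theta):
    """Categorical model P_theta defined by a softmax over the logits theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def avg_neg_log_likelihood(theta):
    """Average negative log-likelihood of the samples under P_theta."""
    return -np.mean(np.log(model_probs(theta)[samples]))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

theta = np.array([0.2, -0.1, 0.4])       # arbitrary (assumed) parameter value
q = model_probs(theta)
nll = avg_neg_log_likelihood(theta)
ce = cross_entropy(p_data, q)
entropy = cross_entropy(p_data, p_data)  # H(P_data), which does not depend on theta
```

Here `nll` equals `ce`, and `ce` equals `entropy + kl(p_data, q)`. Since the entropy term does not depend on θ, maximizing likelihood, minimizing cross-entropy, and minimizing KL divergence from the empirical distribution all select the same θ.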
Supervised Learning: Regression and Classification
Linear and logistic regression are revisited from an information-theoretic perspective. Minimizing the conditional cross-entropy is equivalent to maximizing the likelihood of the observed data, which justifies the ubiquity of these criteria in both regression and classification. For linear regression, the model class consists of conditional Gaussians with linear mean functions, and the MLE yields the familiar least squares estimator. For logistic regression, the model is Bernoulli with a logistic link, and the MLE minimizes the (convex) average log loss.
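The correspondence between the Gaussian MLE and least squares can be illustrated as follows. This is a sketch on synthetic data, assuming a fixed noise variance; `w_true`, the noise level, and the helper `gaussian_nll` are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])      # hypothetical ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=n)

def gaussian_nll(w, sigma2=1.0):
    """Average negative log-likelihood of y | x ~ N(w^T x, sigma2)."""
    resid = y - X @ w
    return 0.5 * np.log(2 * np.pi * sigma2) + np.mean(resid ** 2) / (2 * sigma2)

# Least squares solution computed by a standard solver.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the average NLL (with sigma2 = 1) evaluated at w_ls.
# It vanishes, so the least squares solution is also the MLE of the mean weights.
grad_at_ls = -X.T @ (y - X @ w_ls) / n
```

Because the negative log-likelihood depends on `w` only through the mean squared residual, any value of `sigma2` yields the same minimizer.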
The extension to multi-class logistic regression via the softmax function generalizes the model to categorical outputs. Unlike linear regression, no closed-form solution is generally available, but the log-likelihood remains concave (equivalently, the log loss is convex), so gradient ascent with a suitable step size converges to a global optimum.
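A minimal sketch of softmax regression trained by gradient ascent on the average log-likelihood is given below. The synthetic data, the step size of 0.5, and the iteration count are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
n, d, k = 300, 2, 3
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, k))                     # hypothetical generating weights
y = np.array([rng.choice(k, p=p) for p in softmax(X @ W_true)])
Y = np.eye(k)[y]                                     # one-hot labels

def avg_log_likelihood(W):
    return np.mean(np.log(softmax(X @ W)[np.arange(n), y]))

W = np.zeros((d, k))
history = [avg_log_likelihood(W)]
for _ in range(200):
    grad = X.T @ (Y - softmax(X @ W)) / n            # gradient of the average log-likelihood
    W += 0.5 * grad                                  # ascent step (assumed step size)
    history.append(avg_log_likelihood(W))
```

With zero initial weights the model is uniform, so `history[0]` equals `-log(3)`. Because the objective is concave and the step size is small relative to its curvature, the values in `history` should not decrease.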
Neural Network Architectures
Neural networks are introduced as universal approximators, bridging the gap between classical statistical models and high-capacity function approximators. The universal approximation theorem is invoked for two-layer networks with nonlinear activation functions, but practical limitations drive the necessity for deeper architectures. The chapter outlines feedforward, convolutional, and attention-based network architectures as specialized cases, with training via stochastic gradient-based optimization and backpropagation. The non-convexity of deep network training is highlighted, emphasizing the divergence from earlier, closed-form statistical estimators.
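To make the training procedure concrete, here is a small sketch of a two-layer tanh network with hand-written backpropagation, a finite-difference gradient check, and plain full-batch gradient descent (used here in place of stochastic mini-batches for brevity). The layer sizes, learning rate, and regression target are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(64, 4))
y = np.sin(X[:, :1])                       # hypothetical regression target, shape (64, 1)

params = {
    "W1": 0.5 * rng.normal(size=(4, 8)), "b1": np.zeros(8),
    "W2": 0.5 * rng.normal(size=(8, 1)), "b2": np.zeros(1),
}

def loss_and_grads(p):
    """Half mean squared error of a two-layer tanh network, with manual backpropagation."""
    h = np.tanh(X @ p["W1"] + p["b1"])     # hidden activations
    out = h @ p["W2"] + p["b2"]
    err = out - y
    loss = 0.5 * np.mean(err ** 2)
    d_out = err / len(X)                   # dL/d(out)
    d_h = (d_out @ p["W2"].T) * (1.0 - h ** 2)   # chain rule through tanh
    grads = {
        "W2": h.T @ d_out, "b2": d_out.sum(axis=0),
        "W1": X.T @ d_h, "b1": d_h.sum(axis=0),
    }
    return loss, grads

# Finite-difference check of one gradient entry.
loss0, grads0 = loss_and_grads(params)
eps = 1e-6
bumped = {name: value.copy() for name, value in params.items()}
bumped["W1"][0, 0] += eps
numeric = (loss_and_grads(bumped)[0] - loss0) / eps

# Full-batch gradient descent; the loss surface is non-convex in the weights.
for _ in range(300):
    _, g = loss_and_grads(params)
    for name in params:
        params[name] = params[name] - 0.05 * g[name]
final_loss = loss_and_grads(params)[0]
```

Unlike the regression cases above, no global guarantee applies here; the sketch only shows that the backpropagated gradient matches a numerical estimate and that descent reduces the training loss on this example.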
Generative Models: Autoregressive, Latent Variable, and Divergence-Based Approaches
Autoregressive Models
Described via the chain rule, autoregressive models sequentially parameterize each conditional distribution. Modeling every conditional independently scales poorly with the number of conditionals, so modern practice (e.g., Transformers) employs a single parametric model with shared weights for all conditionals, amortizing both parameterization and computation. Perplexity is stressed as a primary metric for language modeling.
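The chain-rule factorization and the perplexity metric can be sketched with a toy model. The example below assumes a first-order (bigram) simplification, in which each conditional depends only on the previous symbol and one shared table serves every position; a Transformer would condition on the full prefix. The corpus and the smoothing constant are hypothetical.

```python
import math
from collections import Counter

corpus = "abracadabra"                     # hypothetical toy sequence
vocab = sorted(set(corpus))

# One shared conditional table P(next | prev), reused at every position.
pair_counts = Counter(zip(corpus, corpus[1:]))
prev_counts = Counter(corpus[:-1])

def cond_prob(prev, nxt, alpha=1.0):
    """Add-alpha smoothed bigram probability P(nxt | prev)."""
    return (pair_counts[(prev, nxt)] + alpha) / (prev_counts[prev] + alpha * len(vocab))

def log_prob(seq):
    """Chain rule: log P(x_2, ..., x_n | x_1) = sum_i log P(x_i | x_{i-1})."""
    return sum(math.log(cond_prob(p, n)) for p, n in zip(seq, seq[1:]))

def perplexity(seq):
    """Exponential of the average negative log-probability per predicted symbol."""
    return math.exp(-log_prob(seq) / (len(seq) - 1))

ppl = perplexity(corpus)
```

A perplexity equal to the vocabulary size corresponds to uniform guessing, so a value of `ppl` below `len(vocab)` indicates that the model has captured some structure in the sequence.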
Latent Variable Models and EM
Latent variable models, such as probabilistic PCA and Gaussian Mixture Models, are rigorously formulated, with marginal likelihood maximization approached via the EM algorithm. The EM algorithm is derived as coordinate ascent on a lower bound to the log-likelihood, and its connection to variational approaches is made explicit. The evidence lower bound (ELBO) serves as a general strategy for training intractable generative models, with equality to the true log-likelihood achieved if and only if the variational posterior coincides with the true posterior.
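As an illustration of EM as alternating maximization of a lower bound, the sketch below fits a two-component one-dimensional Gaussian mixture. The synthetic data, the initial parameters, and the iteration count are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.5, 100)])

def log_gauss(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

def log_likelihood(pi, mu, var):
    comp = np.log(pi) + log_gauss(x[:, None], mu, var)     # shape (n, 2)
    return np.sum(np.logaddexp(comp[:, 0], comp[:, 1]))

pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
trace = [log_likelihood(pi, mu, var)]
for _ in range(50):
    # E-step: responsibilities are the exact posterior over the latent component,
    # which makes the lower bound tight at the current parameters.
    comp = np.log(pi) + log_gauss(x[:, None], mu, var)
    resp = np.exp(comp - np.logaddexp(comp[:, 0], comp[:, 1])[:, None])
    # M-step: closed-form maximizers of the expected complete-data log-likelihood.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    trace.append(log_likelihood(pi, mu, var))
```

Each iteration first tightens the bound and then maximizes it, so the entries of `trace` should be non-decreasing, which is the standard monotonicity property of EM.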
Variational Autoencoders (VAEs)
VAEs generalize to high-capacity encoder and decoder parameterizations using neural networks, with amortized inference facilitating scalable estimation of the variational lower bound. The ELBO rewards faithful reconstruction of the data from latent codes while penalizing the KL divergence between the approximate posterior and the prior; maximizing it over the encoder also shrinks the gap to the log-likelihood, which is the divergence between the approximate and true posteriors.
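The two readings of the ELBO can be written side by side. The notation here is the conventional one and is assumed rather than quoted from the chapter: q_φ(z|x) is the encoder, p_θ(x|z) the decoder, and p(z) the prior.

```latex
\begin{align}
\log p_\theta(x)
  &= \mathrm{ELBO}(\theta,\phi;x)
   + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\right), \\
\mathrm{ELBO}(\theta,\phi;x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
   - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).
\end{align}
```

The first line shows that the bound is tight exactly when the approximate posterior matches the true posterior. The second line is the form typically optimized in practice: a reconstruction term minus a KL penalty toward the prior.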
Diffusion Models
Diffusion models are described as hierarchical Markov latent-variable models trained by maximizing an ELBO constructed with a forward noising process. Training fits a reverse-time (denoising) process, and sampling from that process enables high-fidelity data generation. The formulation captures connections to both variational methods and score matching, as detailed below.
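The forward noising process is commonly taken to be a Gaussian Markov chain whose marginals are available in closed form. The sketch below assumes a variance-preserving chain with a linear noise schedule, a common choice that the chapter may or may not adopt; `betas`, `alpha_bar`, and `q_sample` are illustrative names.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)         # hypothetical linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x0, (1 - alpha_bar_t) I) in one step."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

# Verify the closed form by iterating x_t = sqrt(alpha_t) x_{t-1} + sqrt(beta_t) eps_t
# and tracking the coefficient on x_0 and the accumulated noise variance.
mean_coef, var = 1.0, 0.0
for t in range(T):
    mean_coef *= np.sqrt(alphas[t])
    var = alphas[t] * var + betas[t]

x_t, eps = q_sample(np.ones(5), 10, np.random.default_rng(0))
```

After all `T` steps the signal coefficient is close to zero and the variance close to one, so the final latent is approximately standard Gaussian regardless of the starting point.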
GANs: f-Divergence Minimization via Adversarial Training
GANs are described as minimax games between discriminators and generators, with the Jensen-Shannon divergence (a type of f-divergence) emerging as the generator's objective under the classical loss when the discriminator is optimal. The analysis also makes clear that high likelihood, or even low divergence, does not guarantee high sample fidelity, motivating the continued investigation of alternative divergences and robust training procedures.
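The appearance of the Jensen-Shannon divergence can be verified on a small discrete example. The sketch below assumes the classical value function V(D, G) = E_data[log D(x)] + E_gen[log(1 - D(x))]; the two distributions are hypothetical.

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])         # hypothetical data distribution
p_gen = np.array([0.2, 0.2, 0.6])          # hypothetical generator distribution

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value(d):
    """Classical GAN value for a discriminator given as a table of probabilities d(x)."""
    return np.sum(p_data * np.log(d)) + np.sum(p_gen * np.log(1.0 - d))

d_star = p_data / (p_data + p_gen)         # pointwise-optimal discriminator
v_star = gan_value(d_star)
v_perturbed = gan_value(d_star + 0.05)     # any other discriminator scores lower
```

At the optimal discriminator the value equals `2 * js(p_data, p_gen) - log(4)`, so minimizing over the generator amounts to minimizing the Jensen-Shannon divergence.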
Score-Based and Fisher Divergence Models
Score-based models eschew traditional likelihood-based losses in favor of Fisher divergence. The motivation is that likelihood computation can be intractable for energy-based models (due to unknown partition functions), while the score (gradient of the log-density) is often tractable. The denoising autoencoder objective is shown to be equivalent to Fisher divergence minimization, which is tightly connected to Tweedie’s formula from Bayesian estimation theory, directly relating score estimation in noisy domains to Bayes-optimal denoising.
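Tweedie's formula can be checked in a case where everything is available in closed form. The sketch below assumes a scalar Gaussian prior x ~ N(0, s2) and additive Gaussian noise of variance sigma2; both values are hypothetical.

```python
import numpy as np

s2, sigma2 = 4.0, 1.0                      # hypothetical prior and noise variances

def score_noisy(y):
    """Score of the noisy marginal p(y) = N(0, s2 + sigma2)."""
    return -y / (s2 + sigma2)

def tweedie_denoise(y):
    """Tweedie's formula: E[x | y] = y + sigma2 * d/dy log p(y)."""
    return y + sigma2 * score_noisy(y)

def posterior_mean(y):
    """Exact Gaussian posterior mean, used only for comparison."""
    return y * s2 / (s2 + sigma2)

rng = np.random.default_rng(5)
x = rng.normal(0.0, np.sqrt(s2), size=100_000)
y = x + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
mse_denoised = np.mean((tweedie_denoise(y) - x) ** 2)
mse_raw = np.mean((y - x) ** 2)
```

The denoiser built from the score coincides with the Bayes-optimal posterior mean, and it achieves a lower mean squared error than the raw noisy observation. This is the sense in which estimating the score and learning to denoise are two views of the same problem.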
The equivalence between ELBO maximization for diffusion models and denoising score matching is rigorously established, unifying two major generative modeling paradigms.
Comparative Analysis of Divergence Measures
The chapter formalizes a broad variety of divergence measures, namely f-divergences (including KL, Jensen-Shannon, total variation, and χ²) together with the closely related Rényi divergences, and articulates the theoretical properties (data processing, non-negativity, etc.) and practical consequences of each. The distinctions between likelihood-based, adversarial, and score-based training are made structurally clear in these terms.
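Several of these divergences share a single template, D_f(P || Q) = Σ q(x) f(p(x)/q(x)) for a convex f with f(1) = 0. The sketch below assumes strictly positive discrete distributions; the generator functions follow standard conventions, and the dictionary name `generators` is illustrative.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x) / q(x)) for strictly positive discrete p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(q * f(p / q))

generators = {
    "kl": lambda t: t * np.log(t),
    "total_variation": lambda t: 0.5 * np.abs(t - 1.0),
    "chi2": lambda t: (t - 1.0) ** 2,
    "jensen_shannon": lambda t: 0.5 * (t * np.log(2 * t / (t + 1)) + np.log(2 / (t + 1))),
}

p = np.array([0.5, 0.3, 0.2])              # hypothetical distributions
q = np.array([0.2, 0.2, 0.6])
values = {name: f_divergence(p, q, f) for name, f in generators.items()}

# Data processing: merging the first two symbols cannot increase any f-divergence.
p_merged = np.array([p[0] + p[1], p[2]])
q_merged = np.array([q[0] + q[1], q[2]])
merged = {name: f_divergence(p_merged, q_merged, f) for name, f in generators.items()}
```

Each entry of `merged` is no larger than the corresponding entry of `values`, which illustrates the data processing inequality on one coarse-graining of the alphabet.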
Implications and Future Directions
The synthesis of information-theoretic and statistical learning perspectives advanced in this chapter provides a principled taxonomy of machine learning algorithm design, highlights the trade-offs inherent in model selection, and articulates why different divergence measures and training paradigms are preferable in different settings. Practically, this framework clarifies the rationale for the architectural and algorithmic choices underlying models at the forefront of AI (e.g., LLMs, diffusion models, GANs).
On the theoretical front, the chapter's structure presages analytical investigations into generalization, expressivity, approximation error, and convergence rates for modern statistical learning systems, many of which are grounded in the same information-theoretic constructs (e.g., entropy, mutual information, divergence) discussed here.
It is expected that future advances—both in the design of new model classes and in analyzing generalization and robustness—will continue to draw from these foundational concepts, particularly as models reach even higher capacity and are applied to more challenging, multimodal, or adversarially robust domains.
Conclusion
This chapter rigorously integrates statistical learning and information theory, providing a cohesive perspective on both classical and modern approaches to learning from data. With detailed technical treatment of supervised learning, deep architectures, generative models, and divergence minimization strategies, the work yields a foundational reference that is essential for researchers developing, analyzing, or applying statistical learning algorithms in both theoretical and applied settings.