Thermodynamics of Learning

Updated 28 February 2026

Thermodynamics of Learning is a framework that models learning as a physical process where energy dissipation and entropy reduction drive information acquisition.
It employs stochastic and geometric techniques to quantify trade-offs, establishing universal bounds on efficiency and the minimal cost of model adaptation.
The approach informs algorithm design and continual learning strategies by linking irreversible phenomena with practical constraints in both biological and artificial systems.

Thermodynamics of Learning encompasses a rigorous program that recasts learning phenomena—across biological, neural, and artificial systems—as fundamentally thermodynamic processes subject to precise physical constraints. This perspective leverages the machinery of stochastic and nonequilibrium thermodynamics, mapping notions such as entropy, free energy, irreversibility, and efficiency directly onto learning algorithms, statistical inference, and network adaptation. Recent advances have characterized model training as a dissipative process governed by thermodynamic trade-offs, clarified universal efficiency bounds for information acquisition, and revealed deep geometrical constraints on learning capacity, adaptivity, and continual updating.

1. Thermodynamic Frameworks for Learning: State, Dynamics, and Cost

Learning systems are modeled as open, driven dissipative systems exchanging energy, entropy, and information with their environment. Typical models distinguish “microstate” variables representing model parameters (weights, θ) and sample/representation variables (x), with corresponding energy-like functionals.

A key construction is the ensemble description over model configuration space Θ: at each training step, the distribution q(θ) evolves under stochastic gradient-based updates, reflecting both initial randomness and data-induced variability. The driving functional is often an epistemic or Helmholtz free energy,

$F[q] = \mathbb{E}_q[\Phi(\theta)] - T H[q]$

where Φ is the loss (e.g., negative log-likelihood), T is an effective temperature (often set by the stochasticity of the updates or by the initialization), and H[q] is the Shannon entropy of q (Okanohara, 24 Jan 2026).
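As a minimal numerical sketch of this functional, the snippet below evaluates F[q] on a discretized one-dimensional parameter grid and compares a uniform ensemble with the Gibbs ensemble proportional to exp(−Φ/T). The double-well loss, the grid, and the value T = 0.5 are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

def free_energy(q, phi, T):
    """F[q] = E_q[phi] - T * H[q] for a discrete distribution q (entropy in nats)."""
    q = np.asarray(q, dtype=float)
    phi = np.asarray(phi, dtype=float)
    nz = q > 0  # convention: 0 * log 0 = 0
    entropy = -np.sum(q[nz] * np.log(q[nz]))
    return float(np.dot(q, phi) - T * entropy)

def gibbs(phi, T):
    """Gibbs distribution q*(theta) proportional to exp(-phi(theta) / T)."""
    phi = np.asarray(phi, dtype=float)
    weights = np.exp(-(phi - phi.min()) / T)  # shift by the minimum for stability
    return weights / weights.sum()

# Hypothetical toy setup: 1-D parameter grid with a double-well loss.
theta = np.linspace(-3.0, 3.0, 61)
phi = (theta ** 2 - 1.0) ** 2
T = 0.5

q_uniform = np.full(theta.size, 1.0 / theta.size)
q_star = gibbs(phi, T)

F_uniform = free_energy(q_uniform, phi, T)
F_star = free_energy(q_star, phi, T)
log_Z = np.log(np.sum(np.exp(-phi / T)))

print(f"F[uniform] = {F_uniform:.4f}")
print(f"F[gibbs]   = {F_star:.4f}")
print(f"-T log Z   = {-T * log_Z:.4f}")
```

On a finite grid the Gibbs ensemble minimizes F, and its free energy equals −T log Z, so the last two printed values should agree up to rounding.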

Learning is then a transport process in parameter space: as q(θ) evolves from its initial configuration to one concentrated near learned minima, uncertainty and entropy are irreversibly lost, and the reduction in free energy is recorded as dissipative cost. The irreversible entropy production (EP) along the learning trajectory quantifies the minimal dissipative work required to realize the observed information gain. In continuous-time regimes, this is determined by the Benamou–Brenier action in Wasserstein space, showing that every nontrivial learning process incurs a path-dependent thermodynamic cost (Okanohara, 24 Jan 2026, Parsi, 2023).

In energy-based or probabilistic models, the observable “states” (data samples or representations) are associated with Gibbs/Boltzmann distributions under the learned energy landscape, giving a direct physical correspondence between loss minimization and thermodynamic equilibration (Alemi et al., 2018, Parsi, 2023, Salazar, 2021).

2. Information-Theoretic Metrics: Learned vs. Memorized Information

Thermodynamics of learning distinguishes between two principal channels of information flow, which trace the model’s acquisition and internalization of data structure:

Memorized Information (M-info): Quantifies information captured and retained in the model parameters (Θ), formally $I(\Theta;B)$ where B represents the data source (Parsi, 2023).
Learned Information (L-info): Measures the information extractable about the data by sampling from the current model, $I(X;\Theta)$, with X a sample from $p_\Theta(x)$.

The parameter trajectory (Θ) serves as a heat reservoir, whose entropy increase precisely matches the quantity of memorized information. The L-info, in contrast, reflects the model’s effective knowledge, as measured by the reduction in conditional entropy $I(X;\Theta) = S_X - S_{X|\Theta}$. The stochastic thermodynamic formulation yields a fluctuation theorem linking the total entropy production of the observable subsystem (X) across a training run to the net accumulation of L-info:

$\Delta I_{X;\Theta} = \Sigma_X - \Sigma_{X|\Theta}$

with $\Sigma_X$ the EP combining both model and sample updates, and $\Sigma_{X|\Theta}$ the (possibly vanishing) conditional EP (Parsi, 2023). Every bit of L-info acquired is underpinned by an equivalent measure of irreversibility (dissipation) in the learning trajectory.
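The identity $I(X;\Theta) = S_X - S_{X|\Theta}$ used above can be checked on a small discrete example. The sketch below is illustrative only: the joint tables are hypothetical, and it computes the mutual information itself, not the entropy-production terms of the fluctuation theorem.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

def learned_information(joint):
    """I(X;Theta) = S_X - S_{X|Theta} in bits, with joint[i, j] = P(Theta=i, X=j)."""
    joint = np.asarray(joint, dtype=float)
    p_theta = joint.sum(axis=1)
    p_x = joint.sum(axis=0)
    s_x = entropy_bits(p_x)
    # Chain rule: S_{X|Theta} = S_{X,Theta} - S_Theta
    s_x_given_theta = entropy_bits(joint) - entropy_bits(p_theta)
    return s_x - s_x_given_theta

# Hypothetical joint tables over two parameter settings and two sample values.
joint_independent = np.outer([0.5, 0.5], [0.3, 0.7])  # samples ignore the parameters
joint_correlated = np.array([[0.5, 0.0],
                             [0.0, 0.5]])             # samples reveal the parameters

I_independent = learned_information(joint_independent)
I_correlated = learned_information(joint_correlated)

print(f"independent: {I_independent:.6f} bits")
print(f"correlated:  {I_correlated:.6f} bits")
```

When samples are independent of the parameters the L-info vanishes, and when each parameter setting determines the sample exactly it equals the full sample entropy (1 bit here).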

3. Irreversibility, Entropy Production, and the Epistemic Speed Limit

Learning is generically an irreversible, non-equilibrium process. The reduction of epistemic free energy along any finite-time trajectory necessarily incurs entropy production. The irreversibility is rooted in the finite rate of ensemble transport in parameter space, quantified by the path-integral

$\Sigma_{0:1} = \int_0^1 \sigma_s ds,\qquad \sigma_s = \int_\Theta q_s(\theta) \|v_s(\theta)\|^2 d\theta$

where $v_s$ is the velocity field generating ensemble evolution.

A fundamental bound, the Epistemic Speed Limit (ESL), constrains how efficiently learning can occur: to move from initial $q_0$ to final $q_1$ requires at a minimum

$\Sigma_{0:1} \geq W_2(q_0, q_1)^2$

with $W_2$ the 2-Wasserstein distance between ensembles (Okanohara, 24 Jan 2026). Here time is normalized to the unit interval; if the same transport is compressed into a duration $\tau$, the Benamou–Brenier bound becomes $W_2(q_0, q_1)^2/\tau$, so faster learning (shorter allowed time) increases the minimal required dissipation. Structured curriculum procedures or distillation algorithms that guide learning along near-geodesic (constant-speed, low-dissipation) paths in distribution space are thus naturally more efficient.
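The bound can be illustrated with one-dimensional Gaussian ensembles, for which everything is available in closed form. The sketch below rests on the following assumptions: the ensemble stays Gaussian with mean m(s) and standard deviation σ(s), transport is by the affine velocity field $v_s(\theta) = m'(s) + (\sigma'(s)/\sigma(s))(\theta - m(s))$, so that $\sigma_s = m'(s)^2 + \sigma'(s)^2$, and $W_2^2 = (m_0 - m_1)^2 + (\sigma_0 - \sigma_1)^2$. The endpoint values and the two schedules are hypothetical.

```python
import numpy as np

def w2_squared_gaussian_1d(m0, s0, m1, s1):
    """Closed-form squared 2-Wasserstein distance between N(m0, s0^2) and N(m1, s1^2)."""
    return (m0 - m1) ** 2 + (s0 - s1) ** 2

def entropy_production(schedule, m0, s0, m1, s1, n_steps=2000):
    """Sigma_{0:1} for a Gaussian path whose (mean, std) move linearly in u = schedule(s).

    schedule maps [0, 1] to [0, 1] with schedule(0) = 0 and schedule(1) = 1.
    The rate m'(s)^2 + std'(s)^2 is approximated by finite differences on each step.
    """
    s = np.linspace(0.0, 1.0, n_steps + 1)
    u = schedule(s)
    mean = m0 + u * (m1 - m0)
    std = s0 + u * (s1 - s0)
    ds = np.diff(s)
    rate = (np.diff(mean) / ds) ** 2 + (np.diff(std) / ds) ** 2
    return float(np.sum(rate * ds))

# Hypothetical endpoints: a broad initial ensemble and a concentrated final one.
m0, s0, m1, s1 = 0.0, 2.0, 3.0, 0.5

bound = w2_squared_gaussian_1d(m0, s0, m1, s1)
sigma_geodesic = entropy_production(lambda s: s, m0, s0, m1, s1)     # constant speed
sigma_uneven = entropy_production(lambda s: s ** 2, m0, s0, m1, s1)  # slow start, fast finish

print(f"W2^2 bound        : {bound:.4f}")
print(f"constant speed    : {sigma_geodesic:.4f}")
print(f"uneven speed      : {sigma_uneven:.4f}")
```

Both schedules trace the same geodesic in distribution space; only the speed profile differs. The constant-speed schedule saturates the bound, while the uneven one pays roughly 4/3 of it, which is the sense in which pacing alone affects dissipation in this toy setting.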

These geometric constraints have concrete consequences: after a concentrated learning phase (low-entropy ensemble), adaptation to a substantially new task necessarily incurs large EP, constraining realistic rates of continual or transfer learning.

4. Universal Thermodynamic Bounds on Learning Efficiency

A sequence of stochastic thermodynamic results establishes that the information acquired by a learning process is universally bounded by the thermodynamic cost incurred. For generic neural or probabilistic networks, the following bounds hold:

The information–cost bound: $I(\text{labels; outputs}) \leq \Delta S(\text{weights}) + Q$ (heat dissipation) (Goldt et al., 2016, Goldt et al., 2017).
Learning efficiency: $\eta = \frac{\text{information acquired}}{\text{thermodynamic cost}} \leq 1$.
Efficiency is saturated only in the quasi-static, reversible limit (infinitely slow learning, no heat loss) and for statistically optimal learning protocols.

Recent work has derived even stronger bounds for multi-component, dissipative learning systems, using refined inequalities (Cauchy–Schwarz, dynamical activity) that show $\eta$ is strictly less than unity, with the deficit determined by the structure and rate of transitions in the learning dynamics (Su et al., 2022, Li et al., 2023).

In quantum and classical implementations, the minimal achievable work per bit of learning is determined by energy–entropy relations; for error probability $\epsilon$, the work required for accurate decision/learning scales as $k_B T \ln(1/\epsilon)$, achieving the Landauer limit only in idealized physical systems (Milburn et al., 2022, Zhao et al., 9 Apr 2025).
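To give a sense of scale, the following sketch evaluates $k_B T \ln(1/\epsilon)$ at room temperature and compares it with the Landauer energy $k_B T \ln 2$. It assumes a prefactor of exactly one in the scaling law and T = 300 K, both chosen for illustration; the cited works may carry different constants.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact in the 2019 SI)

def work_scale(epsilon, temperature=300.0):
    """Work scale k_B * T * ln(1/epsilon) in joules for error probability epsilon."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    return K_B * temperature * math.log(1.0 / epsilon)

def landauer_energy(temperature=300.0):
    """Landauer energy k_B * T * ln 2 in joules."""
    return K_B * temperature * math.log(2.0)

for eps in (0.5, 1e-2, 1e-3, 1e-6):
    w = work_scale(eps)
    ratio = w / landauer_energy()
    print(f"epsilon = {eps:g}: {w:.3e} J ({ratio:.2f} x k_B T ln 2)")
```

At $\epsilon = 1/2$ the expression reduces to the Landauer energy, and each further factor of two in reliability adds one more unit of $k_B T \ln 2$, since the ratio equals $\log_2(1/\epsilon)$.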

5. Thermodynamic Geometry and Capacity Limits in Continual Learning

Finite-time, compositional learning dynamics shrink the geometric “support volume” of dynamically accessible configurations. Each sequential learning phase acts as a volume-contracting map; the semigroup property ensures the effective rank, or “reconfiguration dimension,” can only decrease. This sets a fundamental, trajectory-level capacity limit for continual learning:

The compatible effective rank $\mathcal{R}_A(t)$ captures the volume of task-preserving directions remaining after learning task A.
The capacity-threshold criterion states: to learn task B after A without forgetting A, the stable rank $m_B$ of the new task’s Hessian must satisfy $m_B \leq \mathcal{R}_A(t)$. If $m_B > \mathcal{R}_A(t)$, any sufficient adaptation will necessarily incur catastrophic forgetting, i.e., loss of the previously acquired task (Okanohara, 8 Feb 2026).
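The threshold test can be sketched numerically under two assumptions that go beyond the text: the stable rank is taken in its standard matrix sense, $\|H\|_F^2 / \|H\|_2^2$, which may differ in detail from the cited definition, and the compatible effective rank $\mathcal{R}_A(t)$ is supplied as a known number, since its computation is not specified here. The Hessians and the value 3.0 are hypothetical.

```python
import numpy as np

def stable_rank(H):
    """Stable rank ||H||_F^2 / ||H||_2^2 (squared Frobenius over squared spectral norm)."""
    H = np.asarray(H, dtype=float)
    spectral = np.linalg.norm(H, 2)  # largest singular value
    if spectral == 0.0:
        return 0.0
    return float(np.sum(H ** 2) / spectral ** 2)

def within_capacity(hessian_B, compatible_rank_A):
    """Check the criterion m_B <= R_A(t) for a given (assumed known) R_A(t)."""
    return stable_rank(hessian_B) <= compatible_rank_A

# Hypothetical task-B Hessians in a 4-parameter model.
H_narrow = np.diag([4.0, 4.0, 0.0, 0.0])  # curvature in two directions only
H_broad = np.diag([1.0, 1.0, 1.0, 1.0])   # curvature in all four directions
R_A = 3.0                                  # assumed compatible effective rank after task A

m_narrow = stable_rank(H_narrow)
m_broad = stable_rank(H_broad)

print(f"narrow task: m_B = {m_narrow:.1f}, within capacity: {within_capacity(H_narrow, R_A)}")
print(f"broad task:  m_B = {m_broad:.1f}, within capacity: {within_capacity(H_broad, R_A)}")
```

In this toy case the narrow task fits within the remaining capacity while the broad one exceeds it, which under the criterion above would imply forgetting of task A.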

These results show that the trajectory-level, thermodynamically enforced loss of adaptability underlies the phenomenon of continual learning failure and “critical period closure,” not merely the lack of multi-task optimizers.

6. Applications and Consequences: Enhanced Models, Algorithm Design, and Physical Realization

The thermodynamic formalism enables new algorithmic recipes and regularization schemes:

Regularizers based on controlling entropy production (penalizing $Q_X$ or $\Sigma_{X|\Theta}$) can directly manage the efficiency and robustness of learning (Parsi, 2023).
Over-parameterization is interpreted as natural from a thermodynamic perspective: vast parameter spaces act as ideal heat reservoirs, enabling near-reversible, low-loss adaptation (Parsi, 2023).
Thermodynamic bounds clarify the optimal scheduling of learning rates and data presentation (e.g., in curriculum or distillation).
In the context of physical implementations, energy efficiency constraints bound the possible accuracy and learning rates of hardware learners, whether CMOS, photonic, or quantum (Milburn et al., 2022, Zhao et al., 9 Apr 2025).

Physically inspired models, such as the thermodynamic analysis of Restricted Boltzmann Machines or energy-based and self-supervised networks, explicitly leverage these insights to identify critical points, drive feature extraction, and inform the design of training protocols and network architectures (Funai et al., 2018, Decelle et al., 2018, Salazar, 2021).

7. Outlook: Fundamental Insights and Open Questions

The thermodynamics of learning unifies statistical inference, information theory, and non-equilibrium physics. This framework establishes that:

Every bit of useful information learned comes at an irreducible thermodynamic cost, determined by the geometry and speed of training.
There exist universal trade-offs between speed, accuracy, adaptability, and efficiency, governed by entropy production, dissipation, and the structure of parameter evolution.
Future directions include the design of learning algorithms with built-in thermodynamic optimality, exploration of hardware implementations achieving physical bounds, and addressing open theoretical challenges such as the scaling of accessible volume, multi-scale coarse-graining, and the role of quantum thermodynamic effects in learning (Okanohara, 24 Jan 2026, Parsi, 2023, Milburn et al., 2022, Okanohara, 8 Feb 2026).

The thermodynamic perspective continues to reveal foundational restrictions and new principles underlying all systems that learn from and adapt to their environment.