Neural Thermodynamic Laws for Large Language Model Training
Published 15 May 2025 in cs.LG, cs.AI, physics.data-an, and stat.ML | (2505.10559v1)
Abstract: Beyond neural scaling laws, little is known about the laws underlying LLMs. We introduce Neural Thermodynamic Laws (NTL) -- a new framework that offers fresh insights into LLM training dynamics. On the theoretical side, we demonstrate that key thermodynamic quantities (e.g., temperature, entropy, heat capacity, thermal conduction) and classical thermodynamic principles (e.g., the three laws of thermodynamics and the equipartition theorem) naturally emerge under river-valley loss landscape assumptions. On the practical side, this scientific perspective yields intuitive guidelines for designing learning rate schedules.
The paper introduces a framework that maps LLM loss landscapes to thermodynamic principles, decomposing losses into fast (thermal) and slow (river) dynamics.
The paper derives optimal learning rate decay schedules, showing that a 1/t decay effectively minimizes thermal loss and aligns with empirical trends in LLM training.
The paper demonstrates that entropic forces subtly influence optimizer trajectories, supporting the use of higher stable-phase learning rates to reduce final validation loss.
This paper, "Neural Thermodynamic Laws for LLM Training" (2505.10559), introduces a novel framework, Neural Thermodynamic Laws (NTL), that draws analogies between the training dynamics of LLMs and principles from classical thermodynamics. The core idea is to leverage this analogy to gain theoretical insights into LLM training, particularly concerning the complex learning rate schedules used in practice.
The framework is built upon the observation that the loss landscape of LLMs exhibits a "river-valley" structure. This structure is characterized by directions of high sharpness (valleys) and directions of flatness (rivers). Training dynamics can be decomposed into fast dynamics (rapid movement and fluctuations within the sharp valleys) and slow dynamics (gradual movement along the flat riverbed). This decomposition is analogous to the quasi-static equilibrium found in thermodynamics, where fast microscopic degrees of freedom (like gas molecules) equilibrate rapidly while macroscopic variables (like piston volume) change slowly. The paper maps the total loss $\ell$ to internal energy $U$, decomposing it into a fast component $\ell_f$ (thermal loss, analogous to heat $Q$) and a slow component $\ell_s$ (river loss, analogous to work $W$).
To formalize this, the authors propose a 2D toy model of the river-valley landscape: $\ell(x,y) = \frac{1}{2} a(y) x^2 + c(y)$. Here, $x$ is the fast variable and $y$ is the slow variable. $a(y)$ represents the sharpness of the valley at position $y$, and $c(y)$ is the loss at the bottom of the valley (the riverbed). The fast loss is $\ell_f(x,y) = \frac{1}{2} a(y) x^2$ and the slow loss is $\ell_s(y) = c(y)$.
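For concreteness, here is a minimal sketch of this toy landscape in code; the particular choices of $a(y)$ and $c(y)$ below are illustrative, not taken from the paper:

```python
import numpy as np

def sharpness(y):
    # a(y): valley sharpness at river position y (illustrative choice)
    return 2.0 + np.sin(y)

def riverbed(y):
    # c(y): loss at the valley bottom, i.e., the slow "river" loss (illustrative)
    return np.exp(-0.1 * y)

def loss(x, y):
    # total loss = fast (thermal) part + slow (river) part
    return 0.5 * sharpness(y) * x**2 + riverbed(y)

def fast_loss(x, y):
    return 0.5 * sharpness(y) * x**2   # l_f: fluctuations across the valley

def slow_loss(y):
    return riverbed(y)                 # l_s: progress along the river
```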
Fast Dynamics in Equilibrium (Stable Phase)
Focusing on the fast variable $x$ while holding $y$ fixed, the toy model reduces to a 1D quadratic loss $\ell_f(x) = \frac{1}{2} a x^2$. The paper analyzes the dynamics of Stochastic Gradient Descent (SGD) and Signed Gradient Descent (SignGD) on this loss with a fixed learning rate $\eta$ and gradient noise $\sigma_g$. Under certain conditions (specifically, $0 < a < 2/\eta$ for SGD), the dynamics converge to a Gaussian steady-state distribution $p(x)$ with standard deviation $\sigma$.
For SGD, in the flat limit ($a\eta \ll 1$), the standard deviation is approximately $\sigma_{\rm SGD} \approx \sqrt{\eta/(2a)}\,\sigma_g$.
For SignGD, in the flat limit ($a \ll \sigma_g/\eta$), it is approximately $\sigma_{\rm SignGD} \approx (\pi/8)^{1/4} \sqrt{\sigma_g \eta / a}$.
In both cases, $\sigma \propto \eta^{1/2} a^{-1/2} \sigma_g^{\,n}$ (where $n = 1$ for SGD, $n = 1/2$ for SignGD). This shows that $\sigma$ increases with $\eta$ and $\sigma_g$ and decreases with sharpness $a$. The average thermal loss is $\ell_f = \mathbb{E}[\frac{1}{2} a x^2] = \frac{1}{2} a \sigma^2$. Substituting the flat-limit approximations for $\sigma$, we find that $\ell_f$ becomes independent of $a$:
For SGD: $\ell_f^{\rm SGD} \approx \frac{1}{2} a \cdot \frac{\eta \sigma_g^2}{2a} = \frac{1}{4} \eta \sigma_g^2$.
For SignGD: $\ell_f^{\rm SignGD} \approx \frac{1}{2} a \cdot (\pi/8)^{1/2} \frac{\sigma_g \eta}{a} = \sqrt{\pi/32}\,\sigma_g \eta$.
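These steady-state formulas are easy to check numerically. Below is a minimal sketch (hyperparameter values are arbitrary) that simulates noisy SGD on the 1D quadratic loss for several sharpness values; the empirical standard deviation should track the flat-limit prediction $\sqrt{\eta/(2a)}\,\sigma_g$, and the average thermal loss should stay near $\frac{1}{4}\eta\sigma_g^2$ regardless of $a$:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, sigma_g, steps = 0.01, 1.0, 100_000

for a in [1.0, 4.0, 16.0]:          # three valley sharpnesses, all with a*eta << 1
    x, xs = 0.0, []
    for t in range(steps):
        grad = a * x + sigma_g * rng.standard_normal()  # noisy gradient
        x -= eta * grad                                  # SGD step
        if t > steps // 2:                               # discard burn-in
            xs.append(x)
    xs = np.array(xs)
    print(f"a={a:5.1f}  sigma={xs.std():.4f}  "
          f"theory={np.sqrt(eta / (2 * a)) * sigma_g:.4f}  "
          f"thermal_loss={0.5 * a * (xs**2).mean():.5f}  "
          f"theory={0.25 * eta * sigma_g**2:.5f}")
```

The printed thermal losses should be nearly identical across the three values of $a$, while the standard deviations shrink with sharpness.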
This independence of $\ell_f$ from sharpness $a$ is analogous to the equipartition theorem in thermodynamics, where energy is distributed equally among quadratic degrees of freedom regardless of their 'stiffness'. The paper interprets the learning rate $\eta$ as an effective temperature ($T \sim \eta$) because the thermal loss $\ell_f$ scales linearly with $\eta$. The rate of change of thermal loss with respect to $\eta$, $C \equiv \partial \ell_f / \partial \eta$, is interpreted as heat capacity.
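For reference, the classical statement being mirrored here (a standard thermodynamics fact, not a quote from the paper) sits side by side with the SGD result:

```latex
\underbrace{\langle E \rangle = \tfrac{1}{2} k_B T}_{\text{per quadratic d.o.f., any stiffness}}
\qquad \longleftrightarrow \qquad
\underbrace{\langle \ell_f \rangle \approx \tfrac{1}{4}\,\eta\,\sigma_g^2}_{\text{per valley direction, any sharpness } a}
```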
Practical Implication: The average loss contributed by fast dynamics in any given direction in the valley is roughly proportional to the learning rate and independent of how sharp that direction is. This suggests that reducing the learning rate is a direct way to reduce the fluctuations and the corresponding loss contribution in the valley directions. This also explains why the validation loss at the end of the decay phase in LLM training often scales linearly with the final learning rate $\eta_{\min}$, as shown in empirical experiments on GPT-2 [(2505.10559), Figure 2b].
Fast Dynamics in Annealing (Decay Phase)
During the decay phase of training, the learning rate $\eta$ is gradually reduced. As $\eta$ decreases, the standard deviation $\sigma$ of the fast variable's distribution is expected to decrease, reducing the thermal loss $\ell_f$. However, $\eta$ also acts as a step size, influencing the speed at which the system can adapt to the changing 'temperature'. If $\eta$ decays too quickly, the system might not maintain equilibrium, leading to a sub-optimal reduction in $\ell_f$.
The paper derives an optimal learning rate decay schedule for the 1D quadratic toy model that minimizes $\sigma_t$ at each step $t$. Starting from an initial learning rate $\eta$, the optimal schedule is found to be approximately $\eta_t \propto 1/t$; specifically, $\eta_t = \frac{\eta/2}{1 + t/t_h}$, where $t_h$ is a characteristic time scale depending on $a$, $\eta$, and $\sigma_g$. For SGD, $t_h^{\rm SGD} = 2/(a\eta)$, and for SignGD, $t_h^{\rm SignGD} = \sqrt{2\pi}\,\sigma_g/(a\eta)$. This schedule predicts that decaying to $\eta = 0$ would take infinite time, implying that decaying to zero in finite time might be suboptimal.
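A minimal sketch of this per-step optimization for SGD (values illustrative): at each step, choose the $\eta$ that minimizes the next-step variance of the recursion $\sigma_{t+1}^2 = (1-\eta a)^2 \sigma_t^2 + \eta^2 \sigma_g^2$, and compare the resulting greedy schedule with the closed-form $1/t$ law:

```python
import numpy as np

a, eta0, sigma_g, T = 1.0, 0.1, 1.0, 1000
t_h = 2.0 / (a * eta0)                    # characteristic time t_h^SGD

var = eta0 * sigma_g**2 / (2 * a)         # start from equilibrium at eta0
greedy, closed = [], []
for t in range(T):
    # argmin over eta of (1 - eta*a)^2 * var + eta^2 * sigma_g^2
    eta = a * var / (a**2 * var + sigma_g**2)
    var = (1 - eta * a)**2 * var + eta**2 * sigma_g**2
    greedy.append(eta)
    closed.append((eta0 / 2) / (1 + t / t_h))   # flat-limit 1/t schedule

print(f"t=0:   greedy={greedy[0]:.5f}  closed-form={closed[0]:.5f}")
print(f"t=999: greedy={greedy[-1]:.5f}  closed-form={closed[-1]:.5f}")
```

The two schedules agree up to flat-limit corrections of order $a\eta$; note the immediate drop to roughly $\eta/2$ at the start of the decay.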
The evolution of thermal loss under a sudden drop in learning rate (from $\eta_A$ to $\eta_B$) is shown to follow an exponential decay towards the new equilibrium loss $\ell_{\rm eq}(\eta_B)$, similar to Fourier's law of thermal conduction ($Q \propto T_A - T_B$). This process is also related to the second law of thermodynamics, as the system cannot reach a thermal loss lower than the equilibrium loss at the final temperature $\eta_B$.
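This relaxation can be seen directly in the 1D variance recursion (a sketch with illustrative values): after the drop, the excess thermal loss above $\ell_{\rm eq}(\eta_B)$ shrinks by a constant factor $(1-\eta_B a)^2$ per step, i.e., exponentially:

```python
a, sigma_g = 1.0, 1.0
eta_A, eta_B = 0.2, 0.02

var = eta_A * sigma_g**2 / (a * (2 - a * eta_A))      # exact equilibrium at eta_A
var_eq = eta_B * sigma_g**2 / (a * (2 - a * eta_B))   # exact equilibrium at eta_B
prev = None
for t in range(6):
    excess = 0.5 * a * (var - var_eq)                 # thermal loss above equilibrium
    ratio = excess / prev if prev else float("nan")
    print(f"t={t}: excess={excess:.3e}  ratio={ratio:.4f}")
    prev = excess
    var = (1 - eta_B * a)**2 * var + eta_B**2 * sigma_g**2
# ratio converges to (1 - eta_B * a)**2 = 0.9604: Newton-cooling-style decay
```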
Practical Implication: The optimal $1/t$ decay schedule provides a theoretical basis for non-linear decay schedules observed to perform well in practice (e.g., 1-sqrt decay). It suggests that the rate of decay should depend on the sharpness $a$ and noise level $\sigma_g$. Furthermore, decaying the learning rate too quickly can prevent the system from reaching the desired low thermal loss, leading to higher final validation loss. Empirical results in the paper's appendix support that starting the decay too late (shortening the decay phase) can be detrimental.
River Dynamics and Entropic Forces
The fast dynamics in the valley can influence the slow dynamics along the river. In the 2D toy model $\ell(x,y) = \frac{1}{2} a(y) x^2 + c(y)$, the equilibrium distribution over $x$ at a fixed $y$ depends on $a(y)$. The slow variable $y$ experiences a force composed of the gradient of the slow loss $c(y)$ ($F_{\rm btm} = -c'(y)$) and an effective entropic force arising from the $y$-gradient of the averaged fast loss:
$F_{\rm ent} = -\partial \langle \ell_f \rangle / \partial y = -\frac{1}{2} a'(y) \langle x^2 \rangle = -\frac{1}{2} a'(y) \sigma(y)^2$.
Substituting $\sigma(y)^2 \propto \eta \sigma_g^{\,n} / a(y)$, we get $F_{\rm ent} \propto -\eta \sigma_g^{\,n} \frac{a'(y)}{a(y)}$.
The term $\frac{a'(y)}{a(y)}$ can be written as $\frac{d}{dy} \log a(y)$, suggesting that $-\frac{1}{2} \log a(y)$ can be related to entropy. Specifically, the entropy of the Gaussian distribution $p(x)$ scales as $S \propto \log \sigma \propto -\frac{1}{2} \log a(y)$. The entropic force is proportional to the gradient of $-\log a(y)$, pushing the slow variable $y$ towards regions of lower sharpness and higher entropy (larger $a(y)$ means a sharper valley, so the force points toward smaller $a(y)$, i.e., flatter regions).
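A minimal numerical sketch of this effect (all values illustrative, assuming the 2D toy model with a flat riverbed so that only the entropic force acts on $y$): with $a(y) = e^{ky}$ and $k > 0$, the valley narrows as $y$ grows, and noisy SGD should drift $y$ backward toward flatter territory even though $c'(y) = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
eta, sigma_g, k, steps = 0.01, 1.0, 0.5, 100_000

def a(y):
    return np.exp(k * y)      # sharpness grows with y: the valley narrows ahead

x, y = 0.0, 0.0               # riverbed c(y) = const, so c'(y) = 0
for _ in range(steps):
    gx = a(y) * x + sigma_g * rng.standard_normal()   # noisy fast gradient
    gy = 0.5 * k * a(y) * x**2                        # exact d(loss)/dy here
    x, y = x - eta * gx, y - eta * gy

# Predicted mean drift per step: -eta * k * eta * sigma_g**2 / 4 (a(y) cancels)
print(f"final y = {y:.2f}   predicted = {-steps * k * eta**2 * sigma_g**2 / 4:.2f}")
```

Note that the per-step drift is independent of $a(y)$, another appearance of equipartition: the sharpness cancels between $a'(y) = k\,a(y)$ and $\sigma(y)^2 \propto 1/a(y)$.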
Practical Implication: Entropic forces can influence the path taken along the riverbed. If the river runs through a region where the valley widens rapidly ($a'(y) < 0$), the entropic force pushes the optimizer forward along the river. If the valley narrows rapidly ($a'(y) > 0$), the entropic force pushes backward. This can lead to "entropic trapping", where the optimizer gets stuck if the entropic force opposes the gradient of the slow loss. The magnitude of this force depends on $\eta$ and $\sigma_g$.
The paper tests for the existence of entropic forces in LLM training by comparing loss curves for schedules with different stable learning rates but the same total $\eta$ sum. In the gradient flow limit ($\eta \to 0$), only the $\eta$ sum matters for the final state of the slow variables; any misalignment observed at finite $\eta$ indicates entropic forces. The LLM experiments show relatively good alignment, suggesting that entropic forces are small but slightly negative (pushing towards flatter regions), corresponding to a slightly narrowing valley structure along the river, on average, in the early training phase.
Summary of Learning Rate Roles and Final Loss
The learning rate $\eta$ plays multiple roles:
Temperature/Gaussian Width: Controls the magnitude of fluctuations in valley directions ($\sigma \propto \eta^{1/2}$).
Entropic Force Magnitude: Scales the entropic force ($\propto \eta \sigma_g^{\,n}$).
Time Scale/Step Size: Controls the speed of movement along both valley and river directions.
The final training loss is influenced by the total learning rate sum ($D$, which primarily affects the slow loss $\ell_s$) and the final learning rate $\eta_{\min}$ (which controls the final thermal loss $\ell_f$), plus potential corrections from entropic forces ($\Delta_{\rm entropic}$) and insufficient annealing ($\Delta_{\rm anneal}$):
$\ell_{\rm final} \approx \ell(D, \eta_{\min}) + \Delta_{\rm entropic} + \Delta_{\rm anneal}$.
To reduce the final loss, one can reduce $\eta_{\min}$ or increase the total $\eta$ sum $D$. Reducing $\eta_{\min}$ requires a longer decay phase to avoid insufficient annealing ($\Delta_{\rm anneal}$). Increasing $D$ can be achieved by increasing the stable-phase learning rate $\eta$. The paper's experiments suggest that a larger stable-phase $\eta$ can lead to lower loss (due to increased $D$) without significantly increasing the required decay time (since $T_d \sim 1/\eta_{\min}$ is independent of the stable $\eta$) or introducing large entropic forces (which were found to be small). This supports the use of potentially higher stable-phase learning rates than currently common, provided numerical stability is maintained.
Implementation Considerations and Limitations
The analysis provides practical intuition for designing learning rate schedules (a minimal schedule sketch follows the list below):
The stable phase helps accumulate the $\eta$ sum ($D$) for progress along the river.
The decay phase reduces $\eta$ to minimize the final thermal loss $\ell_f$.
The optimal decay is roughly $1/t$.
Decaying to $\eta_{\min} > 0$ might be better than decaying to $\eta_{\min} = 0$.
Increasing the stable-phase $\eta$ (within the limits of numerical stability and the edge of stability) might be an effective way to reduce final loss by increasing the $\eta$ sum.
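Putting these guidelines together, here is a minimal sketch of a warmup-stable-decay schedule with a $1/t$-style tail that lands on $\eta_{\min} > 0$; the function name, phase fractions, and learning rate values are illustrative assumptions, not recommendations from the paper:

```python
def wsd_lr(step: int, total_steps: int, eta_max: float = 3e-4,
           eta_min: float = 3e-5, warmup_frac: float = 0.01,
           decay_frac: float = 0.2) -> float:
    """Warmup-stable-decay schedule with a 1/t-style decay tail (illustrative)."""
    warmup_end = max(1, int(warmup_frac * total_steps))
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup_end:                    # linear warmup
        return eta_max * (step + 1) / warmup_end
    if step < decay_start:                   # stable phase: accumulate the eta sum
        return eta_max
    # Decay phase: eta(t) = eta_max / (1 + t / t_h), with t_h chosen so the
    # schedule reaches eta_min after the full decay window of T_decay steps.
    t = step - decay_start
    T_decay = total_steps - decay_start
    t_h = T_decay / (eta_max / eta_min - 1.0)
    return eta_max / (1.0 + t / t_h)
```

For instance, calling `wsd_lr(step, total_steps=10_000)` at every optimizer step yields the per-step learning rate; the $1/t$ tail here substitutes for the more common linear or cosine decay.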
The paper highlights several limitations of the current analysis, acknowledging that it uses a physicist's approach emphasizing intuition and simplification over mathematical rigor:
The derivations assume isotropic or 1D quadratic landscapes, whereas LLM loss landscapes are highly anisotropic and complex.
The river is treated as straight, but it is likely curved in practice.
Momentum and weight decay are ignored, which are crucial components of modern optimizers like Adam and AdamW.
The Gaussian approximation for steady states is not always strictly accurate.
Despite these simplifications, the framework provides valuable qualitative and semi-quantitative insights that align with empirical observations in LLM training and offer guiding principles for hyperparameter tuning and optimizer design. Future work could extend this framework to incorporate more realistic landscape features and optimizer complexities.