
Bayesian Phase Transitions in RL

Updated 19 January 2026
  • Bayesian phase transitions are abrupt shifts in the posterior concentration over policy space, driven by training sample size and model complexity.
  • The methodology leverages singular learning theory and the Local Learning Coefficient to dissect discrete changes in the regret landscape of deep RL agents.
  • Empirical studies in stagewise RL validate the theory by showing distinct regret plateaus and sharp LLC jumps that indicate transitions between policy phases.

Bayesian phase transitions in reinforcement learning (RL) describe abrupt, qualitative changes in the concentration of the Bayesian posterior over policy space as the sample size or training time increases. This phenomenon, rooted in singular learning theory (SLT), hinges on the trade-off between policy accuracy (low regret) and complexity (as measured by the Local Learning Coefficient, LLC), and becomes evident when deep RL agents exhibit stagewise learning dynamics. A geometric lens—where the LLC is an invariant of the regret landscape—reveals that transitions in policy class are governed by the comparative interplay of regret and LLC, rather than regret alone.

1. Singular Learning Theory and the Bayesian Posterior

SLT generalizes classical statistical asymptotics to singular models, such as deep neural networks, whose Fisher information is degenerate and whose posterior need not be asymptotically Gaussian. The SLT framework establishes that the free energy (negative log-evidence) governing the Bayesian posterior mass near a policy parameter $w_0$ decomposes into two primary terms:

$$\text{Free Energy} \simeq n\, \widehat{L}_n(w_0) + \lambda \log n + O(1)$$

where $\widehat{L}_n(w_0)$ is the empirical regret (loss) at the minimum and $\lambda$ is the LLC, also known as the real log canonical threshold. In regular models, by the Bernstein–von Mises theorem, the posterior converges to a Gaussian with variance scaling as $n^{-1}$. In singular RL models, competing minima can coexist owing to the complex geometry of the regret landscape, and the posterior shifts mass between them according to the free-energy trade-off (Elliott et al., 12 Jan 2026).
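The free-energy trade-off can be made concrete in a few lines. The following sketch compares two hypothetical minima, a simple high-regret basin and a complex low-regret one, with numbers that are purely illustrative (loosely echoing the empirical Phase 1 / Phase 3 estimates later in this article), and reports which basin the posterior prefers at different sample sizes:

```python
import math

def free_energy(n, regret, llc):
    # SLT free-energy approximation: F_n ≈ n * L_hat(w0) + lambda * log n
    return n * regret + llc * math.log(n)

# Hypothetical minima (illustrative numbers, not measured values).
simple = dict(regret=0.7, llc=30.0)
complex_basin = dict(regret=0.0, llc=560.0)

def preferred(n):
    # Lower free energy attracts the posterior mass.
    if free_energy(n, **simple) < free_energy(n, **complex_basin):
        return "simple"
    return "complex"

for n in (10, 1_000, 100_000):
    print(n, preferred(n))
```

At small $n$ the $\lambda \log n$ penalty dominates and the simple basin wins; at large $n$ the $n \widehat{L}_n$ term dominates and the posterior switches to the complex, lower-regret basin.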

2. Generalized Posterior in Deep Reinforcement Learning

Bayesian inference in RL is constructed via a generalized Gibbs posterior over policy parameters $w \in W$, using observed trajectory data $T_1, \dots, T_n$ from on- or off-policy rollouts. The population regret is defined as

$$G(w) = R^*_{\max} - \mathbb{E}_{T \sim q_w}[r(T)]$$

where $r(T)$ is the cumulative discounted return and $q_w$ denotes the trajectory distribution under policy $w$. The empirical regret $G_n(w)$, estimated via importance sampling, forms the basis for the posterior:

$$p_\beta(w \mid D_n) \propto \exp[-n\beta\, G_n(w)]\, \pi(w)$$

Here, $\beta > 0$ is an inverse-temperature parameter and $\pi(w)$ is a prior over $w$. SLT then predicts that the shape and concentration of $p_\beta$ are governed locally by the LLC near each minimum of $G(w)$.
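As a minimal sketch (not the paper's implementation), the Gibbs posterior over a finite set of candidate policies can be computed directly from their empirical regrets; the regret values, candidate set, and uniform prior below are all assumptions for illustration:

```python
import numpy as np

def gibbs_posterior(G_n, prior, n, beta=1.0):
    """Normalized Gibbs posterior p_beta(w | D_n) ∝ exp(-n*beta*G_n(w)) * pi(w),
    computed stably in log space."""
    log_w = -n * beta * G_n + np.log(prior)
    log_w -= log_w.max()          # shift to avoid underflow before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical empirical regrets for three candidate policies, uniform prior.
G_n = np.array([0.7, 0.4, 0.05])
prior = np.full(3, 1 / 3)

print(gibbs_posterior(G_n, prior, n=5))     # mass still spread across policies
print(gibbs_posterior(G_n, prior, n=500))   # mass concentrates on lowest regret
```

The log-space normalization is the standard numerical trick for Gibbs weights: $n\beta G_n(w)$ grows linearly in $n$, so naive exponentiation underflows quickly.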

3. Local Learning Coefficient (LLC) and Free Energy Asymptotics

The LLC $\lambda$ encodes the geometric complexity of the regret basin around a local minimum $w_0$:

$$\lambda(w_0) = \lim_{r \to 0^+} \lambda(B_r(w_0))$$

where $B_r(w_0)$ is an $r$-neighborhood of $w_0$. The LLC quantifies the order of vanishing of the regret at the minimum, serving as the complexity penalty in the free-energy expansion. The main result (Theorem 2.3) for the RL Gibbs posterior states:

  • Evidence asymptotics: $Z_{n,\beta}(U) \sim C(U, \beta)\, n^{-\lambda(U)} (\log n)^{m(U)-1}$
  • Posterior expected loss: $E_{p_\beta}[n G_n(w) \mid w \in U] = n G_n(w_0) + \lambda(U) \log n + O_p(1)$

The posterior shifts concentration from a simple (higher-regret, low-LLC) policy $w_1$ to a more complex (lower-regret, higher-LLC) policy $w_2$ as $n$ increases, at a critical sample size $n^*$ given by

$$n^* (G_1 - G_2) \simeq (\lambda_2 - \lambda_1) \log n^*$$

marking the Bayesian phase transition.
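Because $n^*$ appears on both sides of the condition, it can be approximated by a simple fixed-point iteration. The regret gap and LLC gap below are hypothetical, chosen to roughly match the Phase 1 → Phase 3 estimates reported later:

```python
import math

def critical_n(delta_G, delta_lambda, n0=10.0, iters=100):
    """Fixed-point iteration n <- (delta_lambda / delta_G) * log(n) for the
    transition condition n* (G1 - G2) = (lambda2 - lambda1) * log(n*)."""
    n = n0
    for _ in range(iters):
        n = (delta_lambda / delta_G) * math.log(n)
    return n

# Hypothetical gaps: regret drop G1 - G2 ≈ 0.7, LLC jump lambda2 - lambda1 ≈ 530.
n_star = critical_n(delta_G=0.7, delta_lambda=530.0)
print(round(n_star))  # sample size where the posterior switches basins
```

The iteration converges because the map $n \mapsto (\Delta\lambda/\Delta G)\log n$ is a contraction near the fixed point when $\Delta\lambda/(\Delta G \cdot n^*) < 1$.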

4. Empirical Demonstration in Stagewise RL

These theoretical phase transitions were empirically studied in the "cheese-in-the-corner" (CITC) gridworld environment. The RL agent, implemented as a deep policy network (15 convolutional layers plus MLP, trained by vanilla REINFORCE), exhibits the following phasewise policy evolution:

  • Phase 1: Random up/left actions (high regret, simple)
  • Phase 2a: Deterministic up/left moves (still goal-location agnostic)
  • Phase 2b: Deterministically toward top-left, transitioning if cheese encountered
  • Phase 3: Direct, optimal path to cheese (lowest regret, highest complexity)

Regret and LLC were estimated via preconditioned SGLD (pSGLD), employing a strong localizing Gaussian prior at selected training checkpoints. The observed dynamics show "opposing staircases": regret plateaus followed by sharp drops, while the LLC jumps upward at each transition. For runs with $\alpha = 0.68$, $\gamma = 0.975$:

| Phase    | Mean LLC         | Mean Regret |
|----------|------------------|-------------|
| Phase 1  | $30.9 \pm 3.4$   | $\sim 0.7$  |
| Phase 2b | $106.8 \pm 16.2$ | $\sim 0.4$  |
| Phase 3  | $561.5 \pm 40.5$ | $\sim 0$    |
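A toy version of the LLC estimation step can be sketched as follows. This is not the paper's pSGLD setup: it uses plain SGLD on a regular quadratic loss (whose true LLC is $d/2$) purely to illustrate the estimator $\hat\lambda = n\beta\,(\mathbb{E}[L_n(w)] - L_n(w_0))$; the loss, step size, and localization strength are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_llc(grad, loss, w0, n, beta, gamma, eps, steps):
    """Estimate lambda_hat = n*beta*(mean sampled loss - loss(w0)), sampling
    with SGLD localized at w0 by a Gaussian restraint of strength gamma."""
    w = w0.copy()
    losses = []
    for _ in range(steps):
        drift = n * beta * grad(w) + gamma * (w - w0)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        losses.append(loss(w))
    return n * beta * (float(np.mean(losses)) - loss(w0))

# Toy regular model: L(w) = ||w||^2 / 2 in d dimensions has true LLC = d/2.
d = 4
loss = lambda w: 0.5 * float(w @ w)
grad = lambda w: w
lam_hat = sgld_llc(grad, loss, np.zeros(d), n=1000, beta=1.0,
                   gamma=1.0, eps=1e-4, steps=20_000)
print(lam_hat)  # should land near d/2 = 2 (up to sampling and step-size bias)
```

In a regular model the estimator recovers the classical $d/2$ penalty; in singular models the same quantity recovers the (generally smaller per-parameter) LLC of the basin.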

This discrete staircasing empirically verifies the predicted Bayesian phase transitions (Elliott et al., 12 Jan 2026).

5. LLC as a Geometric Indicator and OOD Transition Detector

LLC estimation detects latent transitions between policy classes, even when regret-based metrics are blind. For instance, when regret is computed only over a subset of states (e.g., cheese always in the corner), Phase 2b and Phase 3 policies both achieve zero regret, but the LLC estimator sharply distinguishes the true transition. This demonstrates the LLC’s sensitivity to the underlying basin geometry in parameter space, not just behavioral performance.

6. Broader Implications and Research Extensions

The geometric regret–complexity view yields several implications:

  • At small sample sizes, Bayesian RL posterior mass may prefer simpler, suboptimal policies, producing goal misgeneralization or reward-hacking behavior on finite datasets.
  • Lower-regret policies can paradoxically be less Bayesian-optimal if they entail higher LLC (complexity penalty).
  • LLC is implicated in safe/robust RL algorithm design, out-of-distribution failure prediction via state-marginalized LLCs, and understanding instrumental convergence in policy space.
  • Extensions include layerwise LLC measurement (as in Wang et al., 2025), sensitizing LLC and posterior metrics to hyperparameter changes, relaxing deterministic MDP assumptions per Watanabe’s relative variance conditions, and connecting theoretical posterior concentration to empirically observed SGD stagewise dynamics.

This suggests a geometric theory of RL learning, where policy phase transitions are predictable via singular learning theory, supporting ambitious development of theory-driven diagnostics and inductive bias control in reinforcement learning systems (Elliott et al., 12 Jan 2026).
