Bayesian Phase Transitions in RL
- Bayesian phase transitions are abrupt shifts in the posterior concentration over policy space, driven by training sample size and model complexity.
- The methodology leverages singular learning theory and the Local Learning Coefficient to dissect discrete changes in the regret landscape of deep RL agents.
- Empirical studies in stagewise RL validate the theory by showing distinct regret plateaus and sharp LLC jumps that indicate transitions between policy phases.
Bayesian phase transitions in reinforcement learning (RL) describe abrupt, qualitative changes in the concentration of the Bayesian posterior over policy space as the sample size or training time increases. This phenomenon, rooted in singular learning theory (SLT), hinges on the trade-off between policy accuracy (low regret) and complexity (as measured by the Local Learning Coefficient, LLC), and becomes evident when deep RL agents exhibit stagewise learning dynamics. A geometric lens—where the LLC is an invariant of the regret landscape—reveals that transitions in policy class are governed by the comparative interplay of regret and LLC, rather than regret alone.
1. Singular Learning Theory and the Bayesian Posterior
SLT generalizes classical statistical asymptotics to singular models, such as deep neural networks, where the Fisher information is degenerate and the posterior need not be Gaussian. The SLT framework establishes that the free energy (negative log-evidence) governing the Bayesian posterior mass near a policy parameter $w^*$ decomposes into two primary terms:

$$F_n(w^*) = n\beta L_n(w^*) + \lambda(w^*)\log n + O_p(\log\log n),$$

where $L_n(w^*)$ is the empirical regret (loss) at the minimum, $\beta$ is an inverse temperature (Section 2), and $\lambda(w^*)$ is the LLC, also known as the real log canonical threshold. In regular models, by the Bernstein–von Mises theorem, the posterior distribution converges to a Gaussian with variance scaling as $n^{-1}$. In singular RL models, competing minima can exist due to the complex geometry of the regret landscape, and the posterior shifts mass between them according to the free-energy trade-off (Elliott et al., 12 Jan 2026).
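The free-energy trade-off can be illustrated numerically. The sketch below compares the leading-order free energy $n\beta L + \lambda \log n$ for two hypothetical minima; all numerical values (regrets, LLCs, $\beta$) are illustrative assumptions, not figures from the paper.

```python
import math

def free_energy(n, regret, llc, beta=1.0):
    """Leading-order SLT free energy of a local minimum: n*beta*L + lambda*log(n)."""
    return n * beta * regret + llc * math.log(n)

# Illustrative values: a simple policy (higher regret, lower LLC)
# versus a complex policy (lower regret, higher LLC).
L1, lam1 = 0.5, 2.0
L2, lam2 = 0.1, 10.0

for n in (10, 100, 1000):
    f1, f2 = free_energy(n, L1, lam1), free_energy(n, L2, lam2)
    preferred = "simple" if f1 < f2 else "complex"
    print(f"n={n:5d}  F_simple={f1:8.2f}  F_complex={f2:8.2f}  -> {preferred}")
```

At small $n$ the $\lambda \log n$ term dominates and the simple minimum has lower free energy; as $n$ grows the $n\beta L$ term takes over and posterior preference flips to the complex, lower-regret minimum.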
2. Generalized Posterior in Deep Reinforcement Learning
Bayesian inference in RL is constructed via a generalized Gibbs posterior over policy parameters $\theta$, leveraging observed trajectory data from on- or off-policy rollouts. The population regret is defined as

$$L(\theta) = J^\star - \mathbb{E}_{\tau \sim p_\theta}[R(\tau)],$$

where $R(\tau)$ is the cumulative discounted return, $p_\theta$ denotes the trajectory distribution under policy $\pi_\theta$, and $J^\star$ is the optimal expected return. The empirical regret $L_n(\theta)$, estimated via importance sampling, forms the basis for the posterior:

$$p(\theta \mid D_n) \propto \varphi(\theta)\, \exp\!\big(-n\beta L_n(\theta)\big).$$

Here, $\beta > 0$ is an inverse temperature parameter and $\varphi(\theta)$ is a prior over $\theta$. SLT then predicts that the shape and concentration of $p(\theta \mid D_n)$ are governed locally by the LLC near each minimum of $L_n$.
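A minimal sketch of this construction, under assumed toy details (1-D Gaussian policies, a quadratic reward, self-normalized importance sampling): none of the specific choices below come from the paper, only the overall shape of the Gibbs posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Off-policy data: actions from a behavior policy a ~ N(0, 1), reward -(a - 2)^2.
actions = rng.normal(0.0, 1.0, size=20_000)
returns = -(actions - 2.0) ** 2

def j_hat(theta):
    """Self-normalized importance-sampling estimate of the return under pi_theta: a ~ N(theta, 1)."""
    logw = -(actions - theta) ** 2 / 2 + actions ** 2 / 2  # target/behavior log-density ratio
    w = np.exp(logw - logw.max())
    return np.sum(w * returns) / np.sum(w)

J_STAR = -1.0  # analytic optimum of this toy problem, attained at theta = 2

def log_gibbs(theta, n, beta=1.0, prior_scale=10.0):
    """Unnormalized log Gibbs posterior: log prior(theta) - n * beta * L_n(theta)."""
    regret = J_STAR - j_hat(theta)  # noisy estimate; may dip slightly below zero
    return -theta ** 2 / (2 * prior_scale ** 2) - n * beta * regret

# Posterior mass concentrates near the low-regret parameter theta = 2 as n grows.
thetas = np.linspace(-1, 4, 51)
for n in (1, 100):
    best = thetas[np.argmax([log_gibbs(t, n) for t in thetas])]
    print(f"n={n:4d}  argmax theta ~ {best:.1f}")
```

The design point is that $n$ multiplies the regret in the exponent, so larger sample sizes sharpen the posterior around low-regret parameters; this is the mechanism the free-energy asymptotics formalize.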
3. Local Learning Coefficient (LLC) and Free Energy Asymptotics
The LLC $\lambda(\theta^*)$ encodes the geometric complexity of the regret basin around a local minimum $\theta^*$:

$$\lambda(\theta^*) = \lim_{\epsilon \to 0^+} \frac{\log V(\epsilon)}{\log \epsilon}, \qquad V(\epsilon) = \operatorname{vol}\{\theta \in B(\theta^*) : L(\theta) - L(\theta^*) < \epsilon\},$$

where $B(\theta^*)$ is a small neighborhood of $\theta^*$. The LLC quantifies the "order" of vanishing of the regret at the minimum, serving as the complexity penalty in the free energy expansion. The main result (Theorem 2.3) for the RL Gibbs posterior states:
- Evidence asymptotics: $-\log Z_n = n\beta L_n(\theta^*) + \lambda(\theta^*)\log n + O_p(\log\log n)$
- Posterior expected loss: $\mathbb{E}_{\theta \sim p(\cdot \mid D_n)}[L_n(\theta)] = L_n(\theta^*) + \dfrac{\lambda(\theta^*)}{n\beta} + o(n^{-1})$
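The volume-scaling characterization of the LLC can be checked directly on toy losses where the answer is known in closed form. The sketch below uses Monte Carlo volume estimates; the paper instead estimates LLCs of deep policy networks with pSGLD, so this is only an illustration of the definition.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(-1.0, 1.0, size=2_000_000)  # uniform samples in a neighborhood of w* = 0

def llc_estimate(L, e1=1e-2, e2=1e-4):
    """Slope of log V(eps) against log eps, with V(eps) ~ eps^lambda near the minimum."""
    v1, v2 = np.mean(L < e1), np.mean(L < e2)  # Monte Carlo volume fractions
    return (np.log(v1) - np.log(v2)) / (np.log(e1) - np.log(e2))

lam_quadratic = llc_estimate(w ** 2)  # regular minimum: lambda = 1/2
lam_quartic = llc_estimate(w ** 4)    # degenerate minimum: lambda = 1/4
print(f"quadratic ~ {lam_quadratic:.2f}, quartic ~ {lam_quartic:.2f}")
```

Both losses have the same minimum value (zero), but the flatter quartic basin has a larger sublevel volume and hence a smaller LLC, illustrating how the LLC measures basin geometry rather than attained loss.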
The posterior will shift concentration from a simple (higher-regret, low-LLC) policy $\theta_1^*$ to a more complex (lower-regret, higher-LLC) policy $\theta_2^*$ as $n$ increases, at a critical sample size $n_{\mathrm{crit}}$ given by the free-energy crossing condition

$$n_{\mathrm{crit}}\,\beta\,\big(L(\theta_1^*) - L(\theta_2^*)\big) = \big(\lambda(\theta_2^*) - \lambda(\theta_1^*)\big)\log n_{\mathrm{crit}},$$

marking the Bayesian phase transition.
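The crossing condition is a fixed-point equation of the form $n = c \log n$, which can be solved by simple iteration. A minimal sketch, with illustrative parameter values that are assumptions rather than figures from the paper:

```python
import math

def n_crit(L1, lam1, L2, lam2, beta=1.0):
    """Solve n * beta * (L1 - L2) = (lam2 - lam1) * log(n) by fixed-point iteration."""
    c = (lam2 - lam1) / (beta * (L1 - L2))  # requires L1 > L2 and lam2 > lam1
    n = 10.0
    for _ in range(100):  # contraction n <- c * log(n); converges since c/n < 1 at the root
        n = c * math.log(n)
    return n

# Illustrative values: simple policy (L=0.5, lambda=2) vs complex policy (L=0.1, lambda=10).
print(n_crit(L1=0.5, lam1=2.0, L2=0.1, lam2=10.0))
```

Below the returned sample size the simple policy has lower free energy; above it, the complex policy does, which is the predicted transition point.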
4. Empirical Demonstration in Stagewise RL
These theoretical phase transitions were empirically studied in the "cheese-in-the-corner" (CITC) gridworld environment. The RL agent, implemented as a deep policy network (15 convolutional layers plus MLP, trained by vanilla REINFORCE), exhibits the following phasewise policy evolution:
- Phase 1: Random up/left actions (high regret, simple)
- Phase 2a: Deterministic up/left moves (still goal-location agnostic)
- Phase 2b: Deterministically toward top-left, transitioning if cheese encountered
- Phase 3: Direct, optimal path to cheese (lowest regret, highest complexity)
Regret and LLC were estimated via preconditioned SGLD (pSGLD), employing a strong Gaussian prior at selected training checkpoints. The observed dynamics form "opposing staircases": regret plateaus followed by sharp drops, while the LLC jumps upward at each transition. Per-phase averages show mean regret decreasing and mean LLC increasing monotonically from Phase 1 through Phase 2b to Phase 3.
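The estimator behind these measurements can be sketched on a toy problem. The code below runs plain (unpreconditioned) SGLD with a Gaussian localizing prior and uses the standard WBIC-style estimate $\hat\lambda = n\beta\,(\mathbb{E}_{\mathrm{post}}[L_n(w)] - L_n(w^*))$; the quadratic loss, step sizes, and temperature schedule are assumptions for illustration, not the paper's pSGLD configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

def llc_sgld(grad_L, L, w_star, n, beta, gamma=1.0, step=1e-4, n_steps=20_000):
    """Estimate lambda = n*beta*(mean posterior loss - minimum loss) via Langevin sampling."""
    w = w_star.copy()
    losses = []
    for _ in range(n_steps):
        # Drift of the localized tempered posterior: -grad of n*beta*L + localizer.
        drift = -n * beta * grad_L(w) - gamma * (w - w_star)
        w = w + 0.5 * step * drift + np.sqrt(step) * rng.normal(size=w.shape)
        losses.append(L(w))
    burn = n_steps // 2
    return n * beta * (np.mean(losses[burn:]) - L(w_star))

d, n = 4, 1000
beta = 1.0 / np.log(n)  # WBIC-style inverse temperature
L = lambda w: 0.5 * np.sum(w ** 2)
grad_L = lambda w: w
lam_hat = llc_sgld(grad_L, L, np.zeros(d), n, beta)
print(f"estimated LLC ~ {lam_hat:.2f} (exact for this regular minimum: d/2 = {d / 2})")
```

For this regular quadratic minimum the exact LLC is $d/2 = 2$, so the estimator can be sanity-checked; on singular deep-network losses no such closed form exists, which is why sampling-based estimates are used.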
This discrete staircasing empirically verifies the predicted Bayesian phase transitions (Elliott et al., 12 Jan 2026).
5. LLC as a Geometric Indicator and OOD Transition Detector
LLC estimation detects latent transitions between policy classes, even when regret-based metrics are blind to them. For instance, when regret is computed only over a subset of states (e.g., the cheese always in the corner), Phase 2b and Phase 3 policies both achieve zero regret, yet the LLC estimator sharply distinguishes the true transition. This demonstrates the LLC's sensitivity to the underlying basin geometry in parameter space, not just to behavioral performance.
6. Broader Implications and Research Extensions
The geometric regret–complexity view yields several implications:
- Bayesian RL posterior mass may prefer simpler, suboptimal policies when sample sizes are small, resulting in goal misgeneralization or reward hacking behaviors for finite datasets.
- Lower-regret policies can paradoxically be less Bayesian-optimal if they entail higher LLC (complexity penalty).
- LLC is implicated in safe/robust RL algorithm design, out-of-distribution failure prediction via state-marginalized LLCs, and understanding instrumental convergence in policy space.
- Extensions include layerwise LLC measurement (as in Wang et al., 2025), sensitizing LLC and posterior metrics to hyperparameter changes, relaxing deterministic MDP assumptions per Watanabe’s relative variance conditions, and connecting theoretical posterior concentration to empirically observed SGD stagewise dynamics.
This suggests a geometric theory of learning in RL, in which policy phase transitions are predictable via singular learning theory, supporting the development of theory-driven diagnostics and inductive-bias control in reinforcement learning systems (Elliott et al., 12 Jan 2026).