Bayesian Phase Transitions in RL
- Bayesian phase transitions are abrupt shifts in the posterior concentration over policy space, driven by training sample size and model complexity.
- The methodology leverages singular learning theory and the Local Learning Coefficient to dissect discrete changes in the regret landscape of deep RL agents.
- Empirical studies in stagewise RL validate the theory by showing distinct regret plateaus and sharp LLC jumps that indicate transitions between policy phases.
Bayesian phase transitions in reinforcement learning (RL) describe abrupt, qualitative changes in the concentration of the Bayesian posterior over policy space as the sample size or training time increases. This phenomenon, rooted in singular learning theory (SLT), hinges on the trade-off between policy accuracy (low regret) and complexity (as measured by the Local Learning Coefficient, LLC), and becomes evident when deep RL agents exhibit stagewise learning dynamics. A geometric lens—where the LLC is an invariant of the regret landscape—reveals that transitions in policy class are governed by the comparative interplay of regret and LLC, rather than regret alone.
1. Singular Learning Theory and the Bayesian Posterior
SLT generalizes classical statistical asymptotics to singular models, such as deep neural networks, where the Fisher information is degenerate and the posterior need not be Gaussian. The SLT framework establishes that the free energy (negative log-evidence) governing the Bayesian posterior mass near a policy parameter $w^*$ decomposes into two primary terms:

$$F_n(w^*) = n\beta L_n(w^*) + \lambda(w^*)\log n + O_p(\log\log n),$$

where $L_n(w^*)$ is the empirical regret (loss) at the minimum, $\beta$ is an inverse temperature (Section 2), and $\lambda(w^*)$ is the LLC, also known as the real log canonical threshold. In regular models, by the Bernstein–von Mises theorem, the posterior distribution converges to a Gaussian with variance scaling as $n^{-1}$. In singular RL models, competing minima can exist due to the complex geometry of the regret landscape, and the posterior shifts mass between them according to the free-energy trade-off (Elliott et al., 12 Jan 2026).
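The free-energy trade-off can be illustrated numerically. The sketch below compares the leading-order free energy $n\beta L + \lambda \log n$ for two hypothetical minima; all numerical values (regrets, LLCs, $\beta$) are illustrative assumptions, not figures from the paper.

```python
import math

def free_energy(n, regret, llc, beta=1.0):
    """Leading-order SLT free energy of a local minimum: n*beta*L + lambda*log(n)."""
    return n * beta * regret + llc * math.log(n)

# Illustrative values: a simple policy (higher regret, lower LLC)
# versus a complex policy (lower regret, higher LLC).
L1, lam1 = 0.5, 2.0
L2, lam2 = 0.1, 10.0

for n in (10, 100, 1000):
    f1, f2 = free_energy(n, L1, lam1), free_energy(n, L2, lam2)
    preferred = "simple" if f1 < f2 else "complex"
    print(f"n={n:5d}  F_simple={f1:8.2f}  F_complex={f2:8.2f}  -> {preferred}")
```

At small $n$ the $\lambda \log n$ term dominates and the simple minimum has lower free energy; as $n$ grows the $n\beta L$ term takes over and posterior preference flips to the complex, lower-regret minimum.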
2. Generalized Posterior in Deep Reinforcement Learning
Bayesian inference in RL is constructed via a generalized Gibbs posterior over policy parameters $\theta$, leveraging observed trajectory data from on- or off-policy rollouts. The population regret is defined as

$$L(\theta) = J^\star - \mathbb{E}_{\tau \sim p_\theta}[R(\tau)],$$

where $R(\tau)$ is the cumulative discounted return, $p_\theta$ denotes the trajectory distribution under policy $\pi_\theta$, and $J^\star$ is the optimal expected return. The empirical regret $L_n(\theta)$, estimated via importance sampling, forms the basis for the posterior:

$$p(\theta \mid D_n) \propto \varphi(\theta)\, \exp\!\big(-n\beta L_n(\theta)\big).$$

Here, $\beta > 0$ is an inverse temperature parameter and $\varphi(\theta)$ is a prior over $\theta$. SLT then predicts that the shape and concentration of $p(\theta \mid D_n)$ are governed locally by the LLC near each minimum of $L_n$.
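A minimal sketch of this construction, under assumed toy details (1-D Gaussian policies, a quadratic reward, self-normalized importance sampling): none of the specific choices below come from the paper, only the overall shape of the Gibbs posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Off-policy data: actions from a behavior policy a ~ N(0, 1), reward -(a - 2)^2.
actions = rng.normal(0.0, 1.0, size=20_000)
returns = -(actions - 2.0) ** 2

def j_hat(theta):
    """Self-normalized importance-sampling estimate of the return under pi_theta: a ~ N(theta, 1)."""
    logw = -(actions - theta) ** 2 / 2 + actions ** 2 / 2  # target/behavior log-density ratio
    w = np.exp(logw - logw.max())
    return np.sum(w * returns) / np.sum(w)

J_STAR = -1.0  # analytic optimum of this toy problem, attained at theta = 2

def log_gibbs(theta, n, beta=1.0, prior_scale=10.0):
    """Unnormalized log Gibbs posterior: log prior(theta) - n * beta * L_n(theta)."""
    regret = J_STAR - j_hat(theta)  # noisy estimate; may dip slightly below zero
    return -theta ** 2 / (2 * prior_scale ** 2) - n * beta * regret

# Posterior mass concentrates near the low-regret parameter theta = 2 as n grows.
thetas = np.linspace(-1, 4, 51)
for n in (1, 100):
    best = thetas[np.argmax([log_gibbs(t, n) for t in thetas])]
    print(f"n={n:4d}  argmax theta ~ {best:.1f}")
```

The design point is that $n$ multiplies the regret in the exponent, so larger sample sizes sharpen the posterior around low-regret parameters; this is the mechanism the free-energy asymptotics formalize.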
3. Local Learning Coefficient (LLC) and Free Energy Asymptotics
The LLC $\lambda(\theta^*)$ encodes the geometric complexity of the regret basin around a local minimum $\theta^*$:

$$\lambda(\theta^*) = \lim_{\epsilon \to 0^+} \frac{\log V(\epsilon)}{\log \epsilon}, \qquad V(\epsilon) = \operatorname{vol}\{\theta \in B(\theta^*) : L(\theta) - L(\theta^*) < \epsilon\},$$

where $B(\theta^*)$ is a small neighborhood of $\theta^*$. The LLC quantifies the "order" of vanishing of the regret at the minimum, serving as the complexity penalty in the free energy expansion. The main result (Theorem 2.3) for the RL Gibbs posterior states:
- Evidence asymptotics: $-\log Z_n = n\beta L_n(\theta^*) + \lambda(\theta^*)\log n + O_p(\log\log n)$
- Posterior expected loss: $\mathbb{E}_{\theta \sim p(\cdot \mid D_n)}[L_n(\theta)] = L_n(\theta^*) + \dfrac{\lambda(\theta^*)}{n\beta} + o(n^{-1})$
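The volume-scaling characterization of the LLC can be checked directly on toy losses where the answer is known in closed form. The sketch below uses Monte Carlo volume estimates; the paper instead estimates LLCs of deep policy networks with pSGLD, so this is only an illustration of the definition.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(-1.0, 1.0, size=2_000_000)  # uniform samples in a neighborhood of w* = 0

def llc_estimate(L, e1=1e-2, e2=1e-4):
    """Slope of log V(eps) against log eps, with V(eps) ~ eps^lambda near the minimum."""
    v1, v2 = np.mean(L < e1), np.mean(L < e2)  # Monte Carlo volume fractions
    return (np.log(v1) - np.log(v2)) / (np.log(e1) - np.log(e2))

lam_quadratic = llc_estimate(w ** 2)  # regular minimum: lambda = 1/2
lam_quartic = llc_estimate(w ** 4)    # degenerate minimum: lambda = 1/4
print(f"quadratic ~ {lam_quadratic:.2f}, quartic ~ {lam_quartic:.2f}")
```

Both losses have the same minimum value (zero), but the flatter quartic basin has a larger sublevel volume and hence a smaller LLC, illustrating how the LLC measures basin geometry rather than attained loss.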
The posterior will shift concentration from a simple (higher-regret, low-LLC) policy $\theta_1^*$ to a more complex (lower-regret, higher-LLC) policy $\theta_2^*$ as $n$ increases, at a critical sample size $n_{\mathrm{crit}}$ given by the free-energy crossing condition

$$n_{\mathrm{crit}}\,\beta\,\big(L(\theta_1^*) - L(\theta_2^*)\big) = \big(\lambda(\theta_2^*) - \lambda(\theta_1^*)\big)\log n_{\mathrm{crit}},$$

marking the Bayesian phase transition.
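The crossing condition is a fixed-point equation of the form $n = c \log n$, which can be solved by simple iteration. A minimal sketch, with illustrative parameter values that are assumptions rather than figures from the paper:

```python
import math

def n_crit(L1, lam1, L2, lam2, beta=1.0):
    """Solve n * beta * (L1 - L2) = (lam2 - lam1) * log(n) by fixed-point iteration."""
    c = (lam2 - lam1) / (beta * (L1 - L2))  # requires L1 > L2 and lam2 > lam1
    n = 10.0
    for _ in range(100):  # contraction n <- c * log(n); converges since c/n < 1 at the root
        n = c * math.log(n)
    return n

# Illustrative values: simple policy (L=0.5, lambda=2) vs complex policy (L=0.1, lambda=10).
print(n_crit(L1=0.5, lam1=2.0, L2=0.1, lam2=10.0))
```

Below the returned sample size the simple policy has lower free energy; above it, the complex policy does, which is the predicted transition point.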
4. Empirical Demonstration in Stagewise RL
These theoretical phase transitions were empirically studied in the "cheese-in-the-corner" (CITC) gridworld environment. The RL agent, implemented as a deep policy network (15 convolutional layers plus MLP, trained by vanilla REINFORCE), exhibits the following phasewise policy evolution:
- Phase 1: Random up/left actions (high regret, simple)
- Phase 2a: Deterministic up/left moves (still goal-location agnostic)
- Phase 2b: Deterministically toward top-left, transitioning if cheese encountered
- Phase 3: Direct, optimal path to cheese (lowest regret, highest complexity)
Regret and LLC were estimated via preconditioned SGLD (pSGLD), employing a strong Gaussian prior at selected training checkpoints. The observed dynamics form "opposing staircases": regret plateaus followed by sharp drops, while the LLC jumps upward at each transition. Per-phase averages show mean regret decreasing and mean LLC increasing monotonically from Phase 1 through Phase 2b to Phase 3.
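The estimator behind these measurements can be sketched on a toy problem. The code below runs plain (unpreconditioned) SGLD with a Gaussian localizing prior and uses the standard WBIC-style estimate $\hat\lambda = n\beta\,(\mathbb{E}_{\mathrm{post}}[L_n(w)] - L_n(w^*))$; the quadratic loss, step sizes, and temperature schedule are assumptions for illustration, not the paper's pSGLD configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

def llc_sgld(grad_L, L, w_star, n, beta, gamma=1.0, step=1e-4, n_steps=20_000):
    """Estimate lambda = n*beta*(mean posterior loss - minimum loss) via Langevin sampling."""
    w = w_star.copy()
    losses = []
    for _ in range(n_steps):
        # Drift of the localized tempered posterior: -grad of n*beta*L + localizer.
        drift = -n * beta * grad_L(w) - gamma * (w - w_star)
        w = w + 0.5 * step * drift + np.sqrt(step) * rng.normal(size=w.shape)
        losses.append(L(w))
    burn = n_steps // 2
    return n * beta * (np.mean(losses[burn:]) - L(w_star))

d, n = 4, 1000
beta = 1.0 / np.log(n)  # WBIC-style inverse temperature
L = lambda w: 0.5 * np.sum(w ** 2)
grad_L = lambda w: w
lam_hat = llc_sgld(grad_L, L, np.zeros(d), n, beta)
print(f"estimated LLC ~ {lam_hat:.2f} (exact for this regular minimum: d/2 = {d / 2})")
```

For this regular quadratic minimum the exact LLC is $d/2 = 2$, so the estimator can be sanity-checked; on singular deep-network losses no such closed form exists, which is why sampling-based estimates are used.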
This discrete staircasing empirically verifies the predicted Bayesian phase transitions (Elliott et al., 12 Jan 2026).
5. LLC as a Geometric Indicator and OOD Transition Detector
LLC estimation detects latent transitions between policy classes, even when regret-based metrics are blind to them. For instance, when regret is computed only over a subset of states (e.g., the cheese always in the corner), Phase 2b and Phase 3 policies both achieve zero regret, yet the LLC estimator sharply distinguishes the true transition. This demonstrates the LLC's sensitivity to the underlying basin geometry in parameter space, not just to behavioral performance.
6. Broader Implications and Research Extensions
The geometric regret–complexity view yields several implications:
- Bayesian RL posterior mass may prefer simpler, suboptimal policies when sample sizes are small, resulting in goal misgeneralization or reward hacking behaviors for finite datasets.
- Lower-regret policies can paradoxically be less Bayesian-optimal if they entail higher LLC (complexity penalty).
- LLC is implicated in safe/robust RL algorithm design, out-of-distribution failure prediction via state-marginalized LLCs, and understanding instrumental convergence in policy space.
- Extensions include layerwise LLC measurement (as in Wang et al., 2025), sensitizing LLC and posterior metrics to hyperparameter changes, relaxing deterministic MDP assumptions per Watanabe’s relative variance conditions, and connecting theoretical posterior concentration to empirically observed SGD stagewise dynamics.
This suggests a geometric theory of learning in RL, in which policy phase transitions are predictable via singular learning theory, supporting the development of theory-driven diagnostics and inductive-bias control in reinforcement learning systems (Elliott et al., 12 Jan 2026).