Latent Policy Optimization via IIB-LPO

Updated 16 January 2026
  • The paper introduces a novel iteration of the information bottleneck that mitigates exploration collapse by enabling latent branching of reasoning trajectories.
  • It employs CVAE-based latent code sampling and pseudo self-attention injection to maintain concise, diverse outputs while avoiding over-optimization.
  • Empirical results on mathematical reasoning benchmarks demonstrate improved accuracy and diversity with reduced verbosity compared to entropy-based methods.

Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO) is an approach for enhancing exploration in reinforcement learning with verifiable rewards (RLVR), particularly in LLM reasoning tasks. It addresses the pervasive issue of "exploration collapse," where random rollouts become semantically homogeneous, leading to mode collapse and over-optimized, structurally identical reasoning paths. Unlike prior methods that rely on entropy regularization—which may induce reward hacking and vacuous verbosity—or token-selective updates, IIB-LPO shifts the exploration paradigm to topological branching of reasoning trajectories. The method leverages the Information Bottleneck (IB) principle as both a trajectory filter and a self-reward mechanism, promoting diversity and conciseness while avoiding overfitting to prompt wording or model biases (Deng et al., 9 Jan 2026).

1. Problem Motivation and Exploration Collapse

IIB-LPO is motivated by failures of existing RLVR methods in LLM reasoning. Random trajectory sampling in these settings leads to "mode collapse," where outputs are superficially different but structurally similar due to the strong inductive bias of pre-trained models. Entropy-based global regularization may encourage reward hacking, generating verbose yet vacuous answers, while local (token-level) entropy updates are overwhelmed by model priors. IIB-LPO circumvents these pitfalls by introducing explicit structural diversity through latent branching at points of maximal uncertainty, then selecting for informative and concise trajectories through an information bottleneck filter.

2. Theoretical Framework: Information Bottleneck and Entropy-Driven Branching

IIB-LPO treats each reasoning trajectory $r = (o_1, \ldots, o_T)$ as a bottleneck between prompt $q$ and final answer $a$. The IB objective formalizes the trade-off:

$L_{IB}(r) = I(q; r) - \beta I(r; a)$

where $I(\cdot;\cdot)$ denotes mutual information, and $\beta > 0$ weights informativeness against compression. The IB score is $S_{IB}(r) = -L_{IB}(r)$. Under standard autoregressive and correct-answer assumptions, $S_{IB}(r)$ can be approximated per-trajectory as:

$S_{IB}(r) \approx \sum_{t=1}^{T} A_t\, H(o_t \mid o_{<t}, q)$

where $A_t$ is the RL advantage at step $t$ and $H(\cdot)$ is the policy entropy. This auxiliary reward, averaged over a pruned set of top-scoring trajectories, augments the main policy objective. The total loss for policy parameters $\theta$ thus becomes:

$J(\theta) = J_{GRPO}(\theta) + \gamma \cdot \frac{1}{N} \sum_{r \in R^*} S_{IB}(r)$

where $J_{GRPO}$ denotes a PPO-like loss, $\gamma$ is the IB weight, $N$ is the number of retained trajectories, and $R^*$ is the IB-pruned set.
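The per-trajectory score and the augmented objective can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`token_entropy`, `ib_score`, `augmented_objective`) are hypothetical, entropies are measured in nats, and the next-token distributions and advantages are assumed to be already available from the rollout.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ib_score(advantages, step_distributions):
    """Approximate S_IB(r) as sum_t A_t * H(o_t | o_<t, q).

    advantages: one RL advantage A_t per step.
    step_distributions: one next-token probability list per step.
    """
    if len(advantages) != len(step_distributions):
        raise ValueError("need exactly one advantage per step")
    return sum(a * token_entropy(p)
               for a, p in zip(advantages, step_distributions))


def augmented_objective(j_grpo, retained_scores, gamma=0.003):
    """J = J_GRPO + gamma * mean(S_IB) over the retained set R*.

    gamma defaults to the IB self-reward weight quoted in this article.
    """
    if not retained_scores:
        return j_grpo
    return j_grpo + gamma * sum(retained_scores) / len(retained_scores)
```

A step with a deterministic distribution contributes nothing to the score, so only uncertain steps that also carry advantage are rewarded under this approximation.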

In contrast to statistical perturbations of output distributions, IIB-LPO implements topological exploration by bifurcating a single trajectory into $K > 1$ continuations at high-entropy states, selected as the top 5% highest-entropy steps in a rollout, identified with threshold $\tau$ at the 95th percentile over all steps. Contexts preceding these split points become seeds for further distinct continuations.
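The split-point rule can be illustrated as follows. This is a sketch under stated assumptions: `percentile` and `split_points` are hypothetical helper names, the percentile uses linear interpolation between order statistics, and a step counts as a split point only when its entropy is strictly above $\tau$. The source does not specify these details.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def split_points(entropies, q=95.0):
    """Return (tau, indices of steps whose entropy exceeds tau).

    entropies: per-step token entropies H_t for one rollout.
    """
    tau = percentile(entropies, q)
    return tau, [t for t, h in enumerate(entropies) if h > tau]
```

For each returned index $t^*$, the tokens before $t^*$ form the context $c_{t^*}$ from which the $K$ branches are continued.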

3. Methodology: Latent Branching, CVAE Sampling, and PSA Injection

Latent branching is triggered at entropy-identified split points. For each context $c_{t^*}$ preceding a high-entropy token in a base trajectory, $K$ latent codes $z^{(j)}$ are sampled from a pre-trained Conditional Variational Autoencoder (CVAE) prior $p(z \mid c_{t^*})$. Each $z^{(j)}$ is injected into the LLM using Pseudo Self-Attention (PSA):

  • Adaptive Norm Modulation: At each transformer layer $\ell$, weights are modulated as

$w_{\ell}' = w_{\ell} + \gamma(t) \cdot \text{Proj}_{\phi}(z^{(j)})$

where $\gamma(t)$ decays over time and $\text{Proj}_{\phi}$ aligns scale.

  • Augmented Attention: Latent code $z^{(j)}$ is concatenated with key-value projections:

$K'_j = [z^{(j)}\, K;\ K], \quad V'_j = [z^{(j)}\, V;\ V]$

and attention is computed as

$\text{PSA}(Q, K'_j, V'_j) = \text{softmax}\left(\frac{Q\, {K'_j}^{\top}}{\sqrt{d_k}}\right) V'_j$

PSA thereby directly steers deep-layer intermediate representations.
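The two injection steps above can be sketched in plain Python. Several pieces are assumptions made for illustration only: the latent code is mapped into key and value space by hypothetical matrices `W_zk` and `W_zv` (one reading of the $z^{(j)} K$ and $z^{(j)} V$ terms), a single latent slot is prepended, and `z_proj` stands in for the output of $\text{Proj}_{\phi}$. The source does not give these shapes.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_norm_weight(w, z_proj, gamma_t):
    """w' = w + gamma(t) * Proj(z), applied elementwise."""
    return [wi + gamma_t * pi for wi, pi in zip(w, z_proj)]


def psa_attention(Q, K, V, z, W_zk, W_zv):
    """Attention with one latent-derived key/value slot prepended.

    Q: (n, d_k), K: (m, d_k), V: (m, d_v), z: (d_z,)
    W_zk: (d_z, d_k), W_zv: (d_z, d_v)  -- hypothetical projections.
    """
    z_key = matmul([z], W_zk)[0]
    z_val = matmul([z], W_zv)[0]
    K_aug = [z_key] + K          # K' = [z-key; K]
    V_aug = [z_val] + V          # V' = [z-value; V]
    d_k = len(K[0])
    K_aug_T = [list(col) for col in zip(*K_aug)]
    scores = matmul(Q, K_aug_T)  # Q K'^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V_aug)
```

Because the latent slot competes with ordinary keys inside the same softmax, each query decides how much of the latent value to mix in, which is what distinguishes this from adding the latent to the input embeddings.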

The original (unsplit) trajectory is preserved, resulting in $K+1$ candidate continuations per rollout.

4. Trajectory Selection via IB Pruning and Self-Reward

After rollout and branching, the full set $R$ of candidate trajectories (with total size $M \cdot (K+1)$, where $M$ is the number of base rollouts) is evaluated using the IB score $S_{IB}(r)$. The top $N$ trajectories by descending $S_{IB}$ are retained as $R^*$. These scores are averaged and included as an auxiliary reward in the policy gradient update, biasing the optimization towards structurally diverse but concise and informative trajectories.

This selection mechanism achieves a dual function: it prunes verbose, low-information responses and rewards trajectories that effectively compress the prompt while preserving answer-relevant details. As a result, exploration is shaped not merely by statistical entropy but by substantive information-theoretic criteria.
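The pruning step amounts to a top-$N$ selection followed by an average. The sketch below assumes IB scores have already been computed for every candidate; `ib_prune` is a hypothetical name, and breaking ties by original order is an arbitrary choice not taken from the source.

```python
def ib_prune(trajectories, scores, n_keep):
    """Keep the n_keep highest-scoring trajectories.

    Returns (retained trajectories R*, mean IB score over R*).
    Ties keep their original relative order because sorted() is stable.
    """
    if len(trajectories) != len(scores):
        raise ValueError("need exactly one score per trajectory")
    ranked = sorted(zip(scores, range(len(trajectories))),
                    key=lambda pair: pair[0], reverse=True)
    keep = ranked[:n_keep]
    retained = [trajectories[i] for _, i in keep]
    mean_score = sum(s for s, _ in keep) / len(keep) if keep else 0.0
    return retained, mean_score
```

With the typical values reported for the method ($M = 4$, $K = 7$, $N = 8$), this would keep 8 of $4 \cdot (7 + 1) = 32$ candidates, and the returned mean is the quantity that is scaled by the IB self-reward weight.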

5. Implementation Details and Hyperparameters

IIB-LPO utilizes the following core variables and hyperparameters:

| Variable | Description | Typical Value |
|---|---|---|
| $q$ | Input prompt | - |
| $r$ | Token sequence $(o_1, \dots, o_T)$ | - |
| $M$ | Initial rollout count | 4 |
| $K$ | Branching factor | 7 |
| $N$ | Number kept after IB pruning | 8 |
| $H_t$ | Token entropy at step $t$ | - |
| $\tau$ | Entropy threshold (95th percentile) | Computed |
| $c_{t^*}$ | Context before split | - |
| $z$ | CVAE latent code ($d_z = 128$) | 128 |
| $\gamma(t)$ | PSA injection decay | $5 \cdot 10^{-2}$ to $5 \cdot 10^{-4}$ |
| $\beta$ | IB trade-off parameter in $L_{IB}$ | 2 |
| $A_t$ | RL advantage at step $t$ | - |
| $\gamma$ | IB self-reward weight | 0.003 |
| $\alpha$ | Explicit IB loss coefficient | 0.005 (if used) |
| Learning rate | Policy optimization | $1 \times 10^{-6}$ (Qwen2.5), $5 \times 10^{-6}$ (Qwen3) |
| Batch size | Optimization batch | 128 |
| Response length | Max tokens output | 8192 |
| Prompt length | Max prompt tokens | 2048 |
| PPO clip range | $\epsilon$ in $[0.2, 0.28]$ | - |

The approach is implemented on LLMs such as Qwen2.5-7B and Qwen3, with all training and injection mechanisms adhering to the modular structure outlined in the method.
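For orientation, the reported values can be collected into a small configuration object. This is only a sketch: the class and field names are hypothetical, and the geometric shape of the $\gamma(t)$ decay is an assumption, since the source gives only the start and end values and does not say whether $t$ indexes training steps or token positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IIBLPOConfig:
    num_rollouts: int = 4            # M
    branching_factor: int = 7        # K
    num_retained: int = 8            # N
    entropy_percentile: float = 95.0 # defines tau
    latent_dim: int = 128            # d_z
    psa_gamma_start: float = 5e-2    # gamma(t) at the start
    psa_gamma_end: float = 5e-4      # gamma(t) at the end
    ib_beta: float = 2.0             # beta in L_IB
    ib_reward_weight: float = 0.003  # gamma in J(theta)
    batch_size: int = 128
    max_response_tokens: int = 8192
    max_prompt_tokens: int = 2048

    @property
    def num_candidates(self):
        """Total candidates before pruning: M * (K + 1)."""
        return self.num_rollouts * (self.branching_factor + 1)


def psa_gamma(cfg, step, total_steps):
    """Assumed geometric decay from psa_gamma_start to psa_gamma_end."""
    if total_steps <= 1:
        return cfg.psa_gamma_start
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    ratio = cfg.psa_gamma_end / cfg.psa_gamma_start
    return cfg.psa_gamma_start * ratio ** frac
```

With the default values, `IIBLPOConfig().num_candidates` evaluates to 32, of which 8 survive IB pruning.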

6. Empirical Performance and Comparative Evaluation

Evaluation occurs on four mathematical reasoning datasets: AIME2025, AIME2024, MATH-500, and OlympiadBench. Metrics include Pass@1 and Pass@n accuracies, diversity scores (Distinct-n, 1 − Self-BLEU, 1 − Self-ROUGE), LLM-Judge outputs, average response length, and perplexity (PPL).

  • On Qwen2.5-7B, IIB-LPO achieves up to $+5.3\%$ accuracy over the best baseline (e.g., MATH-500 Pass@1 from $54.4\% \rightarrow 81.5\%$).
  • Diversity improvements reach up to $+7.4\%$ (Distinct-4, Self-BLEU).
  • Average token length is maintained or reduced compared to entropy regularization, avoiding verbosity.

Baselines encompass entropy regularization (Entropy-Reg, Entropy-Adv), token-selective variants (KL-Cov, 80/20), and self-reward mechanisms (SPINE, SRLM). Ablation studies establish:

  • Entropy-driven branching yields greater structural diversity than random or likelihood-driven branching.
  • PSA-based latent injection outperforms input-level fusion, softmax-layer fusion, and omitting injection altogether.
  • Full IB loss and pruning lower PPL (−11.7) while boosting accuracy (+1.8%) and diversity.

7. Significance, Advantages, and Trade-offs

IIB-LPO's primary advantage is its shift from token-level stochasticity to trajectory-level structural diversification, overcoming pre-trained inductive biases and generating genuinely varied reasoning templates. The dual role of the IB—both for trajectory pruning and as a self-reward—resolves the ambiguity–exploration trade-off inherent in simple entropy-based methods. PSA’s integration of latent codes into transformer intermediates targets problem-difficulty-sensitive attention heads, rather than dispersing gradients.

Trade-offs include increased computational demand due to $K$-fold trajectory branching and IB scoring per rollout, as well as additional hyperparameter tuning requirements. Despite these costs, the principled information-theoretic formulation provides robust improvements in both accuracy and diversity across a range of mathematical reasoning benchmarks, without incurring the verbosity associated with global entropy regularization (Deng et al., 9 Jan 2026).
