Latent Policy Optimization via IIB-LPO
- The paper introduces a novel iteration of the information bottleneck that mitigates exploration collapse by enabling latent branching of reasoning trajectories.
- It employs CVAE-based latent code sampling and pseudo self-attention injection to maintain concise, diverse outputs while avoiding over-optimization.
- Empirical results on mathematical reasoning benchmarks demonstrate improved accuracy and diversity with reduced verbosity compared to entropy-based methods.
Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO) is an approach for enhancing exploration in reinforcement learning with verifiable rewards (RLVR), particularly in LLM reasoning tasks. It addresses the pervasive issue of "exploration collapse," where random rollouts become semantically homogeneous, leading to mode collapse and over-optimized, structurally identical reasoning paths. Unlike prior methods that rely on entropy regularization—which may induce reward hacking and vacuous verbosity—or token-selective updates, IIB-LPO shifts the exploration paradigm to topological branching of reasoning trajectories. The method leverages the Information Bottleneck (IB) principle as both a trajectory filter and a self-reward mechanism, promoting diversity and conciseness while avoiding overfitting to prompt wording or model biases (Deng et al., 9 Jan 2026).
1. Problem Motivation and Exploration Collapse
IIB-LPO is motivated by failures of existing RLVR methods in LLM reasoning. Random trajectory sampling in these settings leads to "mode collapse," where outputs are superficially different but structurally similar due to the strong inductive bias of pre-trained models. Entropy-based global regularization may encourage reward hacking, generating verbose yet vacuous answers, while local (token-level) entropy updates are overwhelmed by model priors. IIB-LPO circumvents these pitfalls by introducing explicit structural diversity through latent branching at points of maximal uncertainty, then selecting for informative and concise trajectories through an information bottleneck filter.
2. Theoretical Framework: Information Bottleneck and Entropy-Driven Branching
IIB-LPO treats each reasoning trajectory as a bottleneck between the prompt and the final answer. The IB objective formalizes the trade-off between the mutual information the trajectory shares with the answer (informativeness) and the mutual information it shares with the prompt (compression), with a trade-off coefficient weighting informativeness against compression.
Each trajectory receives an IB score derived from this objective. Under standard autoregressive and correct-answer assumptions, the score can be approximated per trajectory from two per-step quantities: the RL advantage and the policy entropy.
This auxiliary reward, averaged over a pruned set of top-scoring trajectories, augments the main policy objective. The total loss for the policy parameters therefore combines a PPO-like loss with the mean IB score of the retained trajectories, scaled by an IB weight. The mean is taken over the IB-pruned set and normalized by the number of retained trajectories.
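For orientation, one common way to write the information bottleneck trade-off (the form used in variational IB work) is shown below. The symbols $X$ (prompt), $Z$ (reasoning trajectory), $Y$ (final answer), and $\beta$ (trade-off coefficient) are notation assumed here for illustration; the paper's own symbols, sign conventions, and per-trajectory approximation may differ.

$$
\max_{p(z \mid x)} \; I(Z; Y) \;-\; \beta \, I(X; Z)
$$

The first term rewards trajectories that retain answer-relevant information, while the second penalizes information carried over from the prompt, which is the sense in which the trajectory acts as a compressed intermediate.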
In contrast to statistical perturbations of output distributions, IIB-LPO implements topological exploration by branching a single trajectory into multiple continuations at high-entropy states. These states are the top 5% highest-entropy steps in a rollout, identified by thresholding token entropy at the 95th percentile over all steps. The contexts preceding these split points become seeds for further distinct continuations.
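The split-point selection can be illustrated with a minimal sketch, assuming per-step next-token distributions are available as a NumPy array. The function names `token_entropies`, `split_points`, and `branch_contexts` are hypothetical and not taken from the paper.

```python
import numpy as np

def token_entropies(probs):
    """Shannon entropy (in nats) of each row of a (steps, vocab) probability matrix."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

def split_points(entropies, percentile=95.0):
    """Indices of steps whose entropy lies strictly above the percentile threshold."""
    entropies = np.asarray(entropies, dtype=float)
    threshold = np.percentile(entropies, percentile)
    return np.flatnonzero(entropies > threshold)

def branch_contexts(tokens, indices):
    """Prefix of the token sequence before each split point (the branching seeds)."""
    return [tokens[:int(i)] for i in indices]

# Toy usage: 100 steps with increasing entropy -> the last 5 steps are split points.
toy_entropies = np.arange(100.0)
toy_splits = split_points(toy_entropies)
toy_seeds = branch_contexts(list(range(100)), toy_splits)
```

With a strict inequality and linear-interpolation percentiles, roughly the top 5% of steps are selected; ties at the threshold are excluded, which is a design choice of this sketch rather than something the source specifies.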
3. Methodology: Latent Branching, CVAE Sampling, and PSA Injection
Latent branching is triggered at the entropy-identified split points. For each context preceding a high-entropy token in a base trajectory, latent codes are sampled from the prior of a pre-trained Conditional Variational Autoencoder (CVAE). Each latent code is injected into the LLM using Pseudo Self-Attention (PSA):
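- Adaptive Norm Modulation: At each transformer layer, weights are modulated by a latent-dependent term, scaled by an injection coefficient that decays over time and a factor that aligns scale.
- Augmented Attention: The latent code is concatenated with the key and value projections to form the augmented keys $K'_j$ and values $V'_j$, and attention is computed as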
$\mathrm{PSA}(Q, K'_j, V'_j) = \mathrm{softmax}\left(\frac{Q {K'_j}^{\top}}{\sqrt{d_k}}\right) V'_j$
PSA thereby directly steers deep-layer intermediate representations.
The original (unsplit) trajectory is preserved, so each rollout contributes its branched continuations plus the original trajectory as candidates.
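The augmented-attention step can be sketched in NumPy as follows. This is a single-head illustration under stated assumptions: the projection matrices `W_k` and `W_v` that map the latent code into key and value space are hypothetical, the latent slot is prepended rather than appended, and causal masking and the norm modulation are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def psa_attention(Q, K, V, z, W_k, W_v):
    """Attention over keys/values augmented with projections of a latent code.

    Shapes: Q (T, d_k), K (T, d_k), V (T, d_v), z (d_z,),
            W_k (d_z, d_k), W_v (d_z, d_v)  # hypothetical projections
    """
    k_z = (z @ W_k)[None, :]                   # latent key, shape (1, d_k)
    v_z = (z @ W_v)[None, :]                   # latent value, shape (1, d_v)
    K_aug = np.concatenate([k_z, K], axis=0)   # (T + 1, d_k)
    V_aug = np.concatenate([v_z, V], axis=0)   # (T + 1, d_v)
    scores = Q @ K_aug.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V_aug    # (T, d_v)

# Toy usage with random queries/keys and a latent code of size 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))
K = rng.normal(size=(5, 16))
V = np.ones((5, 8))
z = np.ones(4)
W_k = rng.normal(size=(4, 16))
W_v = np.full((4, 8), 0.25)
out = psa_attention(Q, K, V, z, W_k, W_v)
```

Because every query can attend to the extra latent slot, the latent code influences each position's output without altering the token sequence itself.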
4. Trajectory Selection via IB Pruning and Self-Reward
After rollout and branching, the full set of candidate trajectories (whose size is determined by the number of base rollouts and the branching factor) is evaluated using the IB score. The top-scoring trajectories are retained as the IB-pruned set. Their scores are averaged and included as an auxiliary reward in the policy gradient update, biasing the optimization towards structurally diverse but concise and informative trajectories.
This selection mechanism achieves a dual function: it prunes verbose, low-information responses and rewards trajectories that effectively compress the prompt while preserving answer-relevant details. As a result, exploration is shaped not merely by statistical entropy but by substantive information-theoretic criteria.
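The pruning and averaging step can be sketched as below. The scores are placeholder numbers: how the IB score itself is computed is not reproduced here, and the function names `ib_prune` and `ib_self_reward` are hypothetical.

```python
import numpy as np

def ib_prune(scores, keep):
    """Indices of the `keep` highest-scoring trajectories, best first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # descending; ties keep input order
    return order[:keep]

def ib_self_reward(scores, keep):
    """Mean IB score over the retained (pruned) set."""
    scores = np.asarray(scores, dtype=float)
    return float(scores[ib_prune(scores, keep)].mean())

# Toy usage: four candidate trajectories, keep the best two.
candidate_scores = [0.1, 0.9, 0.5, 0.7]
kept = ib_prune(candidate_scores, keep=2)
reward = ib_self_reward(candidate_scores, keep=2)
```

In a training loop, this mean would be scaled by the IB self-reward weight before being combined with the main policy objective; the exact sign and placement in the loss follow the paper and are not assumed here.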
5. Implementation Details and Hyperparameters
IIB-LPO utilizes the following core variables and hyperparameters:
| Variable | Description | Typical Value |
|---|---|---|
| Input prompt | | - |
| Token sequence | | - |
| Initial rollout count | | 4 |
| Branching factor | | 7 |
| Number kept after IB pruning | | 8 |
| Token entropy at each step | | - |
| Entropy threshold | 95th percentile | Computed |
| Context before split | | - |
| CVAE latent code | | 128 |
| PSA injection decay | | … to … |
| IB trade-off parameter | | 2 |
| RL advantage at each step | | - |
| IB self-reward weight | | 0.003 |
| Explicit IB loss coefficient | | 0.005 (if used) |
| Learning rate | Policy optimization | $1 \times 10^{-6}$ (Qwen2.5), $5 \times 10^{-6}$ (Qwen3) |
| Batch size | Optimization batch | 128 |
| Response length | Max tokens output | 8192 |
| Prompt length | Max prompt tokens | 2048 |
| PPO clip range | | - |
The approach is implemented on LLMs such as Qwen2.5-7B and Qwen3, with training following the branching, latent injection, and IB pruning stages described in the method.
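For readers who want the listed values in one place, the following is a minimal configuration sketch. The class and field names are hypothetical, only values stated in the table are included, and reading the CVAE entry of 128 as a latent dimension is an assumption.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class IIBLPOConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    initial_rollouts: int = 4
    branching_factor: int = 7
    kept_after_pruning: int = 8
    entropy_percentile: float = 95.0
    cvae_latent_size: int = 128          # assumed to be a dimension
    ib_tradeoff: float = 2.0
    ib_self_reward_weight: float = 0.003
    ib_loss_coefficient: float = 0.005   # only if the explicit IB loss is used
    learning_rate: float = 1e-6          # value listed for Qwen2.5
    batch_size: int = 128
    max_response_tokens: int = 8192
    max_prompt_tokens: int = 2048

qwen25_config = IIBLPOConfig()
# The table lists a larger learning rate for Qwen3.
qwen3_config = replace(qwen25_config, learning_rate=5e-6)
```

Freezing the dataclass keeps a run's settings immutable, and `replace` makes per-model variations explicit.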
6. Empirical Performance and Comparative Evaluation
Evaluation covers four mathematical reasoning datasets: AIME2025, AIME2024, MATH-500, and OlympiadBench. Metrics include Pass@1 and Pass@n accuracy, diversity scores (Distinct-n, $1 - \text{Self-BLEU}$, $1 - \text{Self-ROUGE}$), LLM-Judge outputs, average response length, and perplexity (PPL).
- On Qwen2.5-7B, IIB-LPO improves accuracy over the best baseline, for example on MATH-500 Pass@1.
- Diversity also improves, as measured by Distinct-4 and Self-BLEU.
- Average token length is maintained or reduced compared to entropy regularization, avoiding verbosity.
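As an illustration of the Distinct-n diversity metric named above, the sketch below computes the ratio of unique n-grams to total n-grams over a set of responses. Whitespace tokenization and corpus-level pooling are assumptions of this sketch; the paper's exact tokenization and aggregation may differ.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams across all texts (0.0 if none)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# Toy usage: repeated bigrams lower the score.
repetitive = distinct_n(["a b a b"], 2)
varied = distinct_n(["a b c d"], 2)
```

Higher values indicate less repetition across responses, which is why the metric is reported alongside accuracy.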
Baselines encompass entropy regularization (Entropy-Reg, Entropy-Adv), token-selective variants (KL-Cov, 80/20), and self-reward mechanisms (SPINE, SRLM). Ablation studies establish:
- Entropy-driven branching yields greater structural diversity than random or likelihood-driven branching.
- PSA-based latent injection outperforms input-level fusion, softmax-layer fusion, and omitting the latent code altogether.
- The full IB loss with pruning lowers PPL ($-11.7$) while raising accuracy (+1.8%) and diversity.
7. Significance, Advantages, and Trade-offs
IIB-LPO's primary advantage is its shift from token-level stochasticity to trajectory-level structural diversification, overcoming pre-trained inductive biases and generating genuinely varied reasoning templates. The dual role of the IB—both for trajectory pruning and as a self-reward—resolves the ambiguity–exploration trade-off inherent in simple entropy-based methods. PSA’s integration of latent codes into transformer intermediates targets problem-difficulty-sensitive attention heads, rather than dispersing gradients.
Trade-offs include increased computational demand, since every rollout is branched multiple times and each candidate must be IB-scored, as well as additional hyperparameter tuning requirements. Despite these costs, the principled information-theoretic formulation provides robust improvements in both accuracy and diversity across a range of mathematical reasoning benchmarks, without incurring the verbosity associated with global entropy regularization (Deng et al., 9 Jan 2026).