Latent Policy Optimization via IIB-LPO

Updated 16 January 2026

The paper introduces an iterative information bottleneck method that mitigates exploration collapse by enabling latent branching of reasoning trajectories.
It employs CVAE-based latent code sampling and pseudo self-attention injection to maintain concise, diverse outputs while avoiding over-optimization.
Empirical results on mathematical reasoning benchmarks demonstrate improved accuracy and diversity with reduced verbosity compared to entropy-based methods.

Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO) is an approach for enhancing exploration in reinforcement learning with verifiable rewards (RLVR), particularly in LLM reasoning tasks. It addresses the pervasive issue of "exploration collapse," where random rollouts become semantically homogeneous, leading to mode collapse and over-optimized, structurally identical reasoning paths. Unlike prior methods that rely on entropy regularization—which may induce reward hacking and vacuous verbosity—or token-selective updates, IIB-LPO shifts the exploration paradigm to topological branching of reasoning trajectories. The method leverages the Information Bottleneck (IB) principle as both a trajectory filter and a self-reward mechanism, promoting diversity and conciseness while avoiding overfitting to prompt wording or model biases (Deng et al., 9 Jan 2026).

1. Problem Motivation and Exploration Collapse

IIB-LPO is motivated by failures of existing RLVR methods in LLM reasoning. Random trajectory sampling in these settings leads to "mode collapse," where outputs are superficially different but structurally similar due to the strong inductive bias of pre-trained models. Entropy-based global regularization may encourage reward hacking, generating verbose yet vacuous answers, while local (token-level) entropy updates are overwhelmed by model priors. IIB-LPO circumvents these pitfalls by introducing explicit structural diversity through latent branching at points of maximal uncertainty, then selecting for informative and concise trajectories through an information bottleneck filter.

2. Theoretical Framework: Information Bottleneck and Entropy-Driven Branching

IIB-LPO treats each reasoning trajectory $r = (o_1, \ldots, o_T)$ as a bottleneck between prompt $q$ and final answer $a$. The IB objective formalizes the trade-off:

$L_{IB}(r) = I(q; r) - \beta I(r; a)$

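where $I(\cdot;\cdot)$ denotes mutual information, and $\beta > 0$ weights informativeness against compression. The IB score is the negated objective, $S_{IB}(r) = -L_{IB}(r) = \beta I(r; a) - I(q; r)$, so a high score marks a trajectory that is informative about the answer while compressing the prompt. Under standard autoregressive and correct-answer assumptions, $S_{IB}(r)$ can be approximated per-trajectory as: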

$S_{IB}(r) \approx \sum_{t=1}^T A_t\, H(o_t \vert o_{<t}, q)$

where $A_t$ is the RL advantage at step $t$ and $H(\cdot)$ is the policy entropy. This auxiliary reward, averaged over a pruned set of top-scoring trajectories, augments the main policy objective. The total objective for policy parameters $\theta$ thus becomes:

$J(\theta) = J_{GRPO}(\theta) + \gamma \cdot (1/N)\sum_{r \in R^*} S_{IB}(r)$

where $J_{GRPO}$ denotes the GRPO objective (a PPO-style clipped surrogate), $\gamma$ is the IB self-reward weight, $N$ is the number of retained trajectories, and $R^*$ is the IB-pruned set.
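
The per-trajectory approximation and the augmented objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, entropies are computed in nats from raw logits, and `j_grpo` is assumed to be a scalar already produced by the main policy objective.

```python
import numpy as np

def token_entropies(logits):
    """Entropy (in nats) of the softmax distribution at each step.

    logits: array of shape (T, V), one row of vocabulary logits per step.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)

def ib_score(advantages, entropies):
    """S_IB(r) ~= sum_t A_t * H(o_t | o_<t, q)."""
    return float(np.dot(advantages, entropies))

def augmented_objective(j_grpo, retained_scores, gamma=0.003):
    """J = J_GRPO + gamma * (1/N) * sum of S_IB over the retained set R*."""
    return j_grpo + gamma * float(np.mean(retained_scores))

# Toy usage: a 3-step trajectory over a 5-token vocabulary.
rng = np.random.default_rng(0)
H = token_entropies(rng.standard_normal((3, 5)))
score = ib_score(np.array([0.5, -0.2, 1.0]), H)
total = augmented_objective(j_grpo=1.0, retained_scores=[score])
```

Because each entropy term is weighted by its advantage, a high-entropy step only raises the score when the step is also advantageous, which is what separates this reward from plain entropy regularization.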

In contrast to statistical perturbations of output distributions, IIB-LPO implements topological exploration by bifurcating a single trajectory into $K > 1$ continuations at high-entropy states. These are the steps whose entropy exceeds a threshold $\tau$ set at the 95th percentile over all steps in a rollout, i.e. the top 5% highest-entropy steps. The contexts preceding these split points become seeds for further distinct continuations.
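
The split-point rule can be illustrated as follows. The sketch assumes the percentile is taken over the steps of a single rollout and that ties at the threshold are excluded (strict inequality); neither detail is specified in the text, and the helper names are hypothetical.

```python
import numpy as np

def select_split_points(entropies, percentile=95.0):
    """Return (tau, indices of steps whose entropy exceeds tau)."""
    entropies = np.asarray(entropies, dtype=float)
    tau = np.percentile(entropies, percentile)   # threshold at the 95th percentile
    split_steps = np.flatnonzero(entropies > tau)
    return tau, split_steps

def branch_contexts(tokens, split_steps):
    """Prefix of the rollout preceding each high-entropy token."""
    return [tokens[:t] for t in split_steps]

# Toy usage: 100 steps with strictly increasing entropy.
tau, steps = select_split_points(np.arange(100.0))
seeds = branch_contexts(list(range(100)), steps)
```

Each returned prefix plays the role of a context $c_{t^*}$ from which $K$ alternative continuations would be generated.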

3. Methodology: Latent Branching, CVAE Sampling, and PSA Injection

Latent branching is triggered at entropy-identified split points. For each context $c_{t^*}$ preceding a high-entropy token in a base trajectory, $K$ latent codes $z^{(j)}$ are sampled from a pre-trained Conditional Variational Autoencoder (CVAE) prior $p(z|c_{t^*})$. Each $z^{(j)}$ is injected into the LLM using Pseudo Self-Attention (PSA):

Adaptive Norm Modulation: At each transformer layer $\ell$, weights are modulated as

$w_{\ell}' = w_{\ell} + \gamma(t) \cdot \text{Proj}_{\phi}(z^{(j)})$

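where $\gamma(t)$ is an injection strength that decays over time (distinct from the IB weight $\gamma$ in $J(\theta)$) and $\text{Proj}_{\phi}$ is a projection, parameterized by $\phi$, that aligns the scale of the latent code with that of the weights.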

Augmented Attention: Latent code $z^{(j)}$ is concatenated with key-value projections:

$K'_j = [z^{(j)}\, K; K],\quad V'_j = [z^{(j)}\, V; V]$

and attention is computed as

$PSA(Q, K'_j, V'_j) = \text{softmax}\left(\frac{Q {K'_j}^{\top}}{\sqrt{d_k}}\right) V'_j$

PSA thereby directly steers deep-layer intermediate representations.

The original (unsplit) trajectory is preserved, resulting in $K+1$ candidate continuations per rollout.
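
A single-head NumPy sketch of the two injection mechanisms may help make them concrete. Several things here are assumptions rather than details from the source: the latent code is mapped into key and value space by hypothetical matrices `W_zk` and `W_zv` and prepended as one extra key/value row; $\gamma(t)$ is taken to decay geometrically between the two endpoint values listed in the hyperparameter table; standard-normal draws stand in for samples from the CVAE prior; and causal masking is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gamma_schedule(t, total_steps, start=5e-2, end=5e-4):
    """Geometric decay from `start` to `end` (the decay shape is an assumption)."""
    frac = t / max(total_steps - 1, 1)
    return start * (end / start) ** frac

def modulate_weights(w, z, proj, gamma_t):
    """w' = w + gamma(t) * Proj(z); `proj` stands in for Proj_phi."""
    return w + gamma_t * (z @ proj)

def psa_attention(Q, K, V, z, W_zk, W_zv):
    """Attention over keys/values augmented with one latent-derived row.

    Q, K, V: (T, d_k); z: (d_z,); W_zk, W_zv: (d_z, d_k).
    """
    z_k = (z @ W_zk)[None, :]                  # latent as an extra key, (1, d_k)
    z_v = (z @ W_zv)[None, :]                  # latent as an extra value, (1, d_k)
    K_aug = np.concatenate([z_k, K], axis=0)   # (T + 1, d_k)
    V_aug = np.concatenate([z_v, V], axis=0)   # (T + 1, d_k)
    scores = Q @ K_aug.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V_aug

# Toy usage: one split point, K = 7 latent branches.
rng = np.random.default_rng(0)
T, d_k, d_z, num_branches = 5, 8, 128, 7
Q, K, V = (rng.standard_normal((T, d_k)) for _ in range(3))
W_zk = rng.standard_normal((d_z, d_k)) / np.sqrt(d_z)
W_zv = rng.standard_normal((d_z, d_k)) / np.sqrt(d_z)
latents = rng.standard_normal((num_branches, d_z))  # stand-in for CVAE prior samples
branch_outputs = [psa_attention(Q, K, V, z, W_zk, W_zv) for z in latents]
```

Every query position can attend to the latent row, so each $z^{(j)}$ shifts the intermediate representation of the whole continuation rather than a single token.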

4. Trajectory Selection via IB Pruning and Self-Reward

After rollout and branching, the full set $R$ of candidate trajectories (with total size $M \cdot (K+1)$, where $M$ is the number of base rollouts) is evaluated using the IB score $S_{IB}(r)$. The top $N$ trajectories by descending $S_{IB}$ are retained as $R^*$. These scores are averaged and included as an auxiliary reward in the policy gradient update, biasing the optimization towards structurally diverse but concise and informative trajectories.

This selection mechanism achieves a dual function: it prunes verbose, low-information responses and rewards trajectories that effectively compress the prompt while preserving answer-relevant details. As a result, exploration is shaped not merely by statistical entropy but by substantive information-theoretic criteria.
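
The pruning step reduces to a top-$N$ selection followed by a mean. The sketch below assumes the IB scores have already been computed for every candidate; the function names are hypothetical.

```python
import numpy as np

def candidate_count(num_rollouts, branching_factor):
    """|R| = M * (K + 1): each base rollout plus its K latent branches."""
    return num_rollouts * (branching_factor + 1)

def ib_prune(scores, n_keep):
    """Keep the n_keep highest-scoring trajectories.

    Returns the retained indices (best first) and their mean IB score,
    which serves as the auxiliary self-reward term.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # descending by S_IB
    kept = order[:n_keep]
    return kept, float(scores[kept].mean())

# Toy usage with four candidates, keeping two.
kept, aux_reward = ib_prune([0.1, 0.9, 0.5, 0.7], n_keep=2)
```

With $M = 4$ and $K = 7$, `candidate_count(4, 7)` gives 32 candidates, of which $N = 8$ would be retained.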

5. Implementation Details and Hyperparameters

IIB-LPO utilizes the following core variables and hyperparameters:

Variable	Description	Typical Value
$q$	Input prompt	-
$r$	Token sequence $(o_1,\dots,o_T)$	-
$M$	Initial rollout count	4
$K$	Branching factor	7
$N$	Number kept after IB pruning	8
$H_t$	Token entropy at step $t$	-
$\tau$	Entropy threshold (95th percentile)	Computed
$c_{t^*}$	Context before split	-
$z$	CVAE latent code (dimension $d_z$)	128
$\gamma(t)$	PSA injection decay	$5\cdot10^{-2}$ to $5\cdot10^{-4}$
$\beta$	IB trade-off parameter in $L_{IB}$	2
$A_t$	RL advantage at step $t$	-
$\gamma$	IB self-reward weight	0.003
$\alpha$	Explicit IB loss coefficient	0.005 (if used)
Learning rate	Policy optimization	$1\cdot10^{-6}$ (Qwen2.5), $5\cdot10^{-6}$ (Qwen3)
Batch size	Optimization batch	128
Response length	Max tokens output	8192
Prompt length	Max prompt tokens	2048
$\epsilon$	PPO clip range	$[0.2,\,0.28]$

The approach is implemented on LLMs such as Qwen2.5-7B and Qwen3; branching, PSA injection, and IB pruning follow the modular structure described in the preceding sections.
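
For reference, the tabulated values could be gathered into a single configuration object, sketched here as a frozen dataclass. The field names are hypothetical and not taken from any released code. The clip range is stored as two bounds because the table does not say whether $[0.2, 0.28]$ denotes asymmetric lower and upper clips or a range of values that were tried.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class IIBLPOConfig:
    num_rollouts: int = 4              # M, initial rollout count
    branching_factor: int = 7          # K, latent branches per split
    num_kept: int = 8                  # N, trajectories kept after IB pruning
    entropy_percentile: float = 95.0   # defines the threshold tau
    latent_dim: int = 128              # d_z, CVAE latent size
    psa_gamma_start: float = 5e-2      # gamma(t), initial injection strength
    psa_gamma_end: float = 5e-4        # gamma(t), final injection strength
    ib_beta: float = 2.0               # beta in L_IB
    ib_reward_weight: float = 0.003    # gamma, IB self-reward weight
    ib_loss_coef: float = 0.005        # alpha, explicit IB loss (if used)
    learning_rate: float = 1e-6        # Qwen2.5 setting
    batch_size: int = 128
    max_response_tokens: int = 8192
    max_prompt_tokens: int = 2048
    clip_range_low: float = 0.2        # epsilon bounds as listed in the table
    clip_range_high: float = 0.28

    @property
    def num_candidates(self) -> int:
        """M * (K + 1) trajectories scored before pruning."""
        return self.num_rollouts * (self.branching_factor + 1)

qwen25_cfg = IIBLPOConfig()
qwen3_cfg = replace(qwen25_cfg, learning_rate=5e-6)  # Qwen3 learning rate
```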

6. Empirical Performance and Comparative Evaluation

Evaluation occurs on four mathematical reasoning datasets: AIME2025, AIME2024, MATH-500, and OlympiadBench. Metrics include Pass@1 and Pass@n accuracies, diversity scores (Distinct-n, $1$–Self-BLEU, $1$–Self-ROUGE), LLM-Judge outputs, average response length, and perplexity (PPL).
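
Of the diversity metrics listed, Distinct-n is simple enough to sketch directly: it is the ratio of unique n-grams to total n-grams. The version below assumes whitespace tokenization and pools n-grams across all responses to a prompt; the paper's exact tokenization and aggregation are not stated here.

```python
def distinct_n(responses, n):
    """Unique n-grams divided by total n-grams, pooled over all responses."""
    total = 0
    unique = set()
    for text in responses:
        tokens = text.split()  # whitespace tokenization (an assumption)
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# Two identical responses halve the unigram ratio relative to a single copy.
repeated = distinct_n(["a b c", "a b c"], 1)
varied = distinct_n(["a b c", "d e f"], 1)
```

Higher values indicate less repetition across the sampled responses, which is why the metric is reported alongside the Self-BLEU and Self-ROUGE complements.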

On Qwen2.5-7B, IIB-LPO achieves up to $+5.3\%$ accuracy over the best baseline (e.g., MATH-500 Pass@1 from $54.4\% \rightarrow 81.5\%$).
Diversity improvements reach up to $+7.4\%$ (Distinct-4, Self-BLEU).
Average token length is maintained or reduced compared to entropy regularization, avoiding verbosity.

Baselines encompass entropy regularization (Entropy-Reg, Entropy-Adv), token-selective variants (KL-Cov, 80/20), and self-reward mechanisms (SPINE, SRLM). Ablation studies establish:

Entropy-driven branching yields greater structural diversity than random or likelihood-driven branching.
PSA-based latent injection outperforms input-level fusion, softmax-layer fusion, and omitting the latent code altogether.
Full IB loss and pruning lower PPL (–11.7), while boosting accuracy (+1.8%) and diversity.

7. Significance, Advantages, and Trade-offs

IIB-LPO's primary advantage is its shift from token-level stochasticity to trajectory-level structural diversification, overcoming pre-trained inductive biases and generating genuinely varied reasoning templates. The dual role of the IB—both for trajectory pruning and as a self-reward—resolves the ambiguity–exploration trade-off inherent in simple entropy-based methods. PSA’s integration of latent codes into transformer intermediates targets problem-difficulty-sensitive attention heads, rather than dispersing gradients.

Trade-offs include increased computational demand due to $K$-fold trajectory branching and IB scoring per rollout, as well as additional hyperparameter tuning requirements. Despite these costs, the principled information-theoretic formulation provides robust improvements in both accuracy and diversity across a range of mathematical reasoning benchmarks, without incurring the verbosity associated with global entropy regularization (Deng et al., 9 Jan 2026).

References

Deng et al. (2026). IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck.
