Information Gain-based Policy Optimization (IGPO)

Updated 26 April 2026

Information Gain-based Policy Optimization (IGPO) is a reinforcement learning approach that uses intrinsic, information-theoretic rewards to reduce epistemic uncertainty in multi-turn decision-making.
It employs rigorous mathematical formulations—such as likelihood-based, entropy-based, and counterfactual KL methods—to compute dense, per-step rewards that enhance credit assignment and learning stability.
IGPO improves sample efficiency and overall policy performance in applications like LLM-based reasoning, search augmentation, and hierarchical dialogue tasks, as demonstrated by empirical benchmarks.

Information Gain-based Policy Optimization (IGPO) is an approach to reinforcement learning (RL) that employs information-theoretic reward signals, notably information gain (IG), to provide dense and targeted supervision for multi-turn, multi-step, or hierarchical decision processes. IGPO leverages intrinsic rewards based on model uncertainty reduction or empirical improvements in belief over the ground-truth, enhancing credit assignment and sample efficiency compared to sparse, outcome-only reward RL paradigms.

1. Foundations and Motivations

Standard RL methods for sequential decision problems, especially those involving LLMs or hierarchical policies, often rely on rewards delivered only upon episode or rollout termination. This reward sparsity yields weak credit assignment: early-stage reasoning, information-seeking, or exploration may be vital but invisible to the outcome-based reward function, degrading both learning stability and policy quality. IGPO directly addresses these limitations by furnishing immediate, fine-grained intrinsic rewards whenever the agent’s action measurably reduces its epistemic uncertainty about relevant variables (e.g., the environment, the ground-truth answer, or the user's intent) (Wang et al., 16 Oct 2025, Geishauser et al., 2021, Kong et al., 28 Feb 2026).

In multi-turn agentic settings (e.g., information-seeking dialogue, tool-augmented search, collaborative coding), IGPO’s dense per-turn supervision circumvents advantage collapse—where groups of trajectories with identical outcomes provide no learning signal—and improves sample efficiency and robustness (Wang et al., 16 Oct 2025, Kong et al., 28 Feb 2026).

2. Formalization: Information Gain as Intrinsic Reward

The IGPO paradigm encodes information gain in various mathematically rigorous forms, unified by their connection to reductions in uncertainty or increases in policy belief about the ground-truth. Key formulations include:

Likelihood-based Information Gain: In multi-turn LLM RL, the per-turn IG reward at turn $t$ is defined as the increase in the model’s probability of producing the correct answer $a^*$ given the query $q$ and the trajectory prefix $\tau_{1\!:\!t} = (\tau_1,\dots,\tau_t)$:

$r_t^{IG} = P_\theta(a^* \mid q, \tau_{1\!:\!t}) - P_\theta(a^* \mid q, \tau_{1\!:\!t-1})$

with $P_\theta(a^* \mid \cdot)$ estimated from the average log-probability of the answer tokens (Wang et al., 16 Oct 2025). The reward is negative whenever a turn lowers this probability.
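The likelihood-based reward can be illustrated with a minimal Python sketch. It assumes the per-token log-probabilities of the answer are already available for each trajectory prefix, and it assumes the answer probability is recovered by exponentiating the mean token log-probability; the function names and the numbers are hypothetical, not taken from the cited work.

```python
import math

def answer_likelihood(token_logprobs):
    """Length-normalized likelihood: exp of the mean answer-token log-probability.
    (Assumed estimator; the source only says log-probabilities are averaged.)"""
    if not token_logprobs:
        raise ValueError("need at least one answer token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def ig_rewards(logprobs_per_prefix):
    """logprobs_per_prefix[t] holds the answer-token log-probs given the first t turns
    (index 0 = query only). Returns [r_1, ..., r_T]."""
    probs = [answer_likelihood(lp) for lp in logprobs_per_prefix]
    return [probs[t] - probs[t - 1] for t in range(1, len(probs))]

# Hypothetical log-probs for a two-token answer after 0, 1, 2, and 3 turns.
logprobs = [
    [math.log(0.1), math.log(0.1)],
    [math.log(0.4), math.log(0.4)],
    [math.log(0.2), math.log(0.2)],  # this turn hurts: reward becomes negative
    [math.log(0.9), math.log(0.9)],
]
rewards = ig_rewards(logprobs)
```

In this toy case the second turn receives a negative reward, which is the kind of per-turn signal an outcome-only reward would not expose.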

Entropy-based and Kullback–Leibler (KL) Formulations: When framing the agent’s belief state $b_t$ as a distribution over semantic classes or possible slot values, IG is the reduction in Shannon entropy:

$IG_t = H(b_t) - H(b_{t+1})$

Alternatively, expected IG can be formulated as the expected KL divergence between new and old beliefs (Hu et al., 31 Jan 2026).
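As a small numeric illustration of the entropy and KL forms, the sketch below computes both quantities for one belief update over four hypothetical classes. The belief vectors are invented for the example, and natural logarithms are assumed.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

b_t = [0.25, 0.25, 0.25, 0.25]   # hypothetical belief before the action
b_next = [0.7, 0.1, 0.1, 0.1]    # hypothetical belief after the action

ig = entropy(b_t) - entropy(b_next)  # IG_t = H(b_t) - H(b_{t+1})
kl_new_old = kl(b_next, b_t)         # KL between new and old beliefs
```

Here the two values coincide only because the old belief is uniform; for a non-uniform prior the entropy drop and the KL divergence generally differ.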

Mutual Information and Counterfactual KL: IGPO generalizations compute the mutual information between the feedback $O_t$ received at turn $t$ and the next agent action $A_{t+1}$, conditioned on the history $H_t$:

$I(O_t; A_{t+1} \mid H_t)$

Practically, this is estimated via log-probability differences of the agent’s next action under actual versus counterfactual (“masked” or randomized) feedback (Kong et al., 28 Feb 2026, Liang et al., 16 Apr 2026).
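A minimal sketch of the log-probability-difference estimate is given below. It assumes the token log-probabilities of the agent's next action have already been scored twice, once with the real feedback in context and once with masked feedback, and it assumes the differences are summed over tokens (averaging would be an equally plausible choice). The function name and values are hypothetical.

```python
def counterfactual_ig(logp_actual, logp_counterfactual):
    """Sum over next-action tokens of
    log pi(token | history, feedback) - log pi(token | history, masked feedback)."""
    if len(logp_actual) != len(logp_counterfactual):
        raise ValueError("both scorings must cover the same action tokens")
    return sum(a - c for a, c in zip(logp_actual, logp_counterfactual))

# Hypothetical token log-probs for the same next action under both contexts.
with_feedback = [-0.2, -0.5, -0.1]
masked_feedback = [-1.0, -0.9, -0.4]
turn_ig = counterfactual_ig(with_feedback, masked_feedback)
```

A positive value indicates that the feedback made the chosen action more likely, i.e. the agent actually used what it was told.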

Jensen–Shannon Divergence in Dialogue: In hierarchical dialogue management, IG is computed as the JS divergence between belief state slot distributions before and after information-seeking actions. Reward is activated above a calibrated threshold for significant uncertainty reduction (Geishauser et al., 2021).
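The thresholded JS-divergence reward can be sketched as follows. The threshold value, the binary reward magnitude, and the slot distributions are assumptions made for illustration; the source states only that the reward is activated above a calibrated threshold.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def slot_ig_reward(before, after, threshold=0.1):
    """Hypothetical binary reward: 1.0 if the slot belief moved by more than threshold."""
    return 1.0 if js_divergence(before, after) > threshold else 0.0

before = [0.34, 0.33, 0.33]  # near-uniform belief over three slot values
after = [0.90, 0.05, 0.05]   # belief after the user answers a request
reward = slot_ig_reward(before, after)
```

An action that leaves the belief unchanged yields zero divergence and therefore no reward under this scheme.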

3. Algorithmic Implementation

IGPO augments or replaces the standard sparse outcome reward with step- or turn-level information-based intrinsic rewards, often combined with outcome-level supervision. The generic workflow across domains is as follows:

Rollout Sampling: For each query or initial state, sample a group of trajectories/episodes with the current policy.
Intrinsic Reward Computation:
- For each step or turn, compute IG by difference in ground-truth posterior, entropy, or counterfactual KL.
- In retrieval-augmented LLMs, run teacher-forced forward passes on both actual and counterfactual contexts.
Advantage Estimation:
- Normalize per-step intrinsic rewards within rollout groups.
- Optionally, fuse with outcome-based advantages using adaptive gates dependent on group outcome variance (Kong et al., 28 Feb 2026).
Policy Update: Update the policy with an RL optimizer (e.g., the policy-gradient methods Group Relative Policy Optimization, PPO, or ACER, or the value-based Dueling DDQN) to maximize the expected sum of per-turn rewards, typically with KL regularization toward a reference policy. Per-token or per-action advantages reflect both outcome and IG signals (Wang et al., 16 Oct 2025, Geishauser et al., 2021).
Specializations:
- IG-Search (Liang et al., 16 Apr 2026): Step-level IG rewards for search queries, using randomized document baselines.
- InfoPO (Kong et al., 28 Feb 2026): Counterfactual IG for user-feedback turns, adaptive gating.
- FeudalGain (Geishauser et al., 2021): JS divergence-based intrinsic reward for information-gathering actions in hierarchical RL.
- InfoReasoner (Hu et al., 31 Jan 2026): Semantic entropy decrease via bidirectional textual entailment clustering.
- MF-HRL-IGM (Sifaou et al., 18 Sep 2025): Information gain per simulated batch to guide fidelity selection in multi-fidelity hybrid RL.
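The reward-normalization and advantage steps of the workflow above can be sketched in a few lines of Python. This is one possible scheme under stated assumptions, not the procedure of any specific cited method: turn-level rewards (with the outcome reward as the last entry of each rollout) are pooled and standardized within the group, then accumulated into discounted returns that serve as per-turn advantages. All names and numbers are hypothetical.

```python
import statistics

def normalize_group(rewards_by_rollout):
    """Standardize all turn-level rewards across the rollouts of one query."""
    flat = [r for rollout in rewards_by_rollout for r in rollout]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    if std == 0:  # identical rewards carry no relative signal
        return [[0.0 for _ in rollout] for rollout in rewards_by_rollout]
    return [[(r - mean) / std for r in rollout] for rollout in rewards_by_rollout]

def discounted_returns(rewards, gamma=1.0):
    """Return-to-go for each turn: r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

# Two hypothetical rollouts: IG rewards for two turns, then the outcome reward.
group = [[0.3, -0.2, 1.0], [0.0, 0.1, 0.0]]
normalized = normalize_group(group)
advantages = [discounted_returns(r, gamma=0.9) for r in normalized]
```

Each turn's advantage would then be broadcast to the tokens generated in that turn before the policy update.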

4. Theoretical Properties and Guarantees

IGPO is grounded in information theory, with key theoretical guarantees:

Non-negativity: Expected information gain is non-negative, because a Bayesian update cannot increase entropy in expectation (Hu et al., 31 Jan 2026). The realized gain at a single step can still be negative.
Telescoping Additivity: Per-step IG sums to global reduction in uncertainty across an episode, ensuring all local improvements are accounted for (Hu et al., 31 Jan 2026).
Error-Snowball Bound: Reduction in cumulative “snowball error” (residual uncertainty) is bounded by the cumulative IG reward, implying a direct link between maximized IG and improved decision accuracy (Wang et al., 16 Oct 2025).
Necessity for Task Success: In multi-class inference tasks, Fano’s inequality implies that reliable inference requires a minimum amount of cumulative information about the target, so a policy whose expected cumulative IG falls short of that amount cannot reach low error (Kong et al., 28 Feb 2026).
No-Regret Guarantees (MF-HRL-IGM): In hybrid offline/online RL under budget constraints, per-unit-cost maximization of conditional mutual information yields no-regret bounds relative to resource-optimal strategies (Sifaou et al., 18 Sep 2025).
Channel Monotonicity: Among available actions, those yielding higher expected IG (more informative “channels”) are always preferred for epistemic progress (Hu et al., 31 Jan 2026).
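As a worked check of the telescoping property, take the likelihood-based reward from Section 2 and write $\tau_{1\!:\!0}$ for the empty prefix. The intermediate terms cancel, leaving only the first and last probabilities:

$\sum_{t=1}^{T} r_t^{IG} = \sum_{t=1}^{T} \left[ P_\theta(a^* \mid q, \tau_{1\!:\!t}) - P_\theta(a^* \mid q, \tau_{1\!:\!t-1}) \right] = P_\theta(a^* \mid q, \tau_{1\!:\!T}) - P_\theta(a^* \mid q)$

The entropy form behaves the same way over an episode of $T$ steps:

$\sum_{t=0}^{T-1} IG_t = \sum_{t=0}^{T-1} \left[ H(b_t) - H(b_{t+1}) \right] = H(b_0) - H(b_T)$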

5. Empirical Performance and Comparisons

IGPO and its variants report gains over their respective baselines across diverse RL settings and benchmarks:

Method	Core Setting	Primary IG Reward	Empirical Improvements	Cited Papers
IGPO (LLM QA)	Multi-turn LLMs	Answer-prob gain	+4.8 F1 over DeepResearcher (3B); OOD +3-7 F1	(Wang et al., 16 Oct 2025)
IG-Search	Search-augmented LMs	Step-level doc IG	+1.6 EM over MR-Search (multi-hop gains)	(Liang et al., 16 Apr 2026)
InfoPO	User-centric LLM agents	Counterfactual KL IG	+14–16% on UserGym over RAGEN	(Kong et al., 28 Feb 2026)
FeudalGain	Hierarchical Dialogue	JS Divergence slot IG	97.7% mean success vs 96.4% baselines	(Geishauser et al., 2021)
InfoReasoner	Retrieval LM QA	Semantic entropy IG	+4.7 EM over Search-R1; up to +5.4 RAG	(Hu et al., 31 Jan 2026)
MF-HRL-IGM	Multi-fidelity RL	Mutual Info per cost	Top return across all cost budgets	(Sifaou et al., 18 Sep 2025)

A common finding is that dense, step-level IG-based rewards enable more stable learning, rapid convergence, and substantially better performance in long-horizon or information-centric RL settings. Ablations routinely show that removing the IG reward or outcome anchor degrades sample efficiency and final performance (Wang et al., 16 Oct 2025, Kong et al., 28 Feb 2026).

6. Implementation Considerations and Variants

IGPO approaches are model-agnostic and can be coupled with various RL algorithms (PPO, ACER, DDQN, GRPO). Critical practical details include:

Reward Normalization: Within-group standardization of per-turn rewards and advantages stabilizes learning and prevents reward hacking.
Policy Regularization: KL penalties to reference policies (e.g., SFT or previous policy snapshot) control exploration/exploitation tradeoffs and limit catastrophic drift.
Overhead: Most IG estimators require extra forward passes (e.g., for counterfactuals or clustering). Efficient batching and teacher-forcing reduce this cost, though the reported overhead varies widely: under 6.4% wall-clock increase in IG-Search versus 1.6× in InfoPO (Liang et al., 16 Apr 2026, Kong et al., 28 Feb 2026).
Intrinsic–Extrinsic Fusion: Adaptive gating (variance-based or task-specific) prevents reward hacking and ensures information seeking remains subordinate to task completion when outcome variance is predictive (Kong et al., 28 Feb 2026).
Hyperparameter Sensitivity: Excessive weighting of IG can lead to over-exploration and degrade performance on the main task; ablation studies guide optimal λ/β choices (Hu et al., 31 Jan 2026).
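To make the fusion and weighting items concrete, the sketch below shows one hypothetical variance-based gate. It assumes a simple rule: when all outcomes in a rollout group are (nearly) identical, the outcome advantage carries no signal and the IG advantage is used at full weight; otherwise the IG advantage is scaled down by a weight `lam`. The gate form, `lam`, and `eps` are illustrative assumptions and do not reproduce the gating used in InfoPO.

```python
import statistics

def fused_advantage(outcome_adv, ig_adv, group_outcomes, lam=0.5, eps=1e-8):
    """Combine outcome and IG advantages with a variance-dependent gate.

    group_outcomes: outcome rewards of all rollouts sampled for the same query.
    lam: hypothetical IG weight used when outcomes are informative.
    """
    variance = statistics.pvariance(group_outcomes)
    gate = 1.0 if variance < eps else lam  # collapsed group: rely on IG alone
    return outcome_adv + gate * ig_adv

# All rollouts succeeded, so the outcome advantage is zero and IG drives learning.
collapsed = fused_advantage(0.0, 0.8, [1.0, 1.0, 1.0, 1.0])
# Mixed outcomes: the outcome advantage leads and IG is down-weighted.
mixed = fused_advantage(1.0, 0.8, [1.0, 0.0, 1.0, 0.0])
```

Setting `lam` too high in such a scheme would let information seeking compete with task completion, which is the over-exploration risk noted above.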

7. Applications and Variants Across Domains

IGPO has been applied to a wide array of RL contexts:

LLM-based Multi-Turn Reasoning and Search: IGPO enables dense credit assignment for web-search-augmented agents, guidance for query generation, and robust multi-hop QA performance (Wang et al., 16 Oct 2025, Liang et al., 16 Apr 2026, Hu et al., 31 Jan 2026).
User/Environment-Interactive Agents: InfoPO operationalizes IG-based credit in interactive, partially-specified user requests, collaborative programming, and customer-support dialogue (Kong et al., 28 Feb 2026).
Hierarchical Dialogue and Slot-Filling: FeudalGain delivers per-question JS divergence IG rewards for slot-based sub-policies, decoupling learning dynamics for information-gathering (Geishauser et al., 2021).
Hybrid Online/Offline RL with Multiple Simulators: IG per unit cost governs simulator/budget selection in hybrid RL, guaranteeing no-regret learning under budget constraints (Sifaou et al., 18 Sep 2025).
Semantic Epistemic Progress through Clustering: InfoReasoner unifies semantic entropy-based IG estimation—via clustering over bidirectional textual entailment—for agentic retrieval policies (Hu et al., 31 Jan 2026).
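The idea of selecting a simulator by information gain per unit cost, mentioned above for hybrid RL, can be sketched as a simple greedy rule. How the information gain of each fidelity level is estimated is not described here, so the estimates, costs, and names below are hypothetical inputs.

```python
def pick_simulator(candidates, remaining_budget):
    """Pick the affordable candidate with the highest estimated IG per unit cost.

    candidates: list of (name, estimated_ig, cost) tuples with cost > 0.
    Returns the chosen name, or None if nothing fits the remaining budget.
    """
    affordable = [c for c in candidates if 0 < c[2] <= remaining_budget]
    if not affordable:
        return None
    return max(affordable, key=lambda c: c[1] / c[2])[0]

# Hypothetical fidelity levels: (name, estimated IG per batch, cost per batch).
simulators = [("low", 0.2, 1.0), ("mid", 0.9, 3.0), ("high", 2.0, 10.0)]
choice = pick_simulator(simulators, remaining_budget=20.0)
```

With these numbers the mid-fidelity simulator wins (0.3 IG per unit cost against 0.2 for the others); once the budget drops below its cost, the rule falls back to the cheapest level.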

A plausible implication is that IGPO-style intrinsic rewards are broadly conducive to RL scenarios where reasoning, exploration, or multi-step evidence acquisition is essential, provided suitable feedback channels (likelihoods, entropies, beliefs) can be computed.

References:

(Wang et al., 16 Oct 2025): Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
(Liang et al., 16 Apr 2026): IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
(Kong et al., 28 Feb 2026): InfoPO: Information-Driven Policy Optimization for User-Centric Agents
(Hu et al., 31 Jan 2026): Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward
(Geishauser et al., 2021): What Does The User Want? Information Gain for Hierarchical Dialogue Policy Optimisation
(Sifaou et al., 18 Sep 2025): Multi-Fidelity Hybrid Reinforcement Learning via Information Gain Maximization