Information-Directed Sampling (IDS) Policies
- IDS is a decision framework that optimizes an information ratio to balance exploitation (minimizing immediate regret) with exploration (maximizing information gain).
- It is applicable across diverse settings such as bandits, reinforcement learning, and multi-agent games, achieving near-optimal regret bounds by adapting to problem structure.
- Algorithmic variants use surrogate metrics like variance or KL-based bonuses to ensure computational tractability while effectively handling complex feedback and high-dimensional data.
Information-Directed Sampling (IDS) policies form a principled class of online decision strategies that balance exploitation—minimizing immediate regret—with targeted exploration, by optimizing the trade-off between predicted instantaneous regret and information gain. IDS provides a unifying conceptual and algorithmic framework across Bayesian and frequentist settings for multi-armed bandits, linear bandits, high-dimensional and structured bandits, graph-structured feedback, non-stationary environments, contextual decision problems, (partial) monitoring, and reinforcement learning, including multi-agent settings. IDS-based methods admit regret bounds that match or improve upon minimax rates, often adapting automatically to effective problem complexity and information structure.
1. Core Principle and Mathematical Formulation
Information-Directed Sampling is grounded in the optimization of an information ratio at each time step. At time $t$, given the observed data $\mathcal{F}_{t-1}$, let $\Delta_t(a)$ denote the expected one-step regret of action $a$, i.e., the difference between the expected reward of the optimal action and that of $a$, under the current belief or confidence set. Let $g_t(a)$ denote the expected information gain from selecting $a$, formalized as the mutual information between the optimal action (or outcome) and the next observation. IDS selects a randomized distribution $\pi$ over actions to minimize the information ratio:

$$\Psi_t(\pi) \;=\; \frac{\big(\sum_a \pi(a)\,\Delta_t(a)\big)^2}{\sum_a \pi(a)\,g_t(a)},$$
and samples $A_t \sim \pi_t^{\mathrm{IDS}} = \arg\min_\pi \Psi_t(\pi)$ (Russo et al., 2014). For finite action sets (including the two-armed bandit as a special case), the minimizer has support on at most two actions (Zhou, 2015).
The information gain is always chosen relative to a learning objective, e.g., the identity of the optimal action $A^*$, the full model parameter $\theta$, or a rate-distortion-compressed "surrogate" of the environment (Hao et al., 2022). This flexibility makes IDS amenable to a variety of statistical decision domains.
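For finite action sets, the support-two property makes the minimization explicit: it suffices to search over pairs of actions and a single mixing weight. The following Python sketch is a minimal illustration (not a reference implementation from any cited paper); the grid resolution and the input estimates `delta` and `gain` are assumptions.

```python
import numpy as np

def ids_distribution(delta, gain, grid=1001):
    """Minimize (sum_a pi(a) delta_a)^2 / (sum_a pi(a) g_a) over distributions
    supported on at most two actions (cf. Russo et al., 2014; Zhou, 2015).
    delta: per-action expected regret; gain: per-action expected information
    gain (assumed nonnegative, not all zero)."""
    delta, gain = np.asarray(delta, float), np.asarray(gain, float)
    K = len(delta)
    q = np.linspace(0.0, 1.0, grid)            # mixing weight on the first action
    best_ratio, best = np.inf, (0, 0, 1.0)
    for i in range(K):
        for j in range(K):
            d = q * delta[i] + (1 - q) * delta[j]   # mixed expected regret
            g = q * gain[i] + (1 - q) * gain[j]     # mixed information gain
            ratio = np.where(g > 0, d ** 2 / np.maximum(g, 1e-12), np.inf)
            ratio = np.where(d == 0, 0.0, ratio)    # zero regret gives ratio 0
            k = int(np.argmin(ratio))
            if ratio[k] < best_ratio:
                best_ratio, best = ratio[k], (i, j, q[k])
    i, j, w = best
    pi = np.zeros(K)
    pi[i] += w
    pi[j] += 1 - w
    return pi, best_ratio

# Example: action 0 looks best now but is uninformative; action 2 is informative.
pi, ratio = ids_distribution(delta=[0.05, 0.20, 0.15], gain=[0.001, 0.010, 0.200])
action = np.random.default_rng(0).choice(len(pi), p=pi)
```

The resulting distribution typically mixes a low-regret action with a high-information one, which is exactly the exploitation–exploration trade-off the ratio encodes.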
2. Regret Analysis and Minimax Rates
An essential feature of IDS is that minimizing the information ratio, under appropriate conditions, yields sublinear Bayesian or worst-case regret that matches classical lower bounds up to logarithmic factors. In the standard Bayesian bandit model, Russo and Van Roy established that if $\Psi_t(\pi_t) \le \bar{\Psi}$ uniformly in $t$, the expected cumulative Bayesian regret over $T$ rounds satisfies

$$\mathbb{E}\big[\mathrm{Regret}(T)\big] \;\le\; \sqrt{\bar{\Psi}\, H(A^*)\, T},$$

where $H(A^*)$ is the prior entropy of the optimal action (Russo et al., 2014, Zhou, 2015).
In more general models:
- Linear bandits: $\Psi_t \le d/2$, where $d$ is the dimension; hence regret $O(\sqrt{d\,T\,\log|\mathcal{A}|})$ (Russo et al., 2014).
- Graph feedback bandits: regret $\tilde{O}(\sqrt{\chi(G)\,T})$, where $\chi(G)$ is the clique cover number of the feedback graph (Liu et al., 2017).
- Partial monitoring: regret $\tilde{O}(\sqrt{T})$ or $\tilde{O}(T^{2/3})$ depending on local/global observability (Chakraborty et al., 2023, Kirschner et al., 2020).
- Structured/sparse linear bandits: adapts to effective sparsity, attaining regret that scales with the sparsity level $s$ rather than the ambient dimension $d$ (Schwartz et al., 28 Oct 2025).
- RL in finite MDPs: with suitable information targets, Bayesian regret bounds of order $\tilde{O}(\sqrt{T})$ with polynomial dependence on the number of states, actions, and horizon hold in the tabular case (Hao et al., 2022), with dependence on the feature dimension rather than the state count in linear MDPs (Hao et al., 2022); further improvements hold for RLHF (Qi et al., 8 Feb 2025).
- Multi-agent zero-sum Markov games: $\tilde{O}(\sqrt{T})$ Bayesian regret in episodic games, matching the information-theoretic lower bound up to logarithmic factors (Zhang et al., 30 Apr 2024).
The analysis exploits the property that the total information gain across rounds is globally bounded (by the problem entropy), and Cauchy-Schwarz links the sum of regrets to the sum of information gains (Russo et al., 2014, Hao et al., 2022).
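Schematically, with $\Delta_t$ the expected per-round regret of the IDS action distribution, $g_t$ its expected information gain about the target $A^*$, and $\Psi_t = \Delta_t^2 / g_t \le \bar{\Psi}$, the chain of inequalities reads

$$\mathbb{E}\Big[\sum_{t=1}^{T} \Delta_t\Big] \;\le\; \sqrt{\mathbb{E}\Big[\sum_{t=1}^{T} \frac{\Delta_t^2}{g_t}\Big]\;\mathbb{E}\Big[\sum_{t=1}^{T} g_t\Big]} \;\le\; \sqrt{\bar{\Psi}\, T\, H(A^*)},$$

where the first step is Cauchy-Schwarz and the second uses the ratio bound together with the fact that the cumulative expected information gain about $A^*$ cannot exceed its prior entropy $H(A^*)$.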
3. Algorithmic Instantiations and Structural Models
Numerous algorithmic variants of IDS have been developed for practical and theoretical tractability:
- Bayesian IDS directly optimizes the canonical information ratio with mutual information or, when intractable, variance-based surrogates (variance of posterior-mean rewards) (Russo et al., 2014); a Monte Carlo sketch of the variance surrogate appears at the end of this section.
- Frequentist IDS replaces Bayesian probabilities with high-probability confidence sets (ellipsoidal or empirical Bernstein bounds); information gains may target parameter uncertainty or action gaps (Kirschner et al., 2020, Kirschner et al., 2018).
- Sparse and Structured IDS uses sparsity-promoting priors (e.g., spike-and-slab) for high-dimensional bandits (Hao et al., 2021) or optimistic posteriors and learning-rate schedules for frequentist adaptivity (Schwartz et al., 28 Oct 2025).
- Graph-structured and Feedback-aware IDS incorporates the observation graph into the information gain, scaling regret with clique cover rather than action count (Liu et al., 2017).
- Partial Monitoring IDS extends the information ratio to general observation/reward mappings; minimax-optimal across all observability regimes (Kirschner et al., 2020, Chakraborty et al., 2023).
- Reinforcement Learning IDS optimizes policy-level information ratio: selecting policies to minimize per-episode squared expected regret over information gain about the environment (or a surrogate). Regularized or additive forms with tractable KL-bonus reward modifications are used in practice (Hao et al., 2022, Qi et al., 8 Feb 2025).
- Alternative Exploration Metrics: When mutual information is intractable, surrogates include posterior variance (variance IDS), rate-distortion compressed learning targets, or Stein-based information metrics in model-based RL (Chakraborty et al., 2023).
- Contextual and Multi-agent Extensions: IDS is extended to contextual bandits with globally optimal regret scaling under context-weighted information ratios (Hao et al., 2022); multi-agent IDS employs (joint and marginal) information ratios over joint policy spaces and targets Nash or CCE learning (Zhang et al., 30 Apr 2024).
Algorithmically, the minimization over action distributions is always convex; for finite actions, the optimizer has support on at most two actions (Russo et al., 2014, Liu et al., 2017). Regularized IDS and actor-critic approximations offer computationally scalable alternatives for large policy spaces (Hao et al., 2022, Hao et al., 2022).
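As an illustration of the variance-based surrogate, the following Python sketch estimates per-action expected regret and the variance-based information gain $v_t(a) = \mathrm{Var}_{A^*}\!\big(\mathbb{E}[\theta_a \mid A^*]\big)$ from posterior samples. The Beta-Bernoulli posterior and the Monte Carlo sample size `m` are illustrative assumptions, not choices prescribed by the cited papers; the resulting estimates can be fed to an information-ratio minimizer such as the one sketched in Section 1.

```python
import numpy as np

def variance_ids_estimates(alpha, beta, m=10_000, rng=None):
    """Monte Carlo estimates for variance-IDS in a Beta-Bernoulli bandit.
    alpha, beta: arrays of Beta posterior parameters, one entry per arm.
    Returns (delta, v): per-arm expected regret and variance-based info gain."""
    rng = np.random.default_rng(rng)
    theta = rng.beta(alpha, beta, size=(m, len(alpha)))  # posterior samples, shape (m, K)
    a_star = theta.argmax(axis=1)                        # sampled optimal arm per draw

    mean_all = theta.mean(axis=0)                        # E[theta_a]
    delta = theta.max(axis=1).mean() - mean_all          # E[theta_{A*}] - E[theta_a]

    # v_a = Var over A* of E[theta_a | A* = j]  (variance-based information gain)
    K = len(alpha)
    v = np.zeros(K)
    for j in range(K):
        mask = a_star == j
        p_j = mask.mean()
        if p_j > 0:
            cond_mean = theta[mask].mean(axis=0)         # E[theta | A* = j]
            v += p_j * (cond_mean - mean_all) ** 2
    return delta, v

# Example with three arms; delta and v then feed the information-ratio minimization.
delta, v = variance_ids_estimates(alpha=np.array([2.0, 10.0, 1.0]),
                                  beta=np.array([3.0, 8.0, 1.0]), rng=0)
```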
4. Key Theoretical and Practical Insights
- Automatic Adaptivity: IDS adaptively balances exploration and exploitation depending on the tightness of the information structure—focusing exploration on options with high impact on uncertainty about optimal decisions, often outperforming UCB or Thompson Sampling in simulations and theory (Russo et al., 2014, Liu et al., 2017, Hao et al., 2021).
- Instance-Optimality: IDS policies achieve near-optimal regret scaling in both data-poor (early, high uncertainty) and data-rich (late, small uncertainty) regimes without manual regime adaptation (Hao et al., 2021, Schwartz et al., 28 Oct 2025).
- Graph/Feedback Structure: By integrating observation graphs or feedback matrices, IDS bounds and empirical performance improve from scaling with the number of actions to scaling with the clique cover number of the feedback graph (Liu et al., 2017).
- Computational Tractability: For large or continuous action spaces, mutual information is replaced by algorithmically tractable surrogates (variance, empirical information gain, Stein discrepancy, etc.), and solved via convex programs or Mixture/Monte Carlo approximation (Chakraborty et al., 2023, Qi et al., 8 Feb 2025).
- Surrogate and Regularized Objectives: In large-scale RL, maximizing value plus a KL-based information bonus (as in regularized IDS) yields the same order of regret as the harder ratio form (Hao et al., 2022, Qi et al., 8 Feb 2025); a schematic objective is given after this list.
- Contextual Anticipation: Contextual IDS that incorporates the context distribution (not just the current context) sometimes substantially outperforms "conditional" myopic variants, especially in tasks that require information to generalize across future contexts (Hao et al., 2022).
- Multi-agent Information Targets: IDS in Markov games requires optimizing joint and marginal information ratios over joint policies to efficiently converge to Nash (or CCE) equilibria, and can leverage rate-distortion to reduce learning targets and computation (Zhang et al., 30 Apr 2024).
- Non-stationarity, Norm-agnosticity, and Practicalities: IDS has been generalized to dynamic environments (with change-point/shift detection and recovery) (Liu et al., 2022), as well as norm-agnostic settings where parameter norm bounds are unknown and adaptively estimated online (Suder et al., 7 Mar 2025).
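Schematically, for the regularized form referenced above (notation illustrative: $V^{\pi}$ denotes the expected value of policy $\pi$ in the current episode, $\mathcal{I}_k(\pi)$ the expected information gain about the environment or a compressed surrogate, and $\lambda_k > 0$ a tuning sequence), the ratio objective is replaced by an additive one:

$$\pi_k \;=\; \arg\max_{\pi}\; \mathbb{E}\big[V^{\pi}\big] \;+\; \lambda_k\, \mathcal{I}_k(\pi).$$

For suitable choices of $\lambda_k$, the cited analyses show this additive objective attains regret of the same order as ratio-form IDS, while the information bonus can be implemented as a KL-based reward modification compatible with standard policy optimization.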
5. Applications and Empirical Outcomes
IDS methods have demonstrated robust empirical superiority or parity over classic approaches (UCB, Thompson Sampling, etc.) in a wide array of settings:
| Domain | IDS Regret vs. Baselines | Notes |
|---|---|---|
| Multi-armed bandits (Bernoulli/Gaussian) | 15–30% lower than TS/UCB (Russo et al., 2014) | Near-optimal $O(\sqrt{T})$ scaling |
| Sparse/high-d bandits | 20–50% lower than TS/LinUCB/ETC (Hao et al., 2021) | Adapts to sparsity $s$ rather than ambient dimension $d$ |
| Graph feedback bandits | 30–50% lower than TS/UCB/Exp3 (Liu et al., 2017) | Regret scales with clique cover number, not action-set size |
| Partial monitoring (various observability) | Achieves lower bound scaling (Chakraborty et al., 2023) | |
| RL (tabular, RLHF, GFlowNet, MBRL) | Outperforms TS and OFU in regret and sample efficiency (Qi et al., 8 Feb 2025, Chakraborty et al., 2023, Chakraborty et al., 2023) | Sublinear Bayesian regret, tractable Approximate-IDS |
| Dynamic pricing (nonstationary market) | Lower cumulative regret, faster recovery (Liu et al., 2022) | Rigorous shift/adaptation procedures |
| Multi-agent RL (zero-sum MGs) | Achieves $\tilde{O}(\sqrt{T})$ Bayes regret (Zhang et al., 30 Apr 2024) | Matches information-theoretic bound |
Empirically, IDS methods show consistent early-round and long-horizon improvements over Thompson Sampling or UCB, especially in scenarios where exploration must be efficiently targeted or where problem structure can be exploited.
6. Extensions, Limitations, and Future Directions
- Surrogate Information Metrics: IDS performance critically depends on the choice and tractability of the information gain metric. In practical RL/large state-action spaces, mutual information is commonly approximated by surrogates (variance, Stein discrepancy, compressed environments, etc.) (Chakraborty et al., 2023, Qi et al., 8 Feb 2025). Accurate estimation remains computationally intensive.
- Computational Complexity: For large action or policy spaces, optimization (or even Monte Carlo approximation) of the information ratio becomes a bottleneck. Support-2 structure alleviates but does not remove this issue.
- Beyond Bandits and RL: Extensions of IDS have begun to address best-arm identification (You et al., 2022), pure exploration, dueling bandits, combinatorial action sets, non-stationarity, and multi-agent learning. In particular, multi-objective and CCE/NE learning uses joint information ratios as learning objectives (Zhang et al., 30 Apr 2024).
- Theory–Computation Gap: Theoretically optimal IDS policies are often computationally intractable in rich RL environments. Regularized/additive formulations, context-weighted or surrogate object-based approximations, and deep actor–critic variants have been proposed (Hao et al., 2022, Qi et al., 8 Feb 2025), but regret analysis under practical constraints remains an active research area.
- Unification and Primal–Dual Connections: IDS can be interpreted as a primal–dual approach to the information-constrained regret minimization problem, aligning with lower-bound programs and dual variable allocation in asymptotic analyses (Kirschner et al., 2020).
- Open Problems: Efficient IDS in continuous and infinite action spaces, optimal surrogate metrics, interaction with prior-misspecification and model selection, and hybrid RL/SL scenarios (e.g., RLHF for LLMs) are active topics (Qi et al., 8 Feb 2025).
7. Selected Algorithmic Prototypes and Practical Guidelines
IDS algorithmic instantiations typically follow this paradigm (a minimal end-to-end code sketch follows the list):
- Belief/Confidence Update: Maintain posterior or confidence set over model parameters or environments (e.g., via Bayesian inference, empirical Bayes, or least-squares estimation).
- Regret and Information Gain Estimation: For each candidate action or policy (or distribution thereon), estimate expected one-step regret and (approximate) information gain.
- Optimization of Information Ratio: Solve, often by convex programming or enumeration on small supports, for the distribution or decision minimizing the information ratio.
- Action/Policy Selection and Execution: Sample according to the resultant distribution, observe feedback, and update belief.
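The following Python sketch ties these four steps together for a Bayesian Bernoulli bandit. All implementation choices (Beta priors, Monte Carlo sample sizes, grid-based ratio minimization, the toy arm means) are illustrative assumptions rather than prescriptions from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.35, 0.50, 0.45])         # unknown to the learner
K, T, M = len(true_means), 500, 2000
alpha, beta = np.ones(K), np.ones(K)              # Beta(1, 1) priors per arm

def ids_two_point(delta, gain, grid=501):
    """Support-2 minimization of (pi . delta)^2 / (pi . gain)."""
    q = np.linspace(0.0, 1.0, grid)
    best, pick = np.inf, (0, 0, 1.0)
    for i in range(K):
        for j in range(K):
            d = q * delta[i] + (1 - q) * delta[j]
            g = q * gain[i] + (1 - q) * gain[j]
            r = np.where(g > 0, d ** 2 / np.maximum(g, 1e-12), np.inf)
            r = np.where(d == 0, 0.0, r)
            k = int(np.argmin(r))
            if r[k] < best:
                best, pick = r[k], (i, j, q[k])
    i, j, w = pick
    pi = np.zeros(K)
    pi[i] += w
    pi[j] += 1 - w
    return pi

for t in range(T):
    # 1. Belief update is carried by (alpha, beta); draw posterior samples.
    theta = rng.beta(alpha, beta, size=(M, K))
    a_star = theta.argmax(axis=1)
    mean_all = theta.mean(axis=0)

    # 2. Estimate expected one-step regret and variance-based information gain.
    delta = theta.max(axis=1).mean() - mean_all
    gain = np.zeros(K)
    for j in range(K):
        mask = a_star == j
        if mask.any():
            gain += mask.mean() * (theta[mask].mean(axis=0) - mean_all) ** 2

    # 3. Minimize the information ratio over two-point distributions.
    pi = ids_two_point(delta, gain)

    # 4. Sample an action, observe a reward, and update the posterior.
    a = rng.choice(K, p=pi)
    r = rng.binomial(1, true_means[a])
    alpha[a] += r
    beta[a] += 1 - r
```

The loop illustrates how exploration is directed toward actions whose outcomes are informative about the identity of the optimal arm, while play concentrates on low-regret arms as uncertainty shrinks.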
For sparse or high-dimensional regimes, use empirical Bayes or spike-and-slab priors for efficient posterior sampling (Hao et al., 2021, Schwartz et al., 28 Oct 2025). For RL or adversarial/multi-agent environments, construct regularized reward functions with KL or surrogate information bonuses (Chakraborty et al., 2023, Zhang et al., 30 Apr 2024). When computationally constrained, adopt approximate policies (e.g., using neural actors/critics), or employ additive regularized-IDS objectives that upper-bound the ratio-based regret (Hao et al., 2022, Qi et al., 8 Feb 2025).
References:
- (Russo et al., 2014) "Learning to Optimize via Information-Directed Sampling"
- (Zhou, 2015) "A Note on Information-Directed Sampling and Thompson Sampling"
- (Liu et al., 2017) "Information Directed Sampling for Stochastic Bandits with Graph Feedback"
- (Kirschner et al., 2020) "Asymptotically Optimal Information-Directed Sampling"
- (Hao et al., 2021) "Information Directed Sampling for Sparse Linear Bandits"
- (Hao et al., 2022) "Regret Bounds for Information-Directed Reinforcement Learning"
- (You et al., 2022) "Information-Directed Selection for Top-Two Algorithms"
- (Liu et al., 2022) "Non-Stationary Dynamic Pricing Via Actor-Critic Information-Directed Pricing"
- (Xu et al., 2022) "Adaptive Sampling for Discovery"
- (Chakraborty et al., 2023) "STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning"
- (Zhang et al., 30 Apr 2024) "Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning"
- (Qi et al., 8 Feb 2025) "Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling"
- (Schwartz et al., 28 Oct 2025) "Sparse Optimistic Information Directed Sampling"
- (Hirling et al., 23 Dec 2025) "Information-directed sampling for bandits: a primer"
- (Kirschner et al., 2018) "Information Directed Sampling and Bandits with Heteroscedastic Noise"
- (Kirschner et al., 2020) "Information Directed Sampling for Linear Partial Monitoring"
- (Kirschner et al., 2023) "Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications"
- (Hao et al., 2022) "Contextual Information-Directed Sampling"