
Information-Theoretic Action Selection

Updated 15 December 2025
  • Information-Theoretic Action Selection is a framework where agents optimize criteria such as expected information gain and mutual information alongside utility to balance exploration and exploitation.
  • It integrates methodologies from bandit learning, reinforcement learning, Bayesian planning, and active inference to derive formal exploration-exploitation strategies with practical regret bounds.
  • The approach has broad applications in decentralized multi-agent systems and robotic perception, employing techniques like negative entropy rewards and submodular selection for efficient decision-making.

Information-theoretic action selection refers to algorithms and theoretical frameworks in which agents choose actions by explicitly optimizing information-theoretic criteria (such as expected information gain, mutual information, or rate-distortion), either alone or in combination with classical utility. This paradigm encompasses action selection for exploration, Bayesian experimental design, model-based active inference, decentralized perception, and multi-agent coordination. Research in bandit learning, reinforcement learning, Bayesian planning, and robot perception converges on the use of these metrics to formally balance epistemic (exploratory, uncertainty-reducing) and pragmatic (reward-seeking) objectives.

1. Fundamental Criteria: Information Gain, Mutual Information, and Expected Free Energy

Information-theoretic action selection is anchored in the maximization of information gain, typically expressed as reductions in uncertainty (entropy) about latent variables of interest. For a system with latent state variable $s$ and observable outcome $o$ under a candidate policy $\pi$, the expected free energy framework provides a unified formalism:

$$G(\pi) = \mathbb{E}_{Q(o,s|\pi)} \left[\ln Q(s|\pi) - \ln P(o,s)\right]$$

This admits several equivalent decompositions:

  • Risk + Ambiguity:

$$G(\pi) = D_{KL}\left[Q(s|\pi)\,\|\,P(s)\right] + \mathbb{E}_{Q(s|\pi)} \left[ H[P(o|s)] \right]$$

where the first term penalizes deviation from preferred states and the second captures expected outcome entropy (ambiguity).

  • Expected Utility – Information Gain:

$$G(\pi) = -\mathbb{E}_{Q(o|\pi)} [\ln P(o)] - \mathbb{E}_{Q(o,s|\pi)} \left[ \ln \frac{Q(s|o,\pi)}{Q(s|\pi)} \right]$$

The first term corresponds to extrinsic value (expected utility), while the second is the expected information gain (intrinsic epistemic value).

Action selection rules derived from these criteria naturally interpolate between pure reward-seeking, pure exploration, and hybrid strategies:

| Selection Principle | Formal Rule | Asymptotic Behavior |
|---|---|---|
| Active Inference | $\pi^* = \arg\min_\pi G(\pi)$ | Balanced exploration/exploitation |
| Bayesian Optimal Design | $\pi^*_{IG} = \arg\max_\pi I(s; o \mid \pi)$ (if $U = 0$) | Intrinsically motivated exploration |
| Bayesian Decision Theory | $\pi^*_{EU} = \arg\max_\pi \mathbb{E}[U(o)]$ (if ambiguity $\rightarrow 0$) | Pure exploitation |

Selection between these regimes governs emergent exploration–exploitation trade-offs (Sajid et al., 2021).
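
As a concrete illustration, the sketch below evaluates the second decomposition of $G(\pi)$ above (extrinsic value plus epistemic value) for a small discrete generative model and picks $\pi^* = \arg\min_\pi G(\pi)$. The likelihood matrix, preference vector, and candidate predictive state distributions are hypothetical, chosen only to show how the utility and information-gain terms enter the score.

```python
import numpy as np

def expected_free_energy(Q_s, A, log_C):
    """One-step G(pi) = -E_{Q(o)}[ln P(o)] - E_{Q(o,s)}[ln Q(s|o)/Q(s)].
    Q_s:   predictive state distribution Q(s|pi), shape (S,)
    A:     likelihood P(o|s), columns sum to 1, shape (O, S)
    log_C: log preferences over outcomes ln P(o), shape (O,)"""
    eps = 1e-16
    Q_o = A @ Q_s                                    # predictive outcomes Q(o|pi)
    extrinsic = Q_o @ log_C                          # expected utility
    joint = A * Q_s[None, :]                         # Q(o, s | pi)
    post = joint / (Q_o[:, None] + eps)              # Q(s | o, pi)
    epistemic = np.sum(joint * (np.log(post + eps) - np.log(Q_s + eps)[None, :]))
    return -extrinsic - epistemic                    # lower G is better

# Hypothetical two-state, two-outcome model and two candidate policies,
# each summarized by the predictive state distribution it induces.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])                           # P(o|s)
log_C = np.log(np.array([0.7, 0.3]))                 # preference for outcome 0
policies = {"pi_1": np.array([0.5, 0.5]),
            "pi_2": np.array([0.9, 0.1])}
G = {name: expected_free_energy(Q_s, A, log_C) for name, Q_s in policies.items()}
print(G, "->", min(G, key=G.get))                    # pi* = argmin_pi G(pi)
```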

2. Information-Theoretic Bounds in Bandit and RL Settings

Bayesian and nonstationary bandit problems illustrate the power and limitations of information-theoretic selection:

  • Classical Entropy-based Regret Bounds: Russo and Van Roy's regret bounds for Bayesian bandit algorithms are proportional to $\sqrt{\Gamma H(A^*) T}$, where $H(A^*)$ is the entropy of the optimal arm under the prior and $\Gamma$ is the information ratio (Dong et al., 2018). However, $H(A^*)$ can grow without bound in high-cardinality or continuous-action settings.
  • Rate-Distortion-based Regret Bounds: By quantifying the minimal mutual information needed to select a near-optimal action (given a distortion tolerance $\epsilon$), rate-distortion theory yields much tighter regret bounds:

$$\mathrm{BayesRegret}(T) \leq \sqrt{\overline{\Gamma}\, R(\epsilon)\, T} + \epsilon T$$

where $R(\epsilon)$ is the rate-distortion function:

$$R(\epsilon) = \min_{q(\hat\theta|\theta)} I(\theta; \hat\theta) \quad \text{s.t.} \quad \mathbb{E}[d(\theta, \hat\theta)] \leq \epsilon$$

In linear and logistic bandits, this leads to $O(d\sqrt{T \log T})$ regret in dimension $d$, independent of the cardinality of the action set (Dong et al., 2018); a numerical sketch of the rate-distortion computation follows this list.

  • Entropy Rate for Nonstationary Environments: For temporally evolving optimal actions, the per-step entropy rate $\bar{H}_\infty(A^*)$ of the latent optimal-action process tightly controls the achievable regret:

$$\bar\Delta_\infty(\pi) \leq \sqrt{\Gamma(\pi)\, \bar{H}_\infty(A^*)}$$

Efficient algorithms like Thompson Sampling achieve near-optimal regret if and only if the process $A^*$ remains sufficiently predictable (Min et al., 2023).
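
To make the rate-distortion bound above concrete, the sketch below computes one point on $R(\epsilon)$ with a standard Blahut-Arimoto iteration for a toy discrete source and 0-1 distortion. The source distribution, distortion matrix, and multiplier value are hypothetical; the intent is only to show how a rate-distortion value can be obtained and plugged into $\sqrt{\overline{\Gamma} R(\epsilon) T} + \epsilon T$.

```python
import numpy as np

def blahut_arimoto(p_x, dist, beta, n_iter=200):
    """One point on the rate-distortion curve for source p_x and distortion
    matrix dist, at Lagrange multiplier beta (larger beta -> lower distortion)."""
    n, m = dist.shape
    q_y = np.full(m, 1.0 / m)                        # marginal over reproductions
    for _ in range(n_iter):
        q_y_x = q_y[None, :] * np.exp(-beta * dist)  # q(y|x) ∝ q(y) exp(-beta d(x,y))
        q_y_x /= q_y_x.sum(axis=1, keepdims=True)
        q_y = p_x @ q_y_x                            # update marginal
    rate = np.sum(p_x[:, None] * q_y_x * np.log(q_y_x / q_y[None, :] + 1e-12))  # I(X;Y), nats
    distortion = np.sum(p_x[:, None] * q_y_x * dist)
    return rate, distortion

# Toy example: theta uniform over 4 "environments", 0-1 distortion between
# theta and its compressed statistic hat_theta.
p_theta = np.full(4, 0.25)
d = 1.0 - np.eye(4)
R, D = blahut_arimoto(p_theta, d, beta=3.0)
print(f"rate ≈ {R:.3f} nats, distortion ≈ {D:.3f}")
```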

3. Algorithmic Realizations in RL and Perception

3.1 Model-Free Information-Theoretic RL

Reinforcement learning can directly embed negative entropy or information gain as the reward signal:

$$r_t = - H(y_{t+1} \mid z_{1:t+1}, x_{1:t+1})$$

Such formulations support deep Q-network-based agents that learn action selection for active information acquisition, matching or exceeding the performance of model-based planners in multi-target tracking and perception tasks, while remaining model-free and not tied to a fixed planning horizon (Jeong et al., 2019).
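
A minimal sketch of such a reward signal, assuming a linear-Gaussian sensing model and a Kalman update for the belief (both hypothetical stand-ins for the tracked-target belief dynamics): the reward of a candidate sensing action is the negative differential entropy of the resulting posterior, which a DQN-style agent could then learn to maximize.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(mu, cov) in nats: 0.5 * log((2*pi*e)^d * det(cov))."""
    d = cov.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

def information_reward(prior_cov, H_a, R_a):
    """Negative posterior entropy after a candidate measurement y = H_a x + v, v ~ N(0, R_a)."""
    S = H_a @ prior_cov @ H_a.T + R_a                 # innovation covariance
    K = prior_cov @ H_a.T @ np.linalg.inv(S)          # Kalman gain
    post_cov = (np.eye(prior_cov.shape[0]) - K @ H_a) @ prior_cov
    return -gaussian_entropy(post_cov)

# Hypothetical 2-D target state; two candidate sensing actions with equal noise.
P = np.diag([4.0, 1.0])
actions = {"sense_x": (np.array([[1.0, 0.0]]), np.array([[0.1]])),
           "sense_y": (np.array([[0.0, 1.0]]), np.array([[0.1]]))}
rewards = {a: information_reward(P, H, R) for a, (H, R) in actions.items()}
print(rewards)   # the more informative action (along the uncertain axis) scores higher
```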

3.2 Multi-Agent and Decentralized Perception

In decentralized multi-agent systems (Dec-POMDPs), the selection of both agents and agent policies is optimized by maximizing the mutual information between a latent variable $X$ (e.g., the world trajectory or a secret) and the joint observation histories, using a two-layer greedy-submodular algorithm (IMAS$^2$):

$$\max_{\substack{\mathcal{K} \subset \mathcal{N},\, |\mathcal{K}|=k \\ \bm{\pi}_\mathcal{K}}} I(X;\, \bm{Y}_\mathcal{K}, M_{\bm{\pi}_\mathcal{K}})$$

Submodularity ensures a $1-1/e$ performance guarantee for the greedy selection of agents and their policies (Shi et al., 22 Oct 2025).
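
The sketch below mirrors the greedy outer layer of this kind of selection under a simplified linear-Gaussian observation model (a stand-in for the Dec-POMDP observation histories; the observation maps and noise level are hypothetical). Mutual information is monotone submodular in the selected set for this model, which is what underwrites the $(1-1/e)$ guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_agents, k = 4, 8, 3

Sigma = np.eye(d)                       # prior covariance of the latent X
H = rng.normal(size=(n_agents, d))      # each agent's linear observation map (hypothetical)
noise_var = 0.5

def mutual_information(idx):
    """I(X; Y_S) for a linear-Gaussian model: 0.5 * logdet(I + H_S Sigma H_S^T / noise_var)."""
    if not idx:
        return 0.0
    Hs = H[list(idx)]
    M = np.eye(len(idx)) + Hs @ Sigma @ Hs.T / noise_var
    return 0.5 * np.linalg.slogdet(M)[1]

# Greedy selection: at each step add the agent with the largest marginal MI gain.
selected = []
for _ in range(k):
    gains = {i: mutual_information(selected + [i]) - mutual_information(selected)
             for i in range(n_agents) if i not in selected}
    selected.append(max(gains, key=gains.get))

print("selected agents:", selected, "MI:", mutual_information(selected))
```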

4. Action Spaces and Information-Theoretic Actuation

Beyond the classical view where the agent’s action is considered an atomic primitive, action selection can be modeled as a compressed coding process:

  • Internal–External Action Decomposition: External actions $a$ are generated from internal bit-sequences $q$ via an arithmetic decoder $D_\rho$ informed by an action model $\rho(a|s)$. The code length $\ell_\rho(a|s) = -\log_2 \rho(a|s)$ directly expresses the information-theoretic cost of emitting an action.
  • Augmented MDPs: An internal MDP $\vartheta$ over internal deliberative states $(s, q)$ admits Bellman equations and value congruence such that optimization at the internal level translates to rational action selection at the external interface:

$$Q_\vartheta^\pi(s, q, b) = Q_\mu^\Pi(s, a)$$

This formalism naturally incorporates sequence-model priors for multitask RL and regularizes exploration via KL divergence to the action prior (Catt et al., 2021).
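
As an illustration of the code-length view and of KL regularization toward an action prior, the sketch below uses a hypothetical discrete action model $\rho(a|s)$: it computes the bit cost $\ell_\rho(a|s)$ of each external action and a soft-optimal policy that trades Q-values against divergence from the prior (the Q-values and temperature are invented for the example, not taken from the cited work).

```python
import numpy as np

# Hypothetical action model rho(a|s) over a small discrete action set.
rho = {"left": 0.5, "right": 0.25, "stay": 0.25}

# Code length (in bits) of emitting each external action through the decoder.
code_len = {a: -np.log2(p) for a, p in rho.items()}   # left: 1.0, right/stay: 2.0

# KL-regularized action selection: trade off Q-value estimates against the
# information cost of deviating from the action prior rho.
Q = {"left": 0.2, "right": 1.0, "stay": 0.1}          # hypothetical Q-estimates
beta = 1.0                                            # trade-off temperature

# Soft-optimal policy pi(a|s) ∝ rho(a|s) * exp(beta * Q(s,a))
logits = {a: np.log(rho[a]) + beta * Q[a] for a in rho}
z = np.logaddexp.reduce(list(logits.values()))
pi = {a: np.exp(l - z) for a, l in logits.items()}
kl = sum(pi[a] * (np.log(pi[a]) - np.log(rho[a])) for a in rho)  # nats
print(code_len, pi, f"KL(pi||rho) = {kl:.3f} nats")
```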

5. Divergence Criteria and Practical Planning Algorithms

In practical robotic perception and active data acquisition, action selection is typically computed by maximizing closed-form information divergences between pre- and post-action beliefs. Under Gaussian approximations, standard metrics include Kullback–Leibler divergence, Rényi divergence, Bhattacharyya distance, Fisher information metric, and Wasserstein distance. One-step look-ahead approximations are used for efficiency:

$$\mathrm{EIG}(a) \approx D\left(\mathcal{N}(\mu^+, \Sigma^+) \,\|\, \mathcal{N}(\mu, \Sigma)\right)$$

Empirical comparisons show that all such criteria can drive rapid convergence to high-precision state estimates with sparse observations, with minor variance in convergence speed and final error (Murali et al., 2021).

| Divergence Metric | Formula (Gaussian case) | Primary Utility |
|---|---|---|
| KL divergence | $\frac{1}{2}\left[\log\frac{\lvert\Sigma_j\rvert}{\lvert\Sigma_i\rvert} - d + \mathrm{tr}(\Sigma_j^{-1}\Sigma_i) + (\mu_i-\mu_j)^\top\Sigma_j^{-1}(\mu_i-\mu_j)\right]$ | Standard MI/I-divergence |
| Rényi divergence | See (Murali et al., 2021) | Parametric sensitivity |
| Bhattacharyya distance | See (Murali et al., 2021) | Robust similarity |
| Wasserstein distance | $\lVert\mu_i - \mu_j\rVert^2 + \mathrm{tr}\left(\Sigma_i + \Sigma_j - 2(\Sigma_i^{1/2}\Sigma_j\Sigma_i^{1/2})^{1/2}\right)$ | Geometric uncertainty |
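
A small sketch of one-step look-ahead selection using the closed-form Gaussian expressions from the table (the belief, the candidate post-action covariances, and the scipy-based matrix square root are assumptions of the example, not a specific planner from the cited work):

```python
import numpy as np
from scipy.linalg import sqrtm

def kl_gaussian(mu_i, Sig_i, mu_j, Sig_j):
    """KL( N_i || N_j ) for multivariate Gaussians, as in the table above."""
    d = mu_i.size
    Sj_inv = np.linalg.inv(Sig_j)
    diff = mu_i - mu_j
    return 0.5 * (np.linalg.slogdet(Sig_j)[1] - np.linalg.slogdet(Sig_i)[1] - d
                  + np.trace(Sj_inv @ Sig_i) + diff @ Sj_inv @ diff)

def w2_gaussian(mu_i, Sig_i, mu_j, Sig_j):
    """Squared 2-Wasserstein distance between Gaussians, as in the table above."""
    root = sqrtm(sqrtm(Sig_i) @ Sig_j @ sqrtm(Sig_i))
    return float(np.sum((mu_i - mu_j) ** 2) + np.trace(Sig_i + Sig_j - 2 * np.real(root)))

# One-step look-ahead: score each candidate action by the divergence between the
# predicted post-action belief N(mu+, Sigma+) and the current belief N(mu, Sigma).
mu, Sigma = np.zeros(2), np.diag([4.0, 1.0])
candidates = {                                      # hypothetical predicted posteriors
    "a1": (np.zeros(2), np.diag([0.5, 1.0])),       # shrinks the uncertain axis
    "a2": (np.zeros(2), np.diag([4.0, 0.5]))}       # shrinks the already-precise axis
eig = {a: kl_gaussian(m, S, mu, Sigma) for a, (m, S) in candidates.items()}
print(eig, "->", max(eig, key=eig.get))
print({a: w2_gaussian(m, S, mu, Sigma) for a, (m, S) in candidates.items()})
```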

6. Empirical Properties and Theoretical Guarantees

  • Information-theoretic action selection frameworks furnish formal exploration-exploitation tradeoffs, with performance guarantees derived from information-ratio, entropy, and submodularity properties.
  • Algorithms such as Thompson Sampling and IMAS$^2$ exploit information-theoretic structure for statistical efficiency, dimension-insensitive regret bounds, and tractable decentralized multi-agent coordination.
  • In simulation benchmarks, such as T-maze decision-making, active inference agents optimizing expected free energy effectively interpolate between pure exploitation (utility maximization) and pure exploration (information gain maximization), outperforming agents restricted to either objective alone (Sajid et al., 2021).

7. Practical Limitations and Research Directions

The major practical limitations are rooted in computational tractability (e.g., exact expected information gain is intractable in high-dimensional and non-Gaussian settings), modeling fidelity (approximating the belief dynamics accurately), and the dependence of bounds on entropy or rate-distortion, which may still be large in some problems. Approximations such as one-step look-ahead, sampling-based estimations, or hierarchy via rate-distortion are widely employed.

A plausible implication is that future work may focus on scalable approximations of information-theoretic objectives, richer action models integrating promptable sequence models, and further leveraging structural properties (e.g., submodularity, independence assumptions) for efficient policy search in decentralized or high-dimensional domains (Min et al., 2023, Shi et al., 22 Oct 2025, Catt et al., 2021).
