Information-Theoretic Action Selection
- Information-Theoretic Action Selection is a framework where agents optimize criteria such as expected information gain and mutual information alongside utility to balance exploration and exploitation.
- It integrates methodologies from bandit learning, reinforcement learning, Bayesian planning, and active inference to derive formal exploration-exploitation strategies with practical regret bounds.
- The approach has broad applications in decentralized multi-agent systems and robotic perception, employing techniques like negative entropy rewards and submodular selection for efficient decision-making.
Information-theoretic action selection refers to algorithms and theoretical frameworks in which agents choose actions by explicitly optimizing information-theoretic criteria—such as expected information gain, mutual information, or rate-distortion—either alone, or in combination with classical utility. This paradigm encompasses action selection for exploration, Bayesian experimental design, model-based active inference, decentralized perception, and multi-agent coordination. Research from bandit learning, reinforcement learning, Bayesian planning, and robot perception all converge on the use of these metrics to formally balance epistemic (exploratory, uncertainty-reducing) and pragmatic (reward-seeking) objectives.
1. Fundamental Criteria: Information Gain, Mutual Information, and Expected Free Energy
Information-theoretic action selection is anchored in the maximization of information gain, typically expressed as reductions in uncertainty (entropy) about latent variables of interest. For a system with latent state variable $s$ and observable outcome $o$ under a candidate policy $\pi$, the expected free energy framework provides a unified formalism:
$$G(\pi) = \mathbb{E}_{Q(o, s \mid \pi)}\big[\ln Q(s \mid \pi) - \ln P(o, s)\big]$$
This admits several equivalent decompositions:
- Risk + Ambiguity:
$$G(\pi) = \underbrace{D_{\mathrm{KL}}\big[Q(s \mid \pi)\,\|\,P(s)\big]}_{\text{risk}} \;+\; \underbrace{\mathbb{E}_{Q(s \mid \pi)}\big[H[P(o \mid s)]\big]}_{\text{ambiguity}}$$
where the first term penalizes deviation from preferred states and the second captures expected outcome entropy (ambiguity).
- Expected Utility – Information Gain:
$$G(\pi) = -\,\mathbb{E}_{Q(o \mid \pi)}\big[\ln P(o)\big] \;-\; \mathbb{E}_{Q(o \mid \pi)}\Big[D_{\mathrm{KL}}\big[Q(s \mid o, \pi)\,\|\,Q(s \mid \pi)\big]\Big]$$
The first term corresponds to extrinsic value (expected utility), while the second is the expected information gain (intrinsic epistemic value); a numerical sketch of these terms follows this list.
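To make the decompositions concrete, the following sketch evaluates both forms of $G(\pi)$ for a hypothetical two-state, two-outcome discrete generative model. All arrays are illustrative placeholders rather than quantities from the cited papers, and the two returned values need not match exactly, since the decompositions rest on slightly different factorizations of the generative model.

```python
import numpy as np

def expected_free_energy(Q_s, A, log_C_o, log_P_s, eps=1e-12):
    """Two decompositions of the expected free energy G(pi) for one time step.

    Q_s     : (S,) predictive state belief Q(s | pi)
    A       : (O, S) likelihood P(o | s); columns sum to 1
    log_C_o : (O,) log prior preferences over outcomes, ln P(o)
    log_P_s : (S,) log prior preferences over states, ln P(s)
    """
    Q_o = A @ Q_s                                              # predictive outcomes Q(o | pi)

    # --- Decomposition 1: risk + ambiguity ----------------------------------
    risk = np.sum(Q_s * (np.log(Q_s + eps) - log_P_s))         # KL[Q(s|pi) || P(s)]
    ambiguity = -np.sum(Q_s * np.sum(A * np.log(A + eps), axis=0))  # E_Q(s)[ H[P(o|s)] ]
    G_risk_ambiguity = risk + ambiguity

    # --- Decomposition 2: -(expected utility) - (information gain) ----------
    utility = np.sum(Q_o * log_C_o)                            # E_Q(o)[ ln P(o) ]
    post = A * Q_s[None, :]                                    # joint Q(o, s | pi)
    post = post / (post.sum(axis=1, keepdims=True) + eps)      # posterior Q(s | o, pi)
    info_gain = np.sum(Q_o * np.sum(
        post * (np.log(post + eps) - np.log(Q_s + eps)[None, :]), axis=1))
    G_utility_infogain = -utility - info_gain

    return G_risk_ambiguity, G_utility_infogain

# Hypothetical two-state, two-outcome toy model
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])            # P(o | s)
Q_s = np.array([0.6, 0.4])            # predicted states under some policy
log_C_o = np.log([0.7, 0.3])          # outcome preferences ln P(o)
log_P_s = np.log([0.7, 0.3])          # state preferences ln P(s)
print(expected_free_energy(Q_s, A, log_C_o, log_P_s))
```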
Action selection rules derived from these criteria naturally interpolate between pure reward-seeking, pure exploration, and hybrid strategies:
| Selection Principle | Formal Rule | Asymptotic Behavior |
|---|---|---|
| Active Inference | $\pi^* = \arg\min_\pi G(\pi)$ | Balanced exploration/exploitation |
| Bayesian Optimal Design | $\pi^* = \arg\max_\pi \mathbb{E}_{Q(o \mid \pi)}\big[D_{\mathrm{KL}}[Q(s \mid o, \pi)\,\|\,Q(s \mid \pi)]\big]$ (if preferences $\ln P(o)$ are uniform) | Intrinsically motivated exploration |
| Bayesian Decision Theory | $\pi^* = \arg\max_\pi \mathbb{E}_{Q(o \mid \pi)}\big[\ln P(o)\big]$ (if ambiguity $= 0$) | Pure exploitation |
Selection between these regimes governs emergent exploration–exploitation trade-offs (Sajid et al., 2021).
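The two limiting regimes in the table follow directly from the decompositions above; a brief sketch of the reasoning, using the notation introduced in this section:
$$\ln P(o) = \text{const} \;\Rightarrow\; \arg\min_\pi G(\pi) = \arg\max_\pi \mathbb{E}_{Q(o \mid \pi)}\big[D_{\mathrm{KL}}[Q(s \mid o, \pi)\,\|\,Q(s \mid \pi)]\big],$$
i.e., with flat outcome preferences the utility term is policy-independent and only the epistemic (information-gain) term remains, recovering Bayesian optimal design. Conversely,
$$H[P(o \mid s)] = 0 \;\;\forall s \;\Rightarrow\; \arg\min_\pi G(\pi) = \arg\min_\pi D_{\mathrm{KL}}[Q(s \mid \pi)\,\|\,P(s)],$$
i.e., with no ambiguity only the risk (preference-seeking) term remains, recovering Bayesian decision theory and pure exploitation.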
2. Information-Theoretic Bounds in Bandit and RL Settings
Bayesian and nonstationary bandit problems illustrate the power and limitations of information-theoretic selection:
- Classical Entropy-based Regret Bounds: Russo and Van Roy’s regret bounds for Bayesian bandit algorithms are proportional to $\sqrt{\bar\Gamma\, H(A^*)\, T}$, where $H(A^*)$ is the entropy of the optimal arm under the prior and $\bar\Gamma$ the information ratio (Dong et al., 2018). However, $H(A^*)$ can grow without bound in high-cardinality or continuous-action settings (see the Thompson Sampling sketch after this list).
- Rate-Distortion–based Regret Bounds: By quantifying the minimal mutual information needed to select a near-optimal action (given an error tolerance $\epsilon$), rate-distortion theory yields much tighter regret bounds:
$$\mathbb{E}[\mathrm{Regret}(T)] \;\le\; \sqrt{\bar\Gamma\, \mathcal{R}(\epsilon)\, T} \;+\; \epsilon T,$$
where $\mathcal{R}(\epsilon)$ is the rate-distortion function: the infimum of the mutual information $I(\theta; \tilde\theta)$ over compressed statistics $\tilde\theta$ of the environment such that acting optimally with respect to $\tilde\theta$ incurs at most $\epsilon$ expected per-period loss.
In linear and logistic bandits, this leads to $\tilde{O}(d\sqrt{T})$ regret with dimension $d$—independent of action set cardinality (Dong et al., 2018).
- Entropy Rate for Nonstationary Environments: For temporally evolving optimal actions, the per-step entropy rate of the latent optimal-action process $\{A^*_t\}$ tightly controls the achievable regret:
$$\frac{1}{T}\,\mathbb{E}[\mathrm{Regret}(T)] \;\lesssim\; \sqrt{\bar\Gamma \cdot \frac{H(A^*_{1:T})}{T}}.$$
Efficient algorithms like Thompson Sampling achieve near-optimal regret if and only if the process remains sufficiently predictable (Min et al., 2023).
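As a small illustration of the quantities entering these bounds, the sketch below runs Thompson Sampling on a hypothetical Bernoulli bandit and estimates the prior entropy of the optimal arm, $H(A^*)$, by Monte Carlo. The problem parameters are illustrative, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli arm means
K, T = len(true_means), 2000

# Monte Carlo estimate of H(A*) under a uniform Beta(1, 1) prior on each arm
samples = rng.beta(1.0, 1.0, size=(100_000, K))
p_opt = np.bincount(samples.argmax(axis=1), minlength=K) / samples.shape[0]
H_Astar = -np.sum(p_opt * np.log(p_opt + 1e-12))   # enters the sqrt(Gamma * H(A*) * T) bound

# Thompson Sampling with Beta posteriors
alpha, beta = np.ones(K), np.ones(K)
regret = 0.0
for _ in range(T):
    a = rng.beta(alpha, beta).argmax()             # sample a model, act greedily w.r.t. it
    r = rng.random() < true_means[a]               # Bernoulli reward
    alpha[a] += r
    beta[a] += 1 - r
    regret += true_means.max() - true_means[a]

print(f"H(A*) ~= {H_Astar:.3f} nats, cumulative regret ~= {regret:.1f}")
```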
3. Algorithmic Realizations in RL and Perception
3.1 Model-Free Information-Theoretic RL
Reinforcement learning can directly embed negative entropy or information gain as the reward signal, e.g.,
$$r_t = -\,H\big(y_t \mid z_{1:t}\big) \quad \text{or} \quad r_t = I\big(y_t;\, z_t \mid z_{1:t-1}\big),$$
where $y_t$ is the latent target state and $z_{1:t}$ the measurement history.
Such formulations support deep Q-network–based agents that learn action selection for active information acquisition, matching or exceeding the performance of model-based planners in multi-target tracking and perception tasks, while remaining model-free and not tied to a fixed planning horizon (Jeong et al., 2019).
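A minimal sketch of such a reward, assuming the agent maintains a Gaussian (Kalman-filter) belief over the target state; the covariance values below are a generic illustration, not the specific sensor model used by Jeong et al. (2019).

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian belief, in nats."""
    d = cov.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

def info_reward(cov_before, cov_after):
    """Entropy reduction achieved by the action's measurement (information gain).

    Using -gaussian_entropy(cov_after) alone gives the plain negative-entropy variant.
    """
    return gaussian_entropy(cov_before) - gaussian_entropy(cov_after)

# Illustrative example: an action whose measurement halves the target covariance
cov_before = np.diag([4.0, 4.0])
cov_after = np.diag([2.0, 2.0])
print(info_reward(cov_before, cov_after))   # positive => informative action
```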
3.2 Multi-Agent and Decentralized Perception
In decentralized multi-agent systems (Dec-POMDPs), selection of both agents and agent policies is optimized by maximizing the mutual information between the latent variable $W$ (e.g., world trajectory or secret) and the joint observation histories of the selected agents, using a two-layer greedy-submodular algorithm (IMAS):
$$\max_{\mathcal{S} \subseteq \mathcal{N},\, |\mathcal{S}| \le k}\;\; \max_{\{\pi_i\}_{i \in \mathcal{S}}}\; I\big(W;\, O^{\pi}_{\mathcal{S}}\big),$$
where $O^{\pi}_{\mathcal{S}}$ denotes the joint observation history of the selected agents under their chosen policies.
Submodularity ensures a $1-1/e$ performance guarantee for the greedy selection of agents and their policies (Shi et al., 22 Oct 2025).
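A schematic sketch of the outer greedy agent-selection layer, assuming a caller-supplied marginal-gain oracle that scores how much mutual information a candidate agent adds given the agents already selected. The coverage-style oracle and budget below are illustrative stand-ins, not the IMAS implementation of Shi et al. (2025).

```python
from typing import Callable, Iterable, List, Set

def greedy_agent_selection(
    agents: Iterable[int],
    budget: int,
    marginal_gain: Callable[[Set[int], int], float],
) -> List[int]:
    """Greedy maximization of a monotone submodular objective (e.g. mutual information).

    At each step, add the agent with the largest marginal information gain.
    Submodularity of I(W; O_S) in the selected set S yields the (1 - 1/e) guarantee.
    """
    selected: Set[int] = set()
    order: List[int] = []
    remaining = set(agents)
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining, key=lambda a: marginal_gain(selected, a))
        if marginal_gain(selected, best) <= 0:
            break
        selected.add(best)
        order.append(best)
        remaining.remove(best)
    return order

# Hypothetical oracle: diminishing returns over overlapping sensing footprints
coverage = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 4}}

def coverage_gain(S: Set[int], a: int) -> float:
    covered = set().union(*(coverage[s] for s in S)) if S else set()
    return len(coverage[a] - covered)

print(greedy_agent_selection(coverage.keys(), budget=2, marginal_gain=coverage_gain))
```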
4. Action Spaces and Information-Theoretic Actuation
Beyond the classical view where the agent’s action is considered an atomic primitive, action selection can be modeled as a compressed coding process:
- Internal–External Action Decomposition: External actions are generated from internal bit-sequences via an arithmetic decoder informed by an action model (a distribution $\rho(a)$ over external actions). The code length $-\log_2 \rho(a)$ directly expresses the information-theoretic cost of emitting action $a$.
- Augmented MDPs: An internal MDP over internal deliberative states admits Bellman equations and a value-congruence property, whereby optimization at the internal level translates to rational action selection at the external interface.
This formalism naturally incorporates sequence-model priors for multitask RL and regularizes exploration via KL divergence to the action prior (Catt et al., 2021).
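A simplified sketch of the information-cost view of actuation: each candidate external action is charged its code length under an action prior, which is equivalent to KL-regularizing the resulting policy toward that prior. The prior, Q-values, and temperature below are hypothetical placeholders, and the soft selection rule is a generic illustration rather than the exact scheme of Catt et al. (2021).

```python
import numpy as np

def info_regularized_policy(q_values, action_prior, beta=1.0):
    """Trade off action value against information cost under an action prior.

    Each action a is charged its code length -log2 prior(a); equivalently, the
    softmax policy below maximizes E_pi[Q] - (1/beta) * KL(pi || prior).
    """
    q_values = np.asarray(q_values, dtype=float)
    prior = np.asarray(action_prior, dtype=float)
    code_length = -np.log2(prior)                      # information cost per action (bits)
    logits = beta * q_values + np.log(prior)           # prior-weighted soft selection
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()
    return policy, code_length

# Hypothetical 3-action example: the prior makes action 2 cheap to "encode"
policy, bits = info_regularized_policy(q_values=[1.0, 1.2, 0.9],
                                        action_prior=[0.2, 0.2, 0.6],
                                        beta=2.0)
print(np.round(policy, 3), np.round(bits, 2))
```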
5. Divergence Criteria and Practical Planning Algorithms
In practical robotic perception and active data acquisition, action selection is typically computed by maximizing closed-form information divergences between pre- and post-action beliefs. Under Gaussian approximations, standard metrics include the Kullback–Leibler divergence, Rényi divergence, Bhattacharyya distance, Fisher information metric, and Wasserstein distance. One-step look-ahead approximations are used for efficiency (a computational sketch follows the table below):
$$a^* = \arg\max_{a}\; \mathbb{E}_{z \sim p(z \mid b_t, a)}\Big[\, D\big(b_{t+1}(\cdot \mid a, z)\,\big\|\, b_t\big) \Big],$$
where $b_t$ is the current belief and $b_{t+1}$ the belief after incorporating the anticipated measurement $z$ from action $a$.
Empirical comparisons show that all such criteria can drive rapid convergence to high-precision state estimates with sparse observations, with minor variance in convergence speed and final error (Murali et al., 2021).
| Divergence Metric | Formula (Gaussian case, $\mathcal{N}_0(\mu_0,\Sigma_0)$ vs. $\mathcal{N}_1(\mu_1,\Sigma_1)$) | Primary Utility |
|---|---|---|
| KL divergence | $\tfrac12\big[\operatorname{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1-\mu_0)^\top\Sigma_1^{-1}(\mu_1-\mu_0) - d + \ln\tfrac{\lvert\Sigma_1\rvert}{\lvert\Sigma_0\rvert}\big]$ | Standard MI/I-divergence |
| Rényi divergence (order $\alpha$) | $\tfrac{\alpha}{2}(\mu_1-\mu_0)^\top\Sigma_\alpha^{-1}(\mu_1-\mu_0) - \tfrac{1}{2(\alpha-1)}\ln\tfrac{\lvert\Sigma_\alpha\rvert}{\lvert\Sigma_0\rvert^{1-\alpha}\lvert\Sigma_1\rvert^{\alpha}}$, with $\Sigma_\alpha = \alpha\Sigma_1 + (1-\alpha)\Sigma_0$ | Parametric sensitivity |
| Bhattacharyya distance | $\tfrac18(\mu_1-\mu_0)^\top\bar\Sigma^{-1}(\mu_1-\mu_0) + \tfrac12\ln\tfrac{\lvert\bar\Sigma\rvert}{\sqrt{\lvert\Sigma_0\rvert\lvert\Sigma_1\rvert}}$, with $\bar\Sigma = \tfrac{\Sigma_0+\Sigma_1}{2}$ | Robust similarity |
| Wasserstein-2 distance (squared) | $\lVert\mu_0-\mu_1\rVert^2 + \operatorname{tr}\big(\Sigma_0 + \Sigma_1 - 2(\Sigma_1^{1/2}\Sigma_0\Sigma_1^{1/2})^{1/2}\big)$ | Geometric uncertainty |
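A minimal sketch of one-step look-ahead selection using these closed-form Gaussian divergences. The candidate actions are represented directly by their hypothetical predicted posterior covariances, which is a simplification of the belief-propagation step in Murali et al. (2021); any of the divergences below can be swapped in as the scoring criterion.

```python
import numpy as np
from scipy.linalg import sqrtm

def kl_gaussian(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) for multivariate Gaussians."""
    d = len(m0)
    diff = m1 - m0
    return 0.5 * (np.trace(np.linalg.solve(S1, S0)) + diff @ np.linalg.solve(S1, diff)
                  - d + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def bhattacharyya_gaussian(m0, S0, m1, S1):
    Sb = 0.5 * (S0 + S1)
    diff = m1 - m0
    return 0.125 * diff @ np.linalg.solve(Sb, diff) + 0.5 * np.log(
        np.linalg.det(Sb) / np.sqrt(np.linalg.det(S0) * np.linalg.det(S1)))

def wasserstein2_sq_gaussian(m0, S0, m1, S1):
    cross = sqrtm(sqrtm(S1) @ S0 @ sqrtm(S1)).real
    return np.sum((m0 - m1) ** 2) + np.trace(S0 + S1 - 2 * cross)

# Current belief and hypothetical post-action beliefs (one per candidate action)
m0, S0 = np.zeros(2), np.diag([4.0, 4.0])
candidates = {"a1": (m0, np.diag([1.0, 4.0])),   # informative about dimension 0 only
              "a2": (m0, np.diag([2.0, 2.0]))}   # moderately informative about both

# Score each action by KL(post || prior), i.e. how far the belief would move
scores = {a: kl_gaussian(m1, S1, m0, S0) for a, (m1, S1) in candidates.items()}
best_action = max(scores, key=scores.get)
print(scores, "-> select", best_action)
```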
6. Empirical Properties and Theoretical Guarantees
- Information-theoretic action selection frameworks furnish formal exploration-exploitation tradeoffs, with performance guarantees derived from information-ratio, entropy, and submodularity properties.
- Algorithms such as Thompson Sampling and IMAS exploit information-theoretic structure to achieve statistical efficiency, regret bounds that are insensitive to action-set cardinality, and tractable decentralized multi-agent selection.
- In simulation benchmarks, such as T-maze decision-making, active inference agents optimizing expected free energy effectively interpolate between pure exploitation (utility maximization) and pure exploration (information gain maximization), outperforming agents restricted to either objective alone (Sajid et al., 2021).
7. Practical Limitations and Research Directions
The major practical limitations are rooted in computational tractability (e.g., exact expected information gain is intractable in high-dimensional and non-Gaussian settings), modeling fidelity (approximating the belief dynamics accurately), and the dependence of regret bounds on entropy or rate-distortion quantities, which may themselves remain large in some problems. Approximations such as one-step look-ahead, sampling-based estimation, and rate-distortion–based compression are widely employed.
A plausible implication is that future work may focus on scalable approximations of information-theoretic objectives, richer action models integrating promptable sequence models, and further leveraging structural properties (e.g., submodularity, independence assumptions) for efficient policy search in decentralized or high-dimensional domains (Min et al., 2023, Shi et al., 22 Oct 2025, Catt et al., 2021).