An Information-Geometric Approach to Artificial Curiosity

Published 8 Apr 2025 in cs.LG | (2504.06355v1)

Abstract: Learning in environments with sparse rewards remains a fundamental challenge in reinforcement learning. Artificial curiosity addresses this limitation through intrinsic rewards to guide exploration; however, the precise formulation of these rewards has remained elusive. Ideally, such rewards should depend on the agent's information about the environment, remaining agnostic to the representation of the information -- an invariance central to information geometry. Leveraging information geometry, we show that invariance under congruent Markov morphisms and the agent-environment interaction uniquely constrains intrinsic rewards to concave functions of the reciprocal occupancy. Additional geometrically motivated restrictions effectively limit the candidates to those determined by a real parameter that governs the occupancy space geometry. Remarkably, special values of this parameter are found to correspond to count-based and maximum entropy exploration, revealing a geometric exploration-exploitation trade-off. This framework provides important constraints on the engineering of intrinsic rewards while integrating foundational exploration methods into a single, cohesive model.


Authors (2)

Summary

The paper introduces a novel intrinsic reward framework based on information geometry, in which invariance requirements constrain rewards to concave functions of the reciprocal occupancy.
It unifies count-based and maximum entropy exploration by deriving α-information rewards that trace geodesic paths on the occupancy manifold.
The work offers practical insights for reward design, occupancy estimation, and optimization, with implications for both reinforcement learning and neuroscience.

This paper introduces a novel framework for designing intrinsic rewards in reinforcement learning (RL) based on information geometry, aiming to address the challenge of exploration in environments with sparse extrinsic rewards. The core idea is that intrinsic rewards should depend on the agent's information about the environment, represented by the state occupancy distribution ($p_\pi$), and should be invariant to how this information is represented.

The authors leverage the concept of invariance under congruent Markov morphisms (information-preserving maps) and the agent-environment interaction dynamics. They demonstrate that these invariance requirements uniquely constrain the form of intrinsic rewards. Specifically, Theorem 3.2 shows that any probability-based intrinsic reward $\bar{r}(s; p)$ that is invariant under the agent-environment interaction must use the stationary occupancy distribution $p_\pi$. Furthermore, consider the intrinsic return $\bar{R}(p_\pi)$ derived from such an occupancy-based reward $\bar{r}(s) = \bar{f}[p_\pi(s)]$. For this return to satisfy the data processing inequality (i.e., information cannot increase under processing by a statistic $\kappa$, with equality only for sufficient statistics), the reward must take the form $\bar{r}(s) = I_f(s; p_\pi) = f[1/p_\pi(s)]$, where $f$ is a strictly concave function. These are termed "$f$-information rewards". This narrows the space of principled intrinsic rewards from arbitrary functions to strictly concave functions applied to the reciprocal occupancy.
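The form $f[1/p_\pi(s)]$ can be illustrated on a small tabular occupancy. The sketch below is not taken from the paper: the occupancy vector, the two choices of $f$ (logarithm and shifted square root) and the helper names are assumptions made for illustration. It computes the per-state reward and numerically checks, for these two examples only, that merging two states (a lossy statistic) does not increase the expected intrinsic reward $\sum_s p(s)\, f[1/p(s)]$.

```python
import numpy as np

def f_information_reward(p, f):
    """Per-state reward f(1 / p(s)) for a tabular occupancy p."""
    p = np.asarray(p, dtype=float)
    return f(1.0 / p)

def intrinsic_return(p, f):
    """Expected intrinsic reward under p: sum_s p(s) * f(1 / p(s))."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * f_information_reward(p, f)))

def merge_states(p, i, j):
    """Coarse-grain p by merging states i and j into a single state."""
    p = np.asarray(p, dtype=float)
    keep = [k for k in range(len(p)) if k not in (i, j)]
    return np.append(p[keep], p[i] + p[j])

# Hypothetical occupancy over four states (illustrative values only).
p = np.array([0.5, 0.3, 0.15, 0.05])

# Two concave choices of f, both assumptions for this example.
log_f = np.log
sqrt_f = lambda x: np.sqrt(x) - 1.0

rewards = f_information_reward(p, log_f)  # rarer states get larger rewards
coarse = merge_states(p, 2, 3)            # merged occupancy [0.5, 0.3, 0.2]
```

This is a numerical illustration rather than a proof: it only shows that the inequality holds for one occupancy and two particular functions.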

The paper further refines this by considering geometrically motivated restrictions, focusing on $\alpha$-information rewards, $I_\alpha(s; p_\pi)$, derived from functions $f_\alpha$ associated with $\alpha$-divergences. These functions are shown to be uniquely "geodetic" (Theorem 3.1), meaning their induced divergences have gradients aligned with geodesics in the space of positive measures. This property leads to a principled exploration-exploitation trade-off.

A key contribution (Theorem 3.3) is demonstrating that artificial curiosity using $\alpha$-information rewards unifies two foundational exploration strategies:

Count-based exploration: Corresponds to $\alpha=0$. This utilizes $I_0(s; p_\pi) \propto (p_\pi(s)^{-1/2} - 1)$, which relates to the $1/\sqrt{n(s)}$ bonus when $p_\pi(s)$ is proportional to the state count $n(s)$. The corresponding geometry is Riemannian (Hellinger distance).
Maximum entropy exploration: Corresponds to $\alpha=-1$. This utilizes $I_{-1}(s; p_\pi) = -\log p_\pi(s)$, aligning the intrinsic return objective with maximizing the Shannon entropy $H(p_\pi)$ of the occupancy distribution. The corresponding geometry is dually flat (KL divergence).
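The two special cases above can be reproduced with a single formula. The sketch below assumes the parametrization $f_\alpha(x) = \frac{2}{1+\alpha}\left(x^{(1+\alpha)/2} - 1\right)$ with $x = 1/p_\pi(s)$, chosen here only because it matches the two quoted cases ($\alpha=0$ up to a constant factor, and $\alpha \to -1$ as a limit). The paper's own normalization may differ, and the function name `alpha_information` is hypothetical.

```python
import numpy as np

def alpha_information(p, alpha):
    """Assumed alpha-information reward f_alpha(1 / p).

    f_alpha(x) = (x**c - 1) / c with c = (1 + alpha) / 2, and log(x) in the
    limit alpha -> -1. In this assumed form f_alpha is concave for alpha < 1.
    """
    p = np.asarray(p, dtype=float)
    x = 1.0 / p
    if np.isclose(alpha, -1.0):
        return np.log(x)
    c = (1.0 + alpha) / 2.0
    return (x ** c - 1.0) / c

# Hypothetical occupancy over three states.
p = np.array([0.6, 0.3, 0.1])

i0 = alpha_information(p, 0.0)    # equals 2 * (p**-0.5 - 1): count-based form
im1 = alpha_information(p, -1.0)  # equals -log p: maximum entropy form

# For alpha = -1 the expected intrinsic reward is the Shannon entropy of p.
entropy = -np.sum(p * np.log(p))
expected_im1 = np.sum(p * im1)
```

With $p_\pi(s) = n(s)/N$, the $\alpha=0$ reward becomes $2(\sqrt{N/n(s)} - 1)$, which is where the $1/\sqrt{n(s)}$ scaling comes from.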

The paper explores the geometric interpretation of this framework on the "occupancy manifold," the space of possible occupancy distributions endowed with a geometry induced by the $\alpha$-divergence. The parameter $\alpha$ controls the curvature of this manifold. The optimal occupancy distributions $p_{\alpha,\beta}$ achieved by maximizing the total return (extrinsic + intrinsic) are shown to be $\alpha$-projections from the uniform distribution $u$ onto isoreturn hyperplanes (Theorem 4.2). Importantly, Theorem 4.4 reveals that varying the exploration-exploitation trade-off parameter $\beta$ traces an $(\alpha+2)$-geodesic path on the occupancy manifold, connecting the purely exploitative policy ($p_\alpha^*$, maximizing $R(\pi)$) to the purely explorative policy ($u$, uniform occupancy).
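The role of $\beta$ can be seen in a simplified version of the $\alpha=-1$ case. The sketch below makes a strong assumption that the paper does not: every distribution on the simplex is treated as an achievable occupancy, so the environment dynamics are ignored. Under that assumption, maximizing $\sum_s p(s)\,r(s) + \beta H(p)$ has the standard closed-form solution $p \propto \exp(r/\beta)$, a one-parameter exponential family through $u$. The reward vector and function names are hypothetical.

```python
import numpy as np

def maxent_optimal_occupancy(r, beta):
    """Maximizer of sum_s p(s) r(s) + beta * H(p) over the simplex: softmax(r / beta)."""
    z = np.asarray(r, dtype=float) / beta
    z -= z.max()  # subtract the max so the exponentials cannot overflow
    w = np.exp(z)
    return w / w.sum()

def objective(p, r, beta):
    """Extrinsic return plus beta times the Shannon entropy of p."""
    return float(np.sum(p * r) - beta * np.sum(p * np.log(p)))

# Hypothetical extrinsic rewards for three states.
r = np.array([1.0, 0.5, 0.0])

# Small beta favours exploitation, large beta favours exploration.
betas = [0.05, 0.5, 5.0, 500.0]
path = [maxent_optimal_occupancy(r, b) for b in betas]
```

As $\beta$ shrinks, the occupancy concentrates on the highest-reward state. As $\beta$ grows, it approaches the uniform distribution $u$.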

Practical Implications:

Reward Design: Practitioners should use intrinsic rewards of the form $\beta f[1/p_\pi(s)]$ where $f$ is concave. The $\alpha$-information family $I_\alpha$ is particularly recommended due to its geometric properties and connection to known methods. $I_0$ (generalized count-based) is highlighted as a promising candidate.
Occupancy Estimation: Implementing these rewards requires estimating the occupancy density $p_\pi(s)$, especially in continuous or large state spaces. Methods like k-nearest neighbor or k-means density estimation, previously used in maximum entropy exploration, can be employed.
Optimization: While natural gradient ascent in the occupancy manifold offers theoretical advantages (such as convexity for $\alpha=-1$, see Proposition 4.5), it is generally intractable. Standard policy gradient methods can be used instead, and the paper suggests a scaling adjustment (Eq. 23) for consistency when using non-$\alpha$-information rewards; for $\alpha$-information rewards this adjustment is handled implicitly.
Neuroscience: The framework suggests that biological novelty-seeking (linked to $\alpha=0$) and surprise (linked to $\alpha=-1$) might represent points on a continuous spectrum governed by the geometric parameter $\alpha$.
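The reward design and occupancy estimation points can be combined in a short sketch. It uses a plain k-nearest-neighbor density estimate, $\hat{p}(s) = k / (N V_d r_k(s)^d)$, followed by the $\alpha=0$ reward $\beta(\hat{p}(s)^{-1/2} - 1)$. The estimator details, the function names, the default values of `k` and `beta`, and the synthetic states are all assumptions for illustration, not the paper's implementation.

```python
import math
import numpy as np

def knn_density(states, k=5, eps=1e-12):
    """k-NN density estimate at each sample: k / (N * volume of the k-NN ball)."""
    x = np.asarray(states, dtype=float)
    n, d = x.shape
    # Pairwise Euclidean distances, shape (n, n); fine for small batches.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # After sorting, column 0 is the point itself, so column k is the k-th neighbor.
    r_k = np.maximum(np.sort(dists, axis=1)[:, k], eps)
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return k / (n * unit_ball * r_k ** d)

def intrinsic_reward(states, beta=0.1, k=5):
    """beta * (p_hat**-0.5 - 1): the alpha = 0 reward on the estimated density."""
    p_hat = knn_density(states, k)
    return beta * (p_hat ** -0.5 - 1.0)

# Synthetic visited states: a frequently visited cluster and a rarely visited one.
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(200, 2))
sparse = rng.normal(5.0, 1.0, size=(20, 2))
states = np.vstack([dense, sparse])

rewards = intrinsic_reward(states)  # larger for states in the sparse cluster
```

Because $\hat{p}$ is a density rather than a probability mass, it can exceed 1, so the reward can be negative in heavily visited regions. The pairwise distance matrix scales quadratically with the batch size, so a spatial index would be a natural replacement for larger batches.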

In summary, the paper provides a unifying information-geometric foundation for artificial curiosity, constraining intrinsic rewards to concave functions of reciprocal occupancy based on invariance principles. It demonstrates that $\alpha$-information rewards derived from this framework generalize and connect count-based and maximum entropy exploration through the geometry of the occupancy space, offering a principled approach to balancing exploration and exploitation.