
An Information-Geometric Approach to Artificial Curiosity

Published 8 Apr 2025 in cs.LG | (2504.06355v1)

Abstract: Learning in environments with sparse rewards remains a fundamental challenge in reinforcement learning. Artificial curiosity addresses this limitation through intrinsic rewards to guide exploration; however, the precise formulation of these rewards has remained elusive. Ideally, such rewards should depend on the agent's information about the environment, remaining agnostic to the representation of the information -- an invariance central to information geometry. Leveraging information geometry, we show that invariance under congruent Markov morphisms and the agent-environment interaction uniquely constrains intrinsic rewards to concave functions of the reciprocal occupancy. Additional geometrically motivated restrictions effectively limit the candidates to those determined by a real parameter that governs the occupancy space geometry. Remarkably, special values of this parameter are found to correspond to count-based and maximum entropy exploration, revealing a geometric exploration-exploitation trade-off. This framework provides important constraints on the engineering of intrinsic rewards while integrating foundational exploration methods into a single, cohesive model.

Summary

  • The paper introduces a novel intrinsic reward framework based on information geometry, in which invariance requirements constrain rewards to concave functions of the reciprocal occupancy.
  • It unifies count-based and maximum entropy exploration by deriving α-information rewards that trace geodesic paths on the occupancy manifold.
  • The work offers practical insights for reward design, occupancy estimation, and optimization, with implications for both reinforcement learning and neuroscience.

This paper introduces a novel framework for designing intrinsic rewards in reinforcement learning (RL) based on information geometry, aiming to address the challenge of exploration in environments with sparse extrinsic rewards. The core idea is that intrinsic rewards should depend on the agent's information about the environment, represented by the state occupancy distribution ($p_\pi$), and should be invariant to how this information is represented.

The authors leverage the concept of invariance under congruent Markov morphisms (information-preserving maps) and the agent-environment interaction dynamics. They demonstrate that these invariance requirements uniquely constrain the form of intrinsic rewards. Specifically, Theorem 3.2 shows that any probability-based intrinsic reward $\bar{r}(s; p)$ that is invariant under the agent-environment interaction must use the stationary occupancy distribution $p_\pi$. Furthermore, for the intrinsic return $\bar{R}(p_\pi)$ derived from such an occupancy-based reward $\bar{r}(s) = \bar{f}[p_\pi(s)]$ to satisfy the data processing inequality (i.e., information cannot increase under processing by a statistic $\kappa$, with equality only for sufficient statistics), the reward must take the form $\bar{r}(s) = I_f(s; p_\pi) = f[1/p_\pi(s)]$, where $f$ is a strictly concave function. These are termed "$f$-information rewards". This significantly narrows down the possibilities for principled intrinsic reward design from arbitrary functions to the space of strictly concave functions applied to the reciprocal occupancy.
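As a rough illustration of the form $f[1/p_\pi(s)]$, the following sketch evaluates an $f$-information reward on a small tabular occupancy. The function names, the three-state occupancy, and the choice of `math.sqrt` as the concave $f$ are assumptions made for this example, not taken from the paper; the intrinsic return is assumed here to be the occupancy-weighted average of the reward.

```python
import math

def f_information_reward(occupancy, f):
    """Map each state to f(1 / p(s)) for a strictly concave f.

    `occupancy` is a dict {state: probability}; states with zero
    probability are rejected because the reciprocal is undefined.
    """
    rewards = {}
    for state, p in occupancy.items():
        if p <= 0.0:
            raise ValueError(f"occupancy of {state!r} must be positive")
        rewards[state] = f(1.0 / p)
    return rewards

def intrinsic_return(occupancy, f):
    """Occupancy-weighted average reward (assumed form of the return)."""
    rewards = f_information_reward(occupancy, f)
    return sum(occupancy[s] * rewards[s] for s in occupancy)

# Hypothetical occupancy over three states.
occupancy = {"A": 0.5, "B": 0.25, "C": 0.25}

# math.sqrt is strictly concave on the positive reals.
rewards = f_information_reward(occupancy, math.sqrt)
for state, r in rewards.items():
    print(state, round(r, 4))
print("intrinsic return:", round(intrinsic_return(occupancy, math.sqrt), 4))
```

With an increasing concave $f$ such as the square root, rarely visited states (B and C here) receive a larger reward than the frequently visited state A.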

The paper further refines this by considering geometrically motivated restrictions, focusing on $\alpha$-information rewards, $I_\alpha(s; p_\pi)$, derived from functions $f_\alpha$ associated with $\alpha$-divergences. These functions are shown to be uniquely "geodetic" (Theorem 3.1), meaning their induced divergences have gradients aligned with geodesics in the space of positive measures. This property leads to a principled exploration-exploitation trade-off.

A key contribution (Theorem 3.3) is demonstrating that artificial curiosity using $\alpha$-information rewards unifies two foundational exploration strategies:

  • Count-based exploration: Corresponds to $\alpha = 0$. This utilizes $I_0(s; p_\pi) \propto (p_\pi(s)^{-1/2} - 1)$, which relates to the $1/\sqrt{n(s)}$ bonus when $p_\pi(s)$ is proportional to the state count $n(s)$. This geometry is Riemannian (Hellinger distance).
  • Maximum entropy exploration: Corresponds to $\alpha = -1$. This utilizes $I_{-1}(s; p_\pi) = -\log p_\pi(s)$, aligning the intrinsic return objective with maximizing the Shannon entropy $H(p_\pi)$ of the occupancy distribution. This geometry is dually flat (KL divergence).
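The two special cases can be compared numerically. The sketch below is illustrative only: the visit counts and function names are hypothetical, the proportionality constant in $I_0$ is set to 1, and the occupancy is assumed to be the normalized visit count.

```python
import math

def occupancy_from_counts(counts):
    """Normalize visit counts n(s) into an empirical occupancy p(s)."""
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}

def count_based_reward(p):
    """alpha = 0 case: p^(-1/2) - 1 (proportionality constant assumed to be 1)."""
    return p ** -0.5 - 1.0

def max_entropy_reward(p):
    """alpha = -1 case: -log p."""
    return -math.log(p)

def expected_reward(occupancy, reward):
    """Occupancy-weighted average of a per-state reward."""
    return sum(p * reward(p) for p in occupancy.values())

# Hypothetical visit counts.
counts = {"start": 16, "hall": 4, "vault": 1}
occupancy = occupancy_from_counts(counts)

for state, p in occupancy.items():
    print(state, round(count_based_reward(p), 3), round(max_entropy_reward(p), 3))

# Under -log p, the expected reward is the Shannon entropy of the occupancy.
print("entropy (nats):", round(expected_reward(occupancy, max_entropy_reward), 4))
```

Because $p(s) = n(s)/N$, the quantity $p(s)^{-1/2}$ equals $\sqrt{N}/\sqrt{n(s)}$, so quadrupling a state's count halves it, which is the $1/\sqrt{n(s)}$ scaling mentioned above.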

The paper explores the geometric interpretation of this framework on the "occupancy manifold," the space of possible occupancy distributions endowed with a geometry induced by the $\alpha$-divergence. The parameter $\alpha$ controls the curvature of this manifold. The optimal occupancy distributions $p_{\alpha,\beta}$ achieved by maximizing the total return (extrinsic + intrinsic) are shown to be $\alpha$-projections from the uniform distribution $u$ onto isoreturn hyperplanes (Theorem 4.2). Importantly, Theorem 4.4 reveals that varying the exploration-exploitation trade-off parameter $\beta$ traces an $(\alpha+2)$-geodesic path on the occupancy manifold, connecting the purely exploitative policy ($p_\alpha^*$, maximizing $R(\pi)$) to the purely explorative policy ($u$, uniform occupancy).

Practical Implications:

  • Reward Design: Practitioners should use intrinsic rewards of the form $\beta f[1/p_\pi(s)]$ where $f$ is concave. The $\alpha$-information family $I_\alpha$ is particularly recommended due to its geometric properties and connection to known methods. $I_0$ (generalized count-based) is highlighted as a promising candidate.
  • Occupancy Estimation: Implementing these rewards requires estimating the occupancy density $p_\pi(s)$, especially in continuous or large state spaces. Methods like k-nearest neighbor or k-means density estimation, previously used in maximum entropy exploration, can be employed.
  • Optimization: While natural gradient ascent in the occupancy manifold offers theoretical advantages (like convexity for $\alpha = -1$, see Proposition 4.5), it's generally intractable. Standard policy gradient methods can be used, and the paper suggests a scaling adjustment (Eq. 23) for consistency when using non-$\alpha$-information rewards, although this adjustment is implicitly handled for $\alpha$-information rewards.
  • Neuroscience: The framework suggests that biological novelty-seeking (linked to $\alpha = 0$) and surprise (linked to $\alpha = -1$) might represent points on a continuous spectrum governed by the geometric parameter $\alpha$.
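To make the occupancy-estimation point concrete, here is a minimal sketch of a k-nearest-neighbor density estimate feeding a reward of the form $\beta f[1/\hat{p}(s)]$. It is not the paper's implementation: it assumes one-dimensional states, uses the textbook estimator $\hat{p}(x) = k / (N \cdot 2 r_k)$ with $r_k$ the distance to the $k$-th nearest sample, and all names, sample values, and defaults are hypothetical.

```python
import math

def knn_density_1d(samples, query, k=3):
    """k-NN density estimate in 1D: k / (N * 2 * r_k).

    r_k is the distance from `query` to its k-th nearest sample. If the
    query is itself one of the samples, it counts as its own nearest
    neighbor at distance zero.
    """
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    distances = sorted(abs(x - query) for x in samples)
    r_k = max(distances[k - 1], 1e-12)  # guard against a zero radius
    return k / (len(samples) * 2.0 * r_k)

def intrinsic_reward(samples, query, beta=0.1, k=3, f=math.log):
    """beta * f(1 / p_hat(query)), with f concave (log by default)."""
    return beta * f(1.0 / knn_density_1d(samples, query, k))

# Hypothetical visited states: a dense cluster near 0 and one outlier.
visited = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 3.0]

dense = intrinsic_reward(visited, 0.25)   # inside the cluster
sparse = intrinsic_reward(visited, 3.0)   # at the rarely visited state
print("dense region reward:", round(dense, 4))
print("sparse region reward:", round(sparse, 4))
```

The rarely visited region receives the larger bonus. In higher dimensions the interval length $2 r_k$ would be replaced by the volume of a ball of radius $r_k$, and a spatial index would typically replace the full sort.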

In summary, the paper provides a unifying information-geometric foundation for artificial curiosity, constraining intrinsic rewards to concave functions of reciprocal occupancy based on invariance principles. It demonstrates that $\alpha$-information rewards derived from this framework generalize and connect count-based and maximum entropy exploration through the geometry of the occupancy space, offering a principled approach to balancing exploration and exploitation.
