
Eigenoption Discovery in RL

Updated 31 December 2025
  • Eigenoption discovery is a method that uses spectral analysis of state-transition graphs to create intrinsic rewards guiding temporally extended actions.
  • It leverages Laplacian eigenfunctions and successor representations to design options that navigate bottlenecks and improve credit assignment in complex environments.
  • Recent advances integrate eigenoptions with deep RL, yielding up to 40% higher state coverage and more robust adaptation in dynamic tasks.

Eigenoption discovery is a principled framework for autonomously generating temporally extended options in reinforcement learning (RL) via spectral analysis of the agent’s state-transition topology. Eigenoptions leverage the spectral decomposition of graph Laplacians or the successor representation to construct intrinsic reward functions whose optimization yields policies that traverse bottlenecks and explore diverse regions of the domain. These methods provide foundational support for both exploration and credit assignment, and have seen significant extension from tabular to deep RL settings, including online and scalable algorithms.

1. Formal Foundations: Graph Laplacian, Eigenproblem, and Intrinsic Rewards

Eigenoption discovery begins with a representation of the Markov decision process (MDP) as a state-transition graph G = (\mathcal{S}, E), with adjacency matrix A and degree matrix D defined by

A_{ss'} = \begin{cases} 1 & (s,s')\in E \\ 0 & \text{otherwise} \end{cases}, \qquad D_{ss} = \sum_{s'} A_{ss'}.

The (normalized) Laplacian is

L_{\rm norm} = I - D^{-1/2} A D^{-1/2}

or, for weighted transitions under a behavior policy \pi, the matrix W_{ss'} = \sum_a \pi(a|s)\, p(s'|s,a) is used in place of A (Klissarov et al., 2023).

The spectral decomposition

L \phi_i = \lambda_i \phi_i, \qquad 0 = \lambda_1 \leq \lambda_2 \leq \cdots

provides proto-value functions \phi_i describing the principal diffusion modes of the domain (Bar et al., 2020). Each nontrivial eigenvector (i \geq 2) defines an intrinsic reward

r^{(i)}(s, s') = \phi_i(s') - \phi_i(s),

which, if maximized, drives the agent to ascend the corresponding diffusion direction. These proto-value functions are smooth over bottlenecks (low-frequency modes) and capture both global and local topology (Klissarov et al., 2023, Bar et al., 2020).
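The construction above can be made concrete on a toy graph. The following is a minimal numpy sketch (the 4-state chain and the helper names are illustrative, not taken from the cited papers): it builds the normalized Laplacian, extracts the first nontrivial eigenvector, and evaluates the intrinsic reward for a transition.

```python
import numpy as np

def normalized_laplacian(A):
    """L_norm = I - D^{-1/2} A D^{-1/2} for adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Toy example: a 4-state chain graph 0-1-2-3.
A = np.zeros((4, 4))
for s, sp in [(0, 1), (1, 2), (2, 3)]:
    A[s, sp] = A[sp, s] = 1.0

L = normalized_laplacian(A)
eigvals, eigvecs = np.linalg.eigh(L)  # ascending: 0 = lambda_1 <= lambda_2 <= ...

phi2 = eigvecs[:, 1]  # first nontrivial eigenvector (Fiedler vector)

def intrinsic_reward(phi, s, s_next):
    """r^{(i)}(s, s') = phi_i(s') - phi_i(s)."""
    return phi[s_next] - phi[s]
```

Note that the intrinsic reward is antisymmetric in (s, s'), so maximizing it drives consistent ascent along one diffusion direction; negating \phi_i yields the option that descends it.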

2. Eigenoption Construction: Policy, Initiation, Termination

Each eigenvector gives rise to a specific eigenoption o_i = (\mathcal{I}_i, \pi_i, \beta_i):

  • Initiation set (\mathcal{I}_i): typically all states, or restricted to those supporting a positive expected increment in \phi_i (Bar et al., 2020).
  • Policy (\pi_i): greedily ascends \phi_i, e.g., \arg\max_a \mathbb{E}[\phi_i(s') | s, a], or via Q-learning on r^{(i)} (Machado et al., 2017, Kotamreddy et al., 12 Jul 2025).
  • Termination (\beta_i): terminates when \max_a \mathbb{E}[\phi_i(s') - \phi_i(s)] \leq 0 (a local extremum), or via fixed random termination for stability in deep RL (Klissarov et al., 2023, Kotamreddy et al., 12 Jul 2025).

These options are temporally extended controllers corresponding to traversing specific structural features in the state-space, from global “rooms” (low-frequency modes) to finer “corridors” (high-frequency modes) (Bar et al., 2020).
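In tabular settings with a known (or estimated) transition model, the greedy-ascent policy and the local-extremum termination rule above reduce to a few lines. A minimal sketch, assuming access to per-action transition matrices P[a] (all names are illustrative):

```python
import numpy as np

def eigenoption_policy(phi, P, s, actions):
    """pi_i(s): greedily ascend phi by picking argmax_a E[phi(s') | s, a].
    P[a] is an (assumed known) transition matrix; P[a][s] is the
    next-state distribution for action a taken in state s."""
    values = [P[a][s] @ phi for a in actions]
    return actions[int(np.argmax(values))]

def should_terminate(phi, P, s, actions):
    """beta_i(s): terminate when no action yields a positive expected
    increment, i.e. max_a E[phi(s') - phi(s)] <= 0 (a local extremum)."""
    best = max(P[a][s] @ phi - phi[s] for a in actions)
    return best <= 0.0

# Toy 4-state chain with deterministic actions 0 = left, 1 = right.
n = 4
P = {a: np.zeros((n, n)) for a in (0, 1)}
for s in range(n):
    P[0][s, max(s - 1, 0)] = 1.0
    P[1][s, min(s + 1, n - 1)] = 1.0

phi = np.array([-0.6, -0.2, 0.2, 0.6])  # a monotone stand-in for an eigenvector
```

When no model is available, the policy is instead learned by Q-learning on the intrinsic reward r^{(i)}, and the deterministic termination test is often replaced by the fixed random termination mentioned above.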

3. Deep and Online Eigenoption Discovery: Losses, Network Architectures

Scalable eigenoption discovery in high-dimensional RL domains uses direct neural approximation of Laplacian eigenfunctions or successor representation:

  • Online variational objective: Minimizes empirical Dirichlet energy under orthonormality,

G(\theta) = \frac{1}{2}\, \mathbb{E}\left[\sum_k \left(f^k_\theta(s) - f^k_\theta(s')\right)^2 \right] + \beta\, \mathbb{E}\left[\sum_{j,k} \left(f^j_\theta(s)\, f^k_\theta(s) - \delta_{jk}\right)^2 \right]

using stochastic samples from a replay buffer (Klissarov et al., 2023).

  • Network architecture: Stacked convolutional layers (2× conv, 1× FC) for image input, with d linear output units for the Laplacian embedding (Klissarov et al., 2023, Kotamreddy et al., 12 Jul 2025). No target network is used for the Laplacian net.
  • Training schedule: At each step, both the Laplacian network and the option Q-networks (e.g., Double DQN) are updated online using shared experience (Klissarov et al., 2023).
  • Successor Representation Alternative: Deep SR networks predict discounted occupancy vectors; eigenoptions correspond to top eigenvectors of the learned SR matrix (Machado et al., 2017). TD-style losses are used to incrementally update SR estimates from environment transitions.

This architecture permits eigenoption discovery from raw pixels, effectively scaling the approach to complex domains (Klissarov et al., 2023, Machado et al., 2017, Kotamreddy et al., 12 Jul 2025).
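The variational objective above can be sketched as a minibatch loss over sampled transition pairs. A minimal numpy version (in practice f_\theta is a neural network trained by stochastic gradient descent; the function name and toy batch here are illustrative):

```python
import numpy as np

def laplacian_loss(f_s, f_sp, beta=1.0):
    """Graph-drawing objective on a minibatch of transitions (s, s').
    f_s, f_sp: (batch, d) arrays of embeddings f_theta(s), f_theta(s').
    The Dirichlet term penalizes embedding differences across transitions;
    the second term pushes the d coordinates toward orthonormality,
    E[f_j(s) f_k(s)] ~ delta_jk, to prevent collapse."""
    dirichlet = 0.5 * np.mean(np.sum((f_s - f_sp) ** 2, axis=1))
    gram = (f_s.T @ f_s) / len(f_s)  # empirical E[f_j(s) f_k(s)]
    ortho = np.sum((gram - np.eye(f_s.shape[1])) ** 2)
    return dirichlet + beta * ortho
```

The loss vanishes exactly when embeddings are constant along transitions and empirically orthonormal across the batch, mirroring the smoothness and orthonormality conditions satisfied by the true Laplacian eigenfunctions.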

4. Theoretical Guarantees and Manifold Justification

Eigenoptions are theoretically justified via the connection between Laplacian eigenfunctions and the geometry of the state-space:

  • Manifold interpretation: If the environment's states lie on a Riemannian manifold, the Laplacian approximates the Laplace–Beltrami operator; its eigenfunctions form a natural function basis (Bar et al., 2020).
  • Diffusion distance: The average diffusion distance between states is captured by the spectral components of Laplacian eigenfunctions, supporting diverse coverage and efficient exploration (Bar et al., 2020).
  • Successor Representation equivalence: The eigenvectors of the SR matrix are rescaled proto-value functions, inheriting the manifold properties and further supporting the use of SR-derived eigenoptions for both discrete and stochastic environments (Machado et al., 2017).

5. Value-Aware Extensions and Option-Critic Integration

Recent work generalizes eigenoption discovery to value-aware and hierarchical settings:

  • Value-aware eigenoptions (VAEO): Option-values Q(s, o_k; \theta) are learned via intra-option Q-learning with DDQN targets, supporting mixed function approximation. Termination functions \beta_o(s) are regularized to align with external credit assignment (Kotamreddy et al., 12 Jul 2025).
  • Eigenoption-Critic (EOC): Combines intrinsic and extrinsic rewards,

r(s, a, o) = \alpha\, r_{\mathrm{in}}^{o}(s \to s') + (1 - \alpha)\, r_{\mathrm{ex}}(s, a)

and interleaves option-value learning with policy optimization (Liu et al., 2017). EOC yields robust performance under nonstationary goals and continuous state-spaces, supported by Nyström approximation for eigenvectors.

These frameworks integrate spectral skills directly into hierarchical RL architectures, eliminating strict separation between option discovery and policy optimization.
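The reward mixing and intra-option value learning described in this section can be sketched in a few lines. This is a simplified tabular version under assumed hyperparameters (the cited methods use DDQN targets and function approximation; the function names here are illustrative):

```python
import numpy as np

def eoc_reward(r_in, r_ex, alpha=0.5):
    """Mixed reward r(s,a,o) = alpha * r_in^o(s -> s') + (1-alpha) * r_ex(s,a)."""
    return alpha * r_in + (1.0 - alpha) * r_ex

def intra_option_update(Q, s, o, s_next, r, beta, lr=0.1, gamma=0.99):
    """One intra-option Q-learning step on an option-value table Q[s, o].
    With prob. (1 - beta) the option continues (bootstrap on Q[s', o]);
    with prob. beta it terminates (bootstrap on max over options at s')."""
    target = r + gamma * ((1 - beta) * Q[s_next, o] + beta * Q[s_next].max())
    Q[s, o] += lr * (target - Q[s, o])
    return Q
```

Setting \alpha = 1 recovers pure spectral exploration, while \alpha = 0 reduces to ordinary task reward; the empirical findings below suggest intermediate values work best.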

6. Empirical Findings: Exploration, Credit Assignment, and Scaling

Empirical evaluation demonstrates the utility, scalability, and limitations of eigenoption-based exploration:

  • Exploration: Deep Laplacian-based option discovery (DCEO) attains ≥40% more state coverage in 100 episodes than count-based or action-repeat baselines; DCEO reaches goals in ≈10,000 steps versus ≥30,000 for count-based or RND methods in tabular navigation (Klissarov et al., 2023).
  • Credit assignment: VAEO propagates credit up to 2× faster than bottleneck options in gridworlds, with eigenoptions accelerating Q-value updates over primitive actions (Kotamreddy et al., 12 Jul 2025).
  • Nonstationarity and adaptation: DCEO rapidly re-discovers new eigenoptions when goals/topology shift, recovering performance in ≤5,000 steps versus stagnation of count-based/RND baselines (Klissarov et al., 2023). EOC adapts more efficiently in both discrete and continuous tasks (Liu et al., 2017).
  • Deep RL scaling: In pixel-based Minigrid rooms and Atari Montezuma’s Revenge, DCEO and deep eigenoption approaches yield meaningful subgoal-reaching skills; e.g., 2,500 points in Montezuma’s Revenge in 200 million frames versus 2 billion for RND (Klissarov et al., 2023, Machado et al., 2017).
  • Termination sensitivity: Fixed-horizon termination stabilizes learning in deep RL; constant \beta termination induces high variance and poor credit propagation (Kotamreddy et al., 12 Jul 2025).

A plausible implication is that the design of option termination and the accuracy of online Laplacian/successor estimates critically impact the efficacy of eigenoption-driven learning.

7. Limitations, Open Challenges, and Ongoing Research

While eigenoptions offer robust exploration and hierarchy, challenges remain:

  • Online discovery bias: Early inaccurate spectral estimates bias exploration, occasionally degrading sample efficiency; estimation error bounds are tied to the sample size and spectral gap (Kotamreddy et al., 12 Jul 2025).
  • Termination function design: Poorly chosen termination rates destabilize option-value learning and impede credit assignment in deep RL (Kotamreddy et al., 12 Jul 2025).
  • Continuous domains: Approximation schemes like Nyström extend eigenoptions beyond discrete graphs but rely on suitable anchor selection and interpolation accuracy (Liu et al., 2017).
  • Integration with extrinsic reward: Combining spectral exploration rewards with task objectives requires careful balancing via the mixing parameter \alpha, with empirical results favoring moderate mixing (Liu et al., 2017).

Future research is investigating compositional planning, option transfer across tasks, and integration of learned spectral skills into end-to-end policy architectures. Theoretical analysis of regret, sample complexity, and convergence for deep eigenoption methods remains an open direction.
