
Latent Action Q-Learning

Updated 16 May 2026
  • LAQ is an offline reinforcement learning method that recovers optimal value functions from state-only data by mining discrete latent actions.
  • It employs a conditional forward model with an EM-style reconstruction loss to assign latent actions while preserving true value ranking.
  • LAQ facilitates accelerated policy learning, enhanced controller guidance, and effective cross-embodiment transfer across diverse high-dimensional domains.

Latent Action Q-Learning (LAQ) is an offline reinforcement learning (RL) method designed to recover high-quality value functions from state-only experience, where action labels are never observed. The central premise is that one can mine a discrete refinement of the hidden action space from undirected state-transition data, apply Q-learning to these "latent actions", and still recover the optimal value function. The recovered values then support downstream applications such as accelerated policy learning, controller guidance, and embodiment transfer across agents and domains (Chang et al., 2022).

1. Problem Setting and Motivation

LAQ addresses the problem of learning value functions in finite Markov decision processes (MDPs) $(S, A, p, \gamma)$ from offline datasets consisting solely of state-transition-reward triplets $(s_t, s_{t+1}, r_t)$, with the intervening actions $a_t$ unlabeled and unknown. This setting generalizes scenarios such as:

  • Learning from video demonstrations
  • Random or suboptimal exploration data
  • Cross-embodiment offline experiences

No assumption is made about optimality or intent in the collected data, and the sole goal is to learn a state-value function $V(s)$ that correlates with the optimal $V^*(s)$, despite missing true action labels.

2. Theoretical Foundation: Action-Space Refinement and Value Preservation

A pivotal theoretical result supporting LAQ is the preservation of optimal value functions under refinement of the action space. Given a true MDP $M = (S, A, p, \gamma)$ and a refinement $\hat M = (S, \hat A, \hat p, \gamma)$, where each $\hat a \in \hat A$ exactly matches the dynamics and reward of some $a \in A$ and each $a \in A$ is covered by at least one $\hat a \in \hat A$, the theorem states:

$$V^*_{\hat M}(s) = V^*_M(s) \quad \text{for all } s \in S.$$

The proof leverages "fundamental action classes" (sets of actions with identical transition-reward kernels) and demonstrates that for any policy on $M$ there exists a mirrored policy on $\hat M$ (and vice versa), and thus Q-learning converges to the same optimal value function $V^*$ in both spaces. This establishes that if one can assign any refinement of the hidden actions to observed $(s_t, s_{t+1}, r_t)$ transitions, running Q-learning on these pseudo-labels recovers true optimal value functions (Chang et al., 2022).
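The refinement property can be checked numerically on a toy problem. The sketch below is illustrative only: the three-state chain, the action names, and the `refinement` mapping are assumptions made for this example, not taken from the paper. It runs value iteration on a small deterministic MDP and on a refined copy in which one true action is duplicated under two latent names.

```python
# Hypothetical 3-state chain MDP; every name below is an illustrative assumption.
GAMMA = 0.9

# Deterministic model: model[state][action] = (next_state, reward).
# State 2 is absorbing and yields zero reward.
true_model = {
    0: {"left": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 1.0)},
    2: {"left": (2, 0.0), "right": (2, 0.0)},
}

# Refinement: each latent action copies the dynamics and reward of one true
# action, and every true action is covered by at least one latent action.
refinement = {"z0": "left", "z1": "right", "z2": "right"}
refined_model = {
    s: {z: acts[a] for z, a in refinement.items()}
    for s, acts in true_model.items()
}


def value_iteration(model, gamma=GAMMA, sweeps=200):
    """Synchronous value iteration on a deterministic tabular model."""
    v = {s: 0.0 for s in model}
    for _ in range(sweeps):
        v = {
            s: max(r + gamma * v[s2] for (s2, r) in acts.values())
            for s, acts in model.items()
        }
    return v


v_true = value_iteration(true_model)
v_refined = value_iteration(refined_model)
print(v_true)
print(v_refined)
```

Both runs take the maximum over the same set of one-step outcomes in every state, so the two value tables coincide; duplicating an action adds no new behaviour for the maximization to exploit.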

3. Mining Latent Actions via Conditional State Prediction

LAQ constructs a discrete latent action set $\hat A$ by fitting a conditional forward model:

$$\hat s_{t+1} = f_\theta(s_t, \hat a), \quad \hat a \in \hat A.$$

This model predicts $s_{t+1}$ from $s_t$ and a candidate latent action $\hat a$. The objective is an EM-style reconstruction loss:

$$L(\theta) = \sum_{(s_t, s_{t+1})} \min_{\hat a \in \hat A} \ell\big(f_\theta(s_t, \hat a), s_{t+1}\big),$$

where $\ell$ is a reconstruction loss between predicted and observed next states.

Training alternates between:

  • Assigning to each $(s_t, s_{t+1})$ pair the latent action $\hat a_t = \arg\min_{\hat a \in \hat A} \ell\big(f_\theta(s_t, \hat a), s_{t+1}\big)$
  • Updating $\theta$ by gradient descent given the current assignments

Architectures vary: MLPs for low-dimensional states; convolutional encoder-decoders for images. Each latent action learns to best "explain" observed state transitions, in effect yielding a data-driven refinement of the (hidden) ground-truth action space.
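The alternating procedure can be sketched on a one-dimensional toy problem. This is a deliberately simplified illustration, not the architecture described above: the forward model is assumed to be a per-action displacement, `forward(theta, s, a_hat) = s + theta[a_hat]`, the loss is squared error, and the dataset, noise level, initial `theta`, and learning rate are all hypothetical choices.

```python
import random

rng = random.Random(0)
HIDDEN_DELTAS = [-1.0, 1.0]  # true action effects, never shown to the learner

# State-only dataset of (s_t, s_{t+1}) pairs generated by hidden actions.
data = []
for _ in range(200):
    s = rng.uniform(-5.0, 5.0)
    s_next = s + rng.choice(HIDDEN_DELTAS) + rng.gauss(0.0, 0.05)
    data.append((s, s_next))


def forward(theta, s, a_hat):
    """Toy forward model: each latent action adds a learned displacement."""
    return s + theta[a_hat]


def assign(theta, s, s_next):
    """Assignment step: pick the latent action with the lowest squared error."""
    return min(range(len(theta)),
               key=lambda a: (forward(theta, s, a) - s_next) ** 2)


def fit(data, theta, steps=30, lr=0.25):
    """Alternate between assigning latent actions and a gradient step on theta."""
    theta = list(theta)
    for _ in range(steps):
        labels = [assign(theta, s, s2) for s, s2 in data]
        for a in range(len(theta)):
            errs = [forward(theta, s, a) - s2
                    for (s, s2), z in zip(data, labels) if z == a]
            if errs:  # skip latent actions that explain no transition
                grad = 2.0 * sum(errs) / len(errs)  # d/dtheta of mean squared error
                theta[a] -= lr * grad
    return theta, [assign(theta, s, s2) for s, s2 in data]


theta, labels = fit(data, theta=[-0.1, 0.1])
print(theta)
```

With this displacement model the procedure reduces to clustering state differences; a learned network conditioned on $s_t$ would be needed whenever an action's effect depends on the current state.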

4. LAQ Algorithmic Workflow

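With the latent forward model fit, the dataset is relabeled as $(s_t, \hat a_t, s_{t+1}, r_t)$. Standard offline Q-learning is then applied:

$$Q(s_t, \hat a_t) \leftarrow Q(s_t, \hat a_t) + \alpha \big[ r_t + \gamma \max_{\hat a' \in \hat A} Q(s_{t+1}, \hat a') - Q(s_t, \hat a_t) \big]$$

Value estimation follows $V(s) = \max_{\hat a \in \hat A} Q(s, \hat a)$. Variants with DQN or BCQ are directly compatible for high-dimensional or continuous state spaces.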

High-level pseudocode:

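Step | Detail | Purpose
Fit forward model $f_\theta$ | EM-style minimization of reconstruction loss | Uncover latent actions
Label latent actions | $\hat a_t = \arg\min_{\hat a \in \hat A} \ell\big(f_\theta(s_t, \hat a), s_{t+1}\big)$ | Assign to transitions
Q-learning update | On $(s_t, \hat a_t, s_{t+1}, r_t)$ quadruples | Learn value function
Value extraction | $V(s) = \max_{\hat a \in \hat A} Q(s, \hat a)$ | Output for downstream RL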
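The last two steps can be illustrated with a tabular sketch. The five-state chain, the learning rate `ALPHA`, and the hand-assigned latent ids (which stand in for the output of a fitted forward model) are assumptions for this example; high-dimensional domains would use DQN or BCQ rather than a table.

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.5
N = 5  # states 0..4; state 4 is absorbing


def step(s, move):
    """Hypothetical chain dynamics: reward 1 on first entering the last state."""
    if s == N - 1:
        return s, 0.0
    s2 = min(max(s + move, 0), N - 1)
    return s2, (1.0 if s2 == N - 1 else 0.0)


# Relabeled dataset of (s_t, a_hat_t, s_{t+1}, r_t) quadruples.
# Latent ids 0 and 1 are assigned by hand here instead of by a forward model.
dataset = []
for s in range(N):
    for a_hat, move in enumerate((-1, +1)):
        s2, r = step(s, move)
        dataset.append((s, a_hat, s2, r))


def q_learning(dataset, n_latent, epochs=200):
    """Offline tabular Q-learning: repeated sweeps over a fixed dataset."""
    q = defaultdict(float)
    for _ in range(epochs):
        for s, a, s2, r in dataset:
            target = r + GAMMA * max(q[(s2, b)] for b in range(n_latent))
            q[(s, a)] += ALPHA * (target - q[(s, a)])
    return q


q = q_learning(dataset, n_latent=2)
values = [max(q[(s, b)] for b in range(2)) for s in range(N)]
print(values)
```

The extracted values decay by a factor of `GAMMA` per step of distance from the goal, which is the ordering a downstream method would rely on.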

5. Empirical Evaluation and Recovery of Value Functions

LAQ's recovered values are benchmarked via Spearman's rank correlation $\rho$ with reference value functions trained on ground-truth actions. Results across various domains:

Environment | LAQ $\rho$ | Ground Truth $\rho$
2D Grid World | ≈ 0.985 | ≈ 1.000
Atari Freeway (images) | ≈ 0.961 | ≈ 0.970
3D Visual Navigation | ≈ 0.927 | ≈ 0.991
Maze2D (continuous) | ≈ 0.844 | ≈ 0.851
FrankaKitchen Manipulation | ≈ 0.905 | ≈ 0.901

The data indicate that LAQ's value functions match the rank orderings given by the true action labels very closely. In all cases, LAQ outperforms clustering-based or "one-action" baselines, and approaches oracle performance (Chang et al., 2022).
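Spearman's $\rho$ depends only on how states are ordered, not on the scale of the values, which is why it suits comparing value functions learned under different action labelings. A minimal pure-Python sketch follows; the value lists are made-up numbers for illustration, not results from the paper.

```python
def average_ranks(xs):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values over four states.
v_ref = [0.729, 0.81, 0.9, 1.0]
v_same_order = [1.5, 1.9, 2.4, 3.0]  # different scale, identical ordering
v_swapped = [1.5, 2.4, 1.9, 3.0]     # two middle states swapped

rho_same = spearman_rho(v_ref, v_same_order)
rho_swapped = spearman_rho(v_ref, v_swapped)
print(rho_same, rho_swapped)
```

The rescaled values score a perfect correlation, while swapping one adjacent pair out of four states lowers it to 0.8.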

6. Downstream Applications

LAQ enables several practical outcomes using value functions learned solely from state-only data:

  • Reward Shaping for Sample Efficiency: Using $V(s)$ as a potential-based shaping term densifies sparse reward RL (so-called "densified RL"), resulting in 3–10× faster learning compared to using the sparse reward alone.
  • Low-Level Controller Guidance: Value functions derived from LAQ allow selection among a small set of primitive controllers by choosing the one whose one-step outcome $s'$ has the highest $V(s')$. This provides zero-shot navigation in high-dimensional domains (e.g., SPL = 0.82 versus 0.53 for naive strategies).
  • Cross-Embodiment Transfer: Learned $V(s)$ functions can be mapped to the state-space of a new embodiment (e.g., quadruped agent, robot arm), accelerating RL in the new agent by 2–5× relative to training from scratch.
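The first two applications reduce to short formulas once a value function is available. The sketch below assumes a hypothetical `value` function on a five-state chain and a hypothetical `predict_next` helper that returns a controller's one-step outcome; neither comes from the paper. The shaping term uses the standard potential-based form $r + \gamma V(s') - V(s)$.

```python
GAMMA = 0.9


def shaped_reward(r, s, s_next, value, gamma=GAMMA):
    """Potential-based shaping with V as the potential: r + gamma*V(s') - V(s)."""
    return r + gamma * value(s_next) - value(s)


def pick_controller(s, controllers, predict_next, value):
    """Choose the controller whose predicted one-step outcome has the highest value."""
    return max(controllers, key=lambda c: value(predict_next(s, c)))


# Hypothetical value function on a chain with the goal at state 4.
def value(s):
    return GAMMA ** (4 - s)


# Hypothetical primitive controllers and their one-step outcomes.
controllers = {"left": -1, "right": +1}


def predict_next(s, name):
    return min(max(s + controllers[name], 0), 4)


toward_goal = shaped_reward(0.0, 2, 3, value)  # step toward the goal
away_from_goal = shaped_reward(0.0, 2, 1, value)  # step away from the goal
choice = pick_controller(2, controllers, predict_next, value)
print(toward_goal, away_from_goal, choice)
```

Under this value function a step toward the goal earns a shaped reward of about zero and a step away is penalized, so the agent gets a learning signal on every transition even though the environment reward is sparse.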

7. Experimental Design and Comparative Metrics

Experiments span a wide range of high- and low-dimensional domains:

  • 2D GridWorld (tabular, sparse rewards)
  • Atari Freeway (84×84 image inputs, discrete actions)
  • Maze2D (continuous navigation)
  • 3D Visual Navigation (Habitat building scans)
  • FrankaKitchen (9-DOF manipulation, embodiment transfer with "hook" variant)

Baselines include:

  • Single-action labeling with TD(0)
  • K-means clustering on concatenated state and next-state pairs, or on state transitions
  • D3G (state-only Q-learning via generative modeling)
  • Behavior cloning, with or without RL fine-tuning
  • Prior inverse model labeling methods

Key metrics:

  • Spearman's rank correlation $\rho$ to $V^*$
  • Mean squared error to tabular $V^*$ (toy MDPs)
  • Sample complexity (return vs. interactions)
  • SPL (Success weighted by Path Length) for navigation
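SPL, as commonly defined for navigation benchmarks, averages over episodes the success indicator multiplied by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. A small sketch, with made-up episode data:

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_path_length, agent_path_length).
    """
    episodes = list(episodes)
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes: an optimal success, a success with a path twice
# as long as the shortest one, and a failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
score = spl(episodes)
print(score)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Failures contribute zero regardless of path length, so the metric rewards reaching the goal and reaching it efficiently.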

LAQ's value rankings stay close to those of Q-learning with ground-truth actions, with Spearman correlations of ≈ 0.985 in grid worlds, ≈ 0.961 in Freeway, and ≈ 0.927 in 3D navigation. Densified RL using LAQ accelerates convergence by 3–10× compared to sparse reward RL, often outperforming behavior cloning with RL fine-tuning. Cross-embodiment transfer using LAQ values accelerates RL by factors of 2–5 compared to baseline agents.


LAQ thus demonstrates that value function learning is not fundamentally dependent on access to true action labels. By extracting a data-driven, latent refinement of the underlying action space, it becomes feasible to recover value functions sufficient for a variety of high-level RL applications, even in settings involving high-dimensional observations and cross-embodiment transfer (Chang et al., 2022).

References (1)
