Latent Action Q-Learning
- LAQ is an offline reinforcement learning method that recovers optimal value functions from state-only data by mining discrete latent actions.
- It employs a conditional forward model with an EM-style reconstruction loss to assign latent actions while preserving true value ranking.
- LAQ facilitates accelerated policy learning, enhanced controller guidance, and effective cross-embodiment transfer across diverse high-dimensional domains.
Latent Action Q-Learning (LAQ) is an offline reinforcement learning (RL) method designed to recover high-quality value functions from state-only experience, in which action labels are never observed. The central premise is that by mining a discrete refinement of the hidden action space from undirected state-transition data, one can apply Q-learning to these "latent actions" and still recover the optimal value function. The learned values then support downstream applications such as accelerated policy learning, controller guidance, and transfer across embodiments and domains (Chang et al., 2022).
1. Problem Setting and Motivation
LAQ addresses the problem of learning value functions in finite Markov decision processes (MDPs) from offline datasets consisting solely of state-transition-reward triplets $(s, s', r)$, with the intervening actions unlabeled and unknown. This setting covers scenarios such as:
- Learning from video demonstrations
- Random or suboptimal exploration data
- Cross-embodiment offline experiences
No assumption is made about optimality or intent in the collected data. The sole goal is to learn a state-value function $V(s)$ that correlates with the optimal $V^*(s)$, despite the missing action labels.
2. Theoretical Foundation: Action-Space Refinement and Value Preservation
A pivotal theoretical result supporting LAQ is that optimal value functions are preserved under refinement of the action space. Given a true MDP $M$ with action space $A$ and a refinement $\hat{M}$ with action space $\hat{A}$, where each $\hat{a} \in \hat{A}$ exactly matches the dynamics and reward of some $a \in A$ and each $a \in A$ is covered by at least one $\hat{a} \in \hat{A}$, the theorem states:
$$V^*_{\hat{M}}(s) = V^*_{M}(s) \quad \text{for all states } s.$$
The proof leverages "fundamental action classes" (sets of actions with identical transition-reward kernels) and demonstrates that for any policy on $M$ there exists a mirrored policy on $\hat{M}$ (and vice versa), and thus Q-learning converges to the same $V^*$ in both spaces. This establishes that if one can assign any refinement of the hidden actions to observed $(s, s')$ transitions, running Q-learning on these pseudo-labels recovers the true optimal value function (Chang et al., 2022).
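The value-preservation claim can be checked numerically on a toy problem. The sketch below is an illustration only, not the construction used in the proof: it assumes a hypothetical three-state deterministic chain, builds a refinement by duplicating one action, and runs plain value iteration on both action sets. All names (`value_iteration`, `P_true`, `P_ref`) are made up for the example.

```python
def value_iteration(P, R, gamma=0.9, iters=500):
    """Value iteration for a deterministic MDP.

    P[a][s] is the next state and R[a][s] the reward when taking
    action a in state s. Returns the optimal state values as a list.
    """
    n_states = len(P[0])
    V = [0.0] * n_states
    for _ in range(iters):
        V = [
            max(R[a][s] + gamma * V[P[a][s]] for a in range(len(P)))
            for s in range(n_states)
        ]
    return V


# Hypothetical chain with states 0, 1, 2. State 2 is absorbing.
# Action 0 moves left, action 1 moves right; entering state 2 pays 1.
P_true = [[0, 0, 2], [1, 2, 2]]
R_true = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Refinement: "right" is split into two copies with identical dynamics
# and reward, so every refined action matches exactly one true action
# and every true action is covered.
P_ref = [P_true[0], P_true[1], P_true[1]]
R_ref = [R_true[0], R_true[1], R_true[1]]

V_true = value_iteration(P_true, R_true)
V_ref = value_iteration(P_ref, R_ref)
print(V_true, V_ref)
```

Both runs produce the same optimal values, because duplicating an action adds no new term that could change the maximum in the Bellman backup.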
3. Mining Latent Actions via Conditional State Prediction
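LAQ constructs a discrete latent action set $\hat{A}$ by fitting a conditional forward model:
$$f_\theta(s, \hat{a}) \approx s', \qquad \hat{a} \in \hat{A}.$$
This model predicts $s'$ from $s$ and a candidate latent action $\hat{a}$. The objective is an EM-style reconstruction loss:
$$\mathcal{L}(\theta) = \sum_{(s, s')} \min_{\hat{a} \in \hat{A}} \ell\big(f_\theta(s, \hat{a}),\, s'\big),$$
where $\ell$ is a per-transition reconstruction error.
Training alternates between:
- Assigning to each $(s, s')$ pair the latent action $\hat{a} = \arg\min_{\hat{a}' \in \hat{A}} \ell\big(f_\theta(s, \hat{a}'), s'\big)$
- Updating $\theta$ by gradient descent given the current assignments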
Architectures vary: MLPs for low-dimensional states; convolutional encoder-decoders for images. Each latent action learns to best "explain" observed state transitions, in effect yielding a data-driven refinement of the (hidden) ground-truth action space.
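The alternation between assignment and parameter updates can be sketched in a few lines. This is a toy illustration under strong assumptions, not the authors' implementation: states are scalars, the forward model takes the hypothetical form `f(s, k) = s + d[k]` with one learnable offset per latent action, and the loss is squared error. The function name `mine_latent_actions` and all hyperparameters are invented for the example; the networks described above would replace the offset table in practice.

```python
import random


def mine_latent_actions(transitions, n_latent=2, epochs=50, lr=0.25, seed=0):
    """Alternate hard assignment and gradient steps for a toy forward model.

    transitions: list of (s, s_next) pairs with scalar states.
    Returns the learned offsets d and the latent label of each transition.
    """
    rng = random.Random(seed)
    d = [rng.uniform(-0.1, 0.1) for _ in range(n_latent)]
    labels = [0] * len(transitions)
    for _ in range(epochs):
        # Assignment step: pick the latent action with the lowest
        # reconstruction error for each observed transition.
        labels = [
            min(range(n_latent), key=lambda k: (s + d[k] - s2) ** 2)
            for s, s2 in transitions
        ]
        # Update step: one gradient step on the mean squared error of
        # each offset, using only the transitions assigned to it.
        for k in range(n_latent):
            members = [t for t, a in zip(transitions, labels) if a == k]
            if members:
                grad = sum(2 * (s + d[k] - s2) for s, s2 in members) / len(members)
                d[k] -= lr * grad
    return d, labels


# Hypothetical state-only data: unlabeled moves of -1 or +1 on a line.
transitions = [(s, s + step) for s in range(5) for step in (-1, 1)]
offsets, labels = mine_latent_actions(transitions)
print(sorted(round(x, 3) for x in offsets), labels)
```

With this deliberately simple model the procedure reduces to clustering displacements, which is why it only illustrates the training loop. The learned labels need not match the true action indices; they only need to separate transitions with different dynamics.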
4. LAQ Algorithmic Workflow
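With the latent forward model fit, the dataset is relabeled as $\hat{D} = \{(s, \hat{a}, s', r)\}$. Standard offline Q-learning is then applied:
$$Q(s, \hat{a}) \leftarrow Q(s, \hat{a}) + \alpha \Big[ r + \gamma \max_{\hat{a}' \in \hat{A}} Q(s', \hat{a}') - Q(s, \hat{a}) \Big]$$
Here $\alpha$ is the learning rate and $\gamma$ the discount factor. Value estimation follows $V(s) = \max_{\hat{a} \in \hat{A}} Q(s, \hat{a})$. Variants with DQN or BCQ are directly compatible for high-dimensional or continuous state spaces.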
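A tabular version of this step might look as follows. The sketch assumes the dataset has already been relabeled with latent action indices, uses a hypothetical four-state chain, and treats a state with no outgoing transitions as terminal (its Q-values stay at the initial value of zero). The function name `offline_q_learning` and the hyperparameters are illustrative choices, not values from the source.

```python
from collections import defaultdict


def offline_q_learning(dataset, n_latent, gamma=0.9, alpha=0.5, sweeps=200):
    """Tabular Q-learning over a fixed set of (s, a_hat, s_next, r) tuples.

    Returns the Q-table and the state values V(s) = max over latent
    actions of Q(s, a_hat) for every state that appears as a source.
    """
    Q = defaultdict(float)  # unseen (state, action) pairs default to 0
    for _ in range(sweeps):
        for s, a, s2, r in dataset:
            target = r + gamma * max(Q[(s2, b)] for b in range(n_latent))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    states = {s for s, _, _, _ in dataset}
    V = {s: max(Q[(s, b)] for b in range(n_latent)) for s in states}
    return Q, V


# Hypothetical relabeled data on a chain 0-1-2-3 with goal state 3.
# Latent action 0 happens to mean "left" and 1 "right".
dataset = [
    (0, 0, 0, 0.0), (0, 1, 1, 0.0),
    (1, 0, 0, 0.0), (1, 1, 2, 0.0),
    (2, 0, 1, 0.0), (2, 1, 3, 1.0),
]
Q, V = offline_q_learning(dataset, n_latent=2)
print({s: round(v, 3) for s, v in sorted(V.items())})
```

In this toy dataset every latent action is observed in every non-terminal state. With sparser coverage, unseen pairs would keep their initial value, which is one reason conservative variants such as BCQ are attractive in the offline setting.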
High-level summary of the workflow:
| Step | Detail | Purpose |
|---|---|---|
| Fit forward model $f_\theta$ | Minimize the EM-style reconstruction loss | Uncover latent actions |
| Label latent actions | $\hat{a} = \arg\min_{\hat{a}'} \ell(f_\theta(s, \hat{a}'), s')$ | Assign to transitions |
| Q-learning update | On $(s, \hat{a}, s', r)$ quadruples | Learn value function |
| Value extraction | $V(s) = \max_{\hat{a}} Q(s, \hat{a})$ | Output for downstream RL |
5. Empirical Evaluation and Recovery of Value Functions
LAQ's recovered values are benchmarked via Spearman's rank correlation $\rho$ with reference value functions trained on ground-truth actions. Results across various domains:
| Environment | LAQ $\rho$ | Ground Truth $\rho$ |
|---|---|---|
| 2D Grid World | ≈ 0.985 | ≈ 1.000 |
| Atari Freeway (images) | ≈ 0.961 | ≈ 0.970 |
| 3D Visual Navigation | ≈ 0.927 | ≈ 0.991 |
| Maze2D (continuous) | ≈ 0.844 | ≈ 0.851 |
| FrankaKitchen Manipulation | ≈ 0.905 | ≈ 0.901 |
The data indicate that LAQ's value functions match the rank orderings given by the true action labels very closely. In all cases, LAQ outperforms clustering-based or "one-action" baselines, and approaches oracle performance (Chang et al., 2022).
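Because the evaluation depends only on how states are ordered, Spearman's rank correlation is insensitive to any monotone rescaling of the learned values. The following sketch computes it from scratch, using average ranks for ties. The value lists are hypothetical numbers chosen for illustration and are not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for four states: a reference and a learned estimate
# that is on a different scale and swaps the order of two states.
v_ref = [0.1, 0.4, 0.5, 0.9]
v_learned = [1.0, 3.0, 2.5, 7.0]
rho = spearman(v_ref, v_learned)
print(rho)  # 0.8
```

A learned value function that differs from the reference in scale but ranks all states identically would score exactly 1.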
6. Downstream Applications
LAQ enables several practical outcomes using value functions learned solely from state-only data:
- Reward Shaping for Sample Efficiency: Using $V(s)$ as a potential-based shaping term densifies sparse-reward RL (so-called "densified RL"), resulting in 3–10× faster learning compared to using the sparse reward alone.
- Low-Level Controller Guidance: Value functions derived from LAQ allow selection among a small set of primitive controllers by choosing the one whose one-step outcome $s'$ has the highest $V(s')$. This provides zero-shot navigation in high-dimensional domains (e.g., SPL = 0.82 versus 0.53 for naive strategies).
- Cross-Embodiment Transfer: Learned $V(s)$ functions can be mapped to the state space of a new embodiment (e.g., quadruped agent, robot arm), accelerating RL in the new agent by 2–5× relative to training from scratch.
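The first two uses can be sketched directly from a learned value table. The shaping function uses the standard potential-based form, adding the discounted value of the next state minus the value of the current state to the environment reward. The controller picker assumes, purely for illustration, that each controller's one-step outcome can be predicted by a function of the current state. The names `shaped_reward`, `pick_controller`, and the value table are hypothetical.

```python
def shaped_reward(r, s, s_next, V, gamma=0.99):
    """Potential-based shaping with the learned V as the potential."""
    return r + gamma * V[s_next] - V[s]


def pick_controller(s, controllers, V):
    """Return the name of the controller whose predicted one-step
    outcome from state s has the highest learned value."""
    return max(controllers, key=lambda name: V[controllers[name](s)])


# Hypothetical learned values on a three-state line; higher is closer
# to the goal.
V = {0: 0.5, 1: 0.7, 2: 0.9}

# Hypothetical primitive controllers, modeled as next-state predictors.
controllers = {
    "left": lambda s: max(s - 1, 0),
    "right": lambda s: min(s + 1, 2),
}

bonus = shaped_reward(0.0, 0, 1, V, gamma=1.0)
choice = pick_controller(1, controllers, V)
print(round(bonus, 3), choice)
```

The shaped reward is positive whenever a step moves toward higher-value states, which supplies a learning signal even when the environment reward is zero.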
7. Experimental Design and Comparative Metrics
Experiments span a wide range of high- and low-dimensional domains:
- 2D GridWorld (tabular, sparse rewards)
- Atari Freeway (84×84 image inputs, discrete actions)
- Maze2D (continuous navigation)
- 3D Visual Navigation (Habitat building scans)
- FrankaKitchen (9-DOF manipulation, embodiment transfer with "hook" variant)
Baselines include:
- Single-action labeling with TD(0)
- K-means clustering on concatenated state and next-state pairs, or on state transitions
- D3G (state-only Q-learning via generative modeling)
- Behavior cloning, with or without RL fine-tuning
- Prior inverse model labeling methods
Key metrics:
- Spearman's rank correlation $\rho$ to $V^*$
- Mean squared error to tabular $V^*$ (toy MDPs)
- Sample complexity (return vs. interactions)
- SPL (Success weighted by Path Length) for navigation
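For reference, SPL averages over episodes the success indicator multiplied by the ratio of the shortest-path length to the larger of the shortest-path length and the length of the path actually taken. A minimal sketch, assuming each episode is summarized as a hypothetical `(success, shortest_len, path_len)` tuple with a strictly positive shortest-path length:

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_len, path_len) tuples, where
    shortest_len > 0 is the optimal path length and path_len is the
    length of the path the agent took.
    """
    episodes = list(episodes)
    total = sum(
        float(success) * shortest / max(path, shortest)
        for success, shortest, path in episodes
    )
    return total / len(episodes)


# Hypothetical episodes: one optimal success, one success with a path
# twice as long as necessary, and one failure.
score = spl([(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)])
print(score)  # 0.5
```

A failed episode contributes zero regardless of path length, and a successful one is discounted by how much longer its path was than the optimum.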
LAQ attains Spearman correlations with ground-truth Q-learning values of about 0.985 in grid worlds, 0.961 in Freeway, and 0.927 in 3D navigation. Densified RL using LAQ accelerates convergence by 3–10× compared to sparse-reward RL, often outperforming behavior cloning with RL fine-tuning. Cross-embodiment transfer using LAQ values accelerates RL by factors of 2–5 compared to baseline agents.
LAQ thus demonstrates that value function learning is not fundamentally dependent on access to true action labels. By extracting a data-driven, latent refinement of the underlying action space, it becomes feasible to recover value functions sufficient for a variety of high-level RL applications, even in settings involving high-dimensional observations and cross-embodiment transfer (Chang et al., 2022).