Latent Action Q-Learning

Updated 16 May 2026

LAQ is an offline reinforcement learning method that recovers optimal value functions from state-only data by mining discrete latent actions.
It employs a conditional forward model with an EM-style reconstruction loss to assign latent actions while preserving true value ranking.
LAQ facilitates accelerated policy learning, enhanced controller guidance, and effective cross-embodiment transfer across diverse high-dimensional domains.

Latent Action Q-Learning (LAQ) is an offline reinforcement learning (RL) method designed to recover high-quality value functions from state-only experience, where action labels are never observed. The central premise is that by mining a discrete refinement of the hidden action space from undirected state-transition data, one can apply Q-learning to these "latent actions" and still recover the optimal value function. The recovered values then support downstream applications such as accelerated policy learning, controller guidance, and transfer across embodiments and domains (Chang et al., 2022).

1. Problem Setting and Motivation

LAQ addresses the problem of learning value functions in finite Markov decision processes (MDPs) $(S, A, p, \gamma)$ from offline datasets consisting solely of state-transition-reward triplets $(s_t, s_{t+1}, r_t)$, with the intervening actions $a_t$ unlabeled and unknown. This setting covers scenarios such as:

Learning from video demonstrations
Random or suboptimal exploration data
Cross-embodiment offline experiences

No assumption is made about optimality or intent in the collected data, and the sole goal is to learn a state-value function $V(s)$ that correlates with the optimal $V^*(s)$, despite missing true action labels.

2. Theoretical Foundation: Action-Space Refinement and Value Preservation

A pivotal theoretical result supporting LAQ is the preservation of optimal value functions under refinement of the action space. Given a true MDP $M = (S, A, p, \gamma)$ and a refinement $\hat M = (S, \hat A, \hat p, \gamma)$, where each $\hat a \in \hat A$ exactly matches the dynamics and reward of some $a \in A$ and each $a \in A$ is covered by at least one $\hat a \in \hat A$, the theorem states:

$$V^*_{\hat M}(s) = V^*_M(s) \quad \text{for all } s \in S$$

The proof leverages "fundamental action classes" (sets of actions with identical transition-reward kernels) and demonstrates that for any policy on $M$ there exists a mirrored policy on $\hat M$ (and vice versa), so Q-learning converges to the same $V^*$ in both spaces. This establishes that if one can assign any refinement of the hidden actions to observed $(s_t, s_{t+1})$ transitions, running Q-learning on these pseudo-labels recovers the true optimal value function (Chang et al., 2022).
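The value-preservation claim can be checked on a toy problem. The sketch below is illustrative only: the three-state chain, the action names `left`/`right`, and the latent names `z0`..`z2` are hypothetical and not taken from the paper. It runs value iteration on a small deterministic MDP and on a refinement in which one true action is duplicated, and the two optimal value functions coincide.

```python
def value_iteration(transitions, gamma=0.9, sweeps=200):
    """transitions[s][a] = (next_state, reward) for a deterministic toy MDP."""
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        # Synchronous Bellman optimality backup over all states.
        values = {
            s: max(r + gamma * values[s2] for (s2, r) in acts.values())
            for s, acts in transitions.items()
        }
    return values


# Hypothetical chain: moving right from state 1 pays reward 1, state 2 is absorbing.
true_mdp = {
    0: {"left": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 1.0)},
    2: {"left": (2, 0.0), "right": (2, 0.0)},
}

# Refinement: z0 copies "left"; z1 and z2 both copy "right".
# Every latent action matches one true action, and every true action is covered.
copy_of = {"z0": "left", "z1": "right", "z2": "right"}
refined_mdp = {
    s: {z: acts[a] for z, a in copy_of.items()} for s, acts in true_mdp.items()
}

v_true = value_iteration(true_mdp)
v_refined = value_iteration(refined_mdp)
print(v_true, v_refined)
```

Duplicating an action only adds a repeated term inside the `max`, which is why the backup, and hence the fixed point, is unchanged.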

3. Mining Latent Actions via Conditional State Prediction

LAQ constructs a discrete latent action set $\hat A$ by fitting a conditional forward model:

$$\hat s_{t+1} = f_\theta(s_t, \hat a)$$

This model predicts $s_{t+1}$ from $s_t$ and a candidate latent action $\hat a \in \hat A$. The objective is an EM-style reconstruction loss:

$$\mathcal{L}(\theta) = \sum_{(s_t, s_{t+1})} \min_{\hat a \in \hat A} \ell\big(f_\theta(s_t, \hat a),\, s_{t+1}\big)$$

Training alternates between:

Assigning to each $(s_t, s_{t+1})$ pair the latent action $\hat a_t = \arg\min_{\hat a \in \hat A} \ell\big(f_\theta(s_t, \hat a),\, s_{t+1}\big)$
Updating $\theta$ by gradient descent given the current assignments

Architectures vary: MLPs for low-dimensional states; convolutional encoder-decoders for images. Each latent action learns to best "explain" observed state transitions, in effect yielding a data-driven refinement of the (hidden) ground-truth action space.
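The alternation between assignment and gradient updates can be sketched on a one-dimensional toy problem. Everything here is an assumption made for illustration: the forward model is a single learned offset per latent action (`f(s, a) = s + offsets[a]`), the loss is squared error, and the data come from a hidden action that moves the state by -1 or +1. The architectures used in practice (MLPs, convolutional encoder-decoders) are far richer.

```python
def mine_latent_actions(pairs, num_latent=2, epochs=200, lr=0.1):
    """Alternate hard assignment and gradient steps on a toy forward model.

    Assumed model: f(s, a) = s + offsets[a], one scalar offset per latent action.
    """
    # Spread the initial offsets so the latent actions start out distinct.
    offsets = [-0.5 + i / max(num_latent - 1, 1) for i in range(num_latent)]
    labels = [0] * len(pairs)
    for _ in range(epochs):
        # Assignment step: pick the latent action with the lowest squared error.
        labels = [
            min(range(num_latent), key=lambda a: (s + offsets[a] - s2) ** 2)
            for s, s2 in pairs
        ]
        # Update step: one gradient step on the mean squared error per latent.
        grads = [0.0] * num_latent
        counts = [0] * num_latent
        for (s, s2), a in zip(pairs, labels):
            grads[a] += 2.0 * (s + offsets[a] - s2)
            counts[a] += 1
        for a in range(num_latent):
            if counts[a]:
                offsets[a] -= lr * grads[a] / counts[a]
    return offsets, labels


# Hypothetical state-only data: the hidden action moves the state by -1 or +1.
states = [3.0, 5.0, 2.0, 7.0, 4.0, 6.0]
hidden_moves = [-1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
pairs = [(s, s + m) for s, m in zip(states, hidden_moves)]

offsets, labels = mine_latent_actions(pairs)
print(offsets, labels)
```

The latent indices carry no meaning beyond grouping transitions; any consistent relabeling of them would serve equally well as pseudo-labels for Q-learning.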

4. LAQ Algorithmic Workflow

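With the latent forward model fit, the dataset is relabeled as $(s_t, \hat a_t, s_{t+1}, r_t)$. Standard offline Q-learning is then applied:

$$Q(s_t, \hat a_t) \leftarrow Q(s_t, \hat a_t) + \alpha \big[ r_t + \gamma \max_{\hat a'} Q(s_{t+1}, \hat a') - Q(s_t, \hat a_t) \big]$$

Value estimation follows $V(s) = \max_{\hat a} Q(s, \hat a)$. Variants with DQN or BCQ are directly compatible for high-dimensional or continuous state spaces.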
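A tabular version of this step might look as follows. This is a minimal sketch under stated assumptions: the function names, the step size `alpha`, and the toy quadruples (state, latent action, next state, reward) are hypothetical, and unseen state-action pairs default to a value of zero, which is only harmless here because the toy rewards are non-negative.

```python
from collections import defaultdict


def latent_q_learning(quadruples, num_latent, gamma=0.9, alpha=0.5, sweeps=200):
    """Tabular Q-learning over (state, latent_action, next_state, reward) tuples."""
    q = defaultdict(float)  # keyed by (state, latent_action); unseen pairs are 0.0
    for _ in range(sweeps):
        for s, a_hat, s_next, r in quadruples:
            target = r + gamma * max(q[(s_next, b)] for b in range(num_latent))
            q[(s, a_hat)] += alpha * (target - q[(s, a_hat)])
    return q


def state_values(q, states, num_latent):
    """V(s) is the maximum of Q(s, latent_action) over latent actions."""
    return {s: max(q[(s, b)] for b in range(num_latent)) for s in states}


# Hypothetical relabeled data on a 3-state chain; state 2 is absorbing.
quadruples = [
    (0, 0, 0, 0.0), (0, 1, 1, 0.0),
    (1, 0, 0, 0.0), (1, 1, 2, 1.0),
    (2, 0, 2, 0.0), (2, 1, 2, 0.0),
]
q = latent_q_learning(quadruples, num_latent=2)
values = state_values(q, states=[0, 1, 2], num_latent=2)
print(values)
```

Repeated sweeps over the fixed dataset stand in for sampling minibatches from a replay buffer; no environment interaction takes place.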

High-level pseudocode:

Step	Detail	Purpose
Fit forward model $f_\theta$	EM-style min reconstruction loss	Uncover latent actions
Label latent actions	$\hat a_t = \arg\min_{\hat a} \ell(f_\theta(s_t, \hat a), s_{t+1})$	Assign to transitions
Q-learning update	On $(s_t, \hat a_t, s_{t+1}, r_t)$ quadruples	Learn value function
Value extraction	$V(s) = \max_{\hat a} Q(s, \hat a)$	Output for downstream RL

5. Empirical Evaluation and Recovery of Value Functions

LAQ's recovered values are benchmarked via Spearman's rank correlation $\rho$ with reference value functions trained on ground-truth actions. Results across various domains:

Environment	LAQ $\rho$	Ground Truth $\rho$
2D Grid World	≈ 0.985	≈ 1.000
Atari Freeway (images)	≈ 0.961	≈ 0.970
3D Visual Navigation	≈ 0.927	≈ 0.991
Maze2D (continuous)	≈ 0.844	≈ 0.851
FrankaKitchen Manipulation	≈ 0.905	≈ 0.901

The data indicate that LAQ's value functions match the rank orderings given by the true action labels very closely. In all cases, LAQ outperforms clustering-based or "one-action" baselines, and approaches oracle performance (Chang et al., 2022).
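For reference, Spearman's rank correlation is the Pearson correlation of the two rank vectors, with tied values receiving their average rank. The sketch below is a plain-Python illustration of that definition; the function names and the example value lists are hypothetical and not drawn from the reported experiments.

```python
def average_ranks(values):
    """1-based ranks, with ties sharing the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation computed on the rank vectors of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical learned values vs. reference values over four states.
learned = [0.1, 0.5, 0.9, 2.0]
reference = [1.0, 10.0, 100.0, 1000.0]
rho_same_order = spearman_rho(learned, reference)
rho_one_swap = spearman_rho([1, 2, 3, 4], [1, 3, 2, 4])
print(rho_same_order, rho_one_swap)
```

Because only the ordering matters, a learned value function can differ from the reference in scale and still score close to 1, which suits settings where values are used to rank states.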

6. Downstream Applications

LAQ enables several practical outcomes using value functions learned solely from state-only data:

Reward Shaping for Sample Efficiency: Using $V(s)$ as a potential-based shaping term densifies sparse-reward RL (so-called "densified RL"), resulting in 3–10× faster learning compared to using the sparse reward alone.
Low-Level Controller Guidance: Value functions derived from LAQ allow selection among a small set of primitive controllers by choosing the one whose one-step outcome $s'$ has the highest $V(s')$. This provides zero-shot navigation in high-dimensional domains (e.g., SPL = 0.82 versus 0.53 for naive strategies).
Cross-Embodiment Transfer: Learned $V(s)$ functions can be mapped to the state space of a new embodiment (e.g., quadruped agent, robot arm), accelerating RL in the new agent by 2–5× relative to training from scratch.
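The first two uses reduce to a few lines once a value function is available. The sketch below assumes the standard potential-based shaping form $r + \gamma V(s') - V(s)$ and a hypothetical dictionary of predicted one-step outcomes per controller; the exact shaping term and controller interface used in the paper may differ, and all names here are illustrative.

```python
def shaped_reward(reward, value_s, value_next, gamma=0.99):
    """Standard potential-based shaping: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_s


def pick_controller(predicted_next_states, value_fn):
    """Return the controller whose predicted one-step outcome has the highest value."""
    return max(
        predicted_next_states,
        key=lambda name: value_fn(predicted_next_states[name]),
    )


# Hypothetical learned values over three discrete states.
learned_values = {0: 0.1, 1: 0.4, 2: 0.9}

# Sparse reward of 0 on a step from state 1 to state 2 becomes a positive signal.
dense = shaped_reward(0.0, learned_values[1], learned_values[2], gamma=0.9)

# Hypothetical controllers mapped to the state each is predicted to reach.
outcomes = {"forward": 2, "turn_left": 0, "turn_right": 1}
choice = pick_controller(outcomes, learned_values.get)
print(dense, choice)
```

Shaping leaves the task reward in place and only adds a difference of potentials, so the agent still learns from environment interaction while receiving denser feedback.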

7. Experimental Design and Comparative Metrics

Experiments span a wide range of high- and low-dimensional domains:

2D GridWorld (tabular, sparse rewards)
Atari Freeway (84×84 image inputs, discrete actions)
Maze2D (continuous navigation)
3D Visual Navigation (Habitat building scans)
FrankaKitchen (9-DOF manipulation, embodiment transfer with "hook" variant)

Baselines include:

Single-action labeling with TD(0)
K-means clustering on concatenated state and next-state pairs, or on state transitions
D3G (state-only Q-learning via generative modeling)
Behavior cloning, with or without RL fine-tuning
Prior inverse model labeling methods

Key metrics:

Spearman's rank correlation $\rho$ to $V^*(s)$
Mean squared error to tabular $V^*(s)$ (toy MDPs)
Sample complexity (return vs. interactions)
SPL (Success weighted by Path Length) for navigation
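SPL averages, over episodes, a success indicator multiplied by the ratio of the shortest-path length to the longer of the shortest-path length and the length of the path the agent actually took. The sketch below is a direct transcription of that standard definition; the episode tuples are hypothetical.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_path_length, agent_path_length).
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # Efficiency is capped at 1 when the agent matches the shortest path.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes: one optimal success, one success with a 2x detour, one failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
score = spl(episodes)
print(score)
```

A failed episode contributes zero regardless of path length, so SPL penalizes both failures and inefficient successes.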

LAQ attains rank correlations of roughly 0.98 in grid worlds, 0.96 in Freeway, and 0.93 in 3D navigation, close to the values obtained by Q-learning with ground-truth actions. Densified RL using LAQ accelerates convergence by 3–10× compared to sparse-reward RL, often outperforming behavior cloning with RL fine-tuning. Cross-embodiment transfer using LAQ values accelerates RL by factors of 2–5 compared to baseline agents.

LAQ thus demonstrates that value function learning is not fundamentally dependent on access to true action labels. By extracting a data-driven, latent refinement of the underlying action space, it becomes feasible to recover value functions sufficient for a variety of high-level RL applications, even in settings involving high-dimensional observations and cross-embodiment transfer (Chang et al., 2022).

References

Chang et al. (2022). Learning Value Functions from Undirected State-only Experience.
