Task-Agnostic Latent Action Space (CLASP)

Last updated: June 11, 2025

A task-agnostic latent action space is a representation learned directly from visual observations that encodes an agent's possible actions, without requiring access to explicit action labels, task specifications, or goals during the learning phase. In the CLASP (Composable Learned Action Space Predictor) framework, this is achieved by observing an agent interacting with its environment (e.g., via video) and using a generative model to parse out only those aspects of scene changes that carry "action-like" dynamics.


Implementing CLASP: Learning a Task-Agnostic Latent Action Space

1. Stochastic Video Prediction Model

At its core, CLASP employs a variational, stochastic video prediction architecture. Here is how this works in practice:

  • Input: A sequence of images $x_{1:T}$ representing the agent's interaction with the environment.
  • Encoder/inference network: For each $t$, infer a latent $z_t$ from $(x_{t-1}, x_t)$. This latent is expected to capture the change: essentially, what "action" caused the scene to transform from $x_{t-1}$ to $x_t$.
  • Generative model: Predict $x_t$ from $x_{t-1}$ and $z_t$ via a decoder.

Minimality constraint: The latent $z_t$ is regularized (see below) so it contains only information related to scene change, not static appearance.

Composability: A learned composition function $g(z_i, z_j)$ is trained such that composing two action latents mimics applying their underlying physical actions sequentially (a sketch of $g$ follows the prediction-loss code below). This compositional property ensures the latent is action-structured, not entangled with static content.

Implementation sketch:

import torch
import torch.nn.functional as F

def predict_next_frame_loss(x_tm1, x_t, encoder, decoder, kl_beta):
    # Infer the latent "action" z_t from the frame pair (reparameterization trick).
    z_mu, z_logvar = encoder(x_tm1, x_t)
    z_t = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)
    # Predict the next frame from the previous frame and the latent.
    x_t_pred = decoder(x_tm1, z_t)
    pred_loss = F.mse_loss(x_t_pred, x_t)
    # KL to a standard normal prior implements the minimality bottleneck (beta_z).
    kl_loss = -0.5 * torch.mean(1 + z_logvar - z_mu.pow(2) - z_logvar.exp())
    return pred_loss + kl_beta * kl_loss
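
The composition function $g$ from the composability paragraph above is not shown in this sketch. Below is a minimal, hedged version (PyTorch assumed; the LatentComposer class, its MLP architecture, and the consistency check in the closing comment are illustrative choices, not the paper's exact design):

import torch
import torch.nn as nn

class LatentComposer(nn.Module):
    """Hypothetical composition head g(z_i, z_j): two step latents in, one composed latent out."""
    def __init__(self, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, z_i, z_j):
        return self.net(torch.cat([z_i, z_j], dim=-1))

# Training signal (informal): decoding the first frame with the composed latent
# should reproduce the frame reached after applying both actions in sequence,
# i.e. decoder(x_0, g(z_1, z_2)) should be close to x_2.

In Section 2 below, this head is trained through the composability loss rather than in isolation.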


2. Information Bottleneck and Loss Functions

CLASP extends a VAE with additional information bottleneck objectives. The key losses:

  • Prediction loss: Standard VAE ELBO where the objective is to reconstruct (predict) the next frame.
  • Minimality loss ($\beta_z$): Penalize mutual information between $z_t$ and $(x_{t-1}, x_t)$; keep $z_t$ as small as possible while allowing accurate prediction.
  • Composability loss ($\beta_\nu$): For sequences, enforce that $z_{1:T}$ can be composed into a "trajectory" latent $\nu$ via $g$, and that $\nu$ is minimally informative while still reconstructing end states. This forces the latent action structure to support composition.

Practical loss implementation (from the paper’s equations):

\[
\mathcal{L}^{pred}_{\theta,\phi}(\mathbf{x}_{1:T}) = \sum_{t=1}^{T} \Big[ \mathbb{E}_{q_\phi(\mathbf{z}_{2:t} \mid \mathbf{x}_{1:t})} \log p_\theta(x_t \mid \mathbf{x}_{1:t-1}, \mathbf{z}_{2:t}) - \beta_z \, D_{KL}\big(q_\phi(z_t \mid \mathbf{x}_{t-1:t}) \,\|\, p(z)\big) \Big]
\]

\[
\mathcal{L}^{comp}_{\theta,\phi,\zeta}(\mathbf{x}_{1:T}) = \sum_{t=1}^{T_C} \Big[ \mathbb{E}_{q_{\phi,\zeta}(\nu_{1:t} \mid \mathbf{x}_{1:T})} \log p_\theta(x_{t T_C} \mid \mathbf{x}_{1:(t-1) T_C}, \nu_{1:t}) - \beta_\nu \, D_{KL}\big(q_{\phi,\zeta}(\nu_t \mid \mathbf{x}_{(t-1) T_C : t T_C}) \,\|\, p(\nu)\big) \Big]
\]

with total loss $\mathcal{L}^{total}_{\theta,\phi,\zeta} = \mathcal{L}^{pred} + \mathcal{L}^{comp}$.
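
As a rough illustration of the composability term, the following sketch composes per-step latents over one subsequence with the LatentComposer head from the earlier sketch and asks the decoder to reach the subsequence's final frame from its first frame. The $\beta_\nu$-weighted KL term on $\nu$'s posterior is omitted for brevity, and names and shapes are assumptions, not the paper's exact training code:

import torch
import torch.nn.functional as F

def composability_loss(frames, encoder, decoder, composer):
    # frames: tensor of shape [K + 1, B, C, H, W] covering one subsequence x_0 ... x_K.
    # Compose the per-step latents into a single "trajectory" latent nu via g,
    # then require the decoder to reach the final frame from the first frame.
    zs = []
    for t in range(1, frames.shape[0]):
        z_mu, z_logvar = encoder(frames[t - 1], frames[t])
        zs.append(z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu))
    nu = zs[0]
    for z in zs[1:]:
        nu = composer(nu, z)          # apply g left-to-right over the sequence
    x_last_pred = decoder(frames[0], nu)
    return F.mse_loss(x_last_pred, frames[-1])

The total objective then sums the per-step prediction losses and this composability term, matching $\mathcal{L}^{total} = \mathcal{L}^{pred} + \mathcal{L}^{comp}$ above.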


3. Semi-Supervised Action Mapping for Downstream Use

After learning a task-agnostic $z_t$, CLASP can be fine-tuned to recover true actions or to perform action-conditioned prediction with dramatically fewer labels:

  • Collect a small set of action-labeled sequences.
  • Train two MLPs: one maps actions $u$ to latents $z$ ($\text{MLP}_{\text{lat}}$), and one maps latents $z$ to actions ($\text{MLP}_{\text{act}}$).
  • No gradients are propagated into the main model; only these heads are trained, so the core remains purely unsupervised.

Implementation example:

import torch
import torch.nn.functional as F

# Fit the action -> latent head; the pretrained CLASP encoder stays frozen.
optimizer = torch.optim.Adam(mlp_lat.parameters())
for x_tm1, x_t, u in labeled_dataset:
    with torch.no_grad():                       # no gradients flow into the CLASP core
        z_t = encode_action_step(x_tm1, x_t)    # fixed latent from the CLASP encoder
    z_pred = mlp_lat(u)                         # predicted latent for the labeled action u
    loss = F.mse_loss(z_pred, z_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Once trained, these heads let you translate between action commands and their visual effects in latent space.
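
The complementary latent-to-action head can be fit the same way; a minimal sketch under the same assumptions (mlp_act, encode_action_step, and labeled_dataset are placeholders carried over from the example above):

import torch
import torch.nn.functional as F

# Fit the latent -> action head, again with the CLASP core frozen.
optimizer = torch.optim.Adam(mlp_act.parameters())
for x_tm1, x_t, u in labeled_dataset:
    with torch.no_grad():
        z_t = encode_action_step(x_tm1, x_t)
    u_pred = mlp_act(z_t)             # recover the action that explains the observed change
    loss = F.mse_loss(u_pred, u)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage: infer actions directly from raw video, e.g. for imitation-from-observation:
# u_hat = mlp_act(encode_action_step(frame_prev, frame_next))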


4. Applications and Real-World Integration

  • Passive Robot Learning: Robots can watch unlabeled videos (e.g., of humans, other robots, or themselves in different embodiments) and build an inventory of actionable latent effects. This is highly data-efficient for environments where collecting action labels is expensive or infeasible.
  • Rapid Adaptation: In new domains—new backgrounds, lighting, or even different agent shapes—the latent space remains robust because it is disentangled from static content.
  • Planning: Downstream, CLASP supports action-conditioned video prediction and planning (e.g., visual servoing), matching supervised baseline performance with very little labeled data (e.g., mean angle error of 2.9° on the reacher vs 2.6° for fully supervised); a rollout sketch follows this list.
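
A hedged sketch of such an action-conditioned rollout, reusing the decoder and $\text{MLP}_{\text{lat}}$ head from the earlier sketches (function and variable names are assumptions):

import torch

def rollout(x0, commands, decoder, mlp_lat):
    # Roll the video model forward by mapping each action command into the
    # learned latent action space and decoding frame by frame.
    frames = [x0]
    with torch.no_grad():
        for u in commands:
            z = mlp_lat(u)                      # action command -> latent action
            frames.append(decoder(frames[-1], z))
    return frames

# For planning (e.g., visual servoing), candidate command sequences can be rolled
# out this way, scored against a goal image, and the best sequence executed.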

Empirical results:

| Task | CLASP (semi-sup) | Supervised Baseline |
|---|---|---|
| Reacher, angle error | 2.9° | 2.6° |
| Push, MAE | 0.025 | 0.024 |

But CLASP uses orders of magnitude fewer labeled examples.


5. Deployment Considerations and Trade-offs

  • Compute needs: Comparable to standard VAE video models plus an extra composition head; efficient on standard GPUs for short video clips/sequences.
  • Scale: For higher-dimensional scenes or longer horizons, training remains tractable, but evaluate the latent bottleneck size and the capacity of the composition function.
  • Robustness: Strong robustness across backgrounds and agents. The model specifically avoids overfitting action representations to background/static features.
  • Limitations: The representation is only as expressive as the agent's observed behaviors; very rare or unseen actions require more coverage in the dataset.

Summary Table: CLASP vs. Fully Supervised Approaches

| Criterion | CLASP (Semi-supervised) | Fully Supervised Models |
|---|---|---|
| Labeled action data | Very few sequences needed | All or most data labeled |
| Task/appearance independence | Yes | No |
| Robustness to environment variation | Strong | Often weak |
| Performance (prediction, planning) | Near parity or better | Strong (with labels) |
| Reusability / transfer | High | Low to moderate |

Practical Steps to Apply CLASP

  1. Collect passive video data of agents performing diverse behaviors.
  2. Train a stochastic video prediction VAE with the minimality and composability objectives described above (see the driver sketch after this list).
  3. (Optional) Label a small subset of sequences with action commands; fit lightweight mapping heads.
  4. Deploy for downstream applications: Visual servoing, imitation-from-observation, or as a general action space prior for transfer learning or meta-learning.
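
Tying the earlier sketches together, a compressed training driver for step 2 might look as follows (predict_next_frame_loss, composability_loss, and the data loader are the assumed pieces from the previous sketches, not a published reference implementation):

import torch

def train_clasp(video_loader, encoder, decoder, composer, kl_beta, epochs=10):
    # Unsupervised training: only raw video clips, no action labels.
    params = list(encoder.parameters()) + list(decoder.parameters()) + list(composer.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    for _ in range(epochs):
        for frames in video_loader:             # frames: [T, B, C, H, W]
            loss = 0.0
            for t in range(1, frames.shape[0]):
                loss = loss + predict_next_frame_loss(frames[t - 1], frames[t],
                                                      encoder, decoder, kl_beta)
            loss = loss + composability_loss(frames, encoder, decoder, composer)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()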

For robotics, computer vision, or imitation learning systems requiring robust, scalable action representations without burdensome labeling, CLASP provides an efficient, adaptable, and extensible solution with strong empirical backing.