Task-Agnostic Latent Action Space (CLASP)
Last updated: June 11, 2025
A task-agnostic latent action space is a representation learned directly from visual observations that encodes an agent’s possible actions—without any access to explicit action labels, task specifications, or goals during the learning phase. In the CLASP (Composable Learned Action Space Predictor) framework, this is achieved by observing an agent interacting with its environment (e.g., via video) and using a generative model to isolate only those aspects of scene change that carry “action-like” dynamics.
Implementing CLASP: Learning a Task-Agnostic Latent Action Space
1. Stochastic Video Prediction Model
At its core, CLASP employs a variational, stochastic video prediction architecture. Here’s how this works in practice:
- Input: A sequence of images representing the agent’s interaction with the environment.
- Encoder/Inference network: For each timestep $t$, infer a latent $z_t$ from the frame pair $(x_{t-1}, x_t)$. This latent is expected to capture the change—essentially, what “action” caused the scene to transform from $x_{t-1}$ to $x_t$.
- Generative Model: Predict $x_t$ from $x_{t-1}$ and $z_t$ via a decoder.
Minimality Constraint: The latent $z_t$ is regularized (see below) so it contains only information related to scene change, not static appearance.
Composability: A learned composition function is trained so that composing two action latents mimics applying their underlying physical actions sequentially (a sketch follows the implementation sketch below). This compositional property ensures the latent is action-structured—not entangled with static content.
Implementation sketch:
```python
import torch
import torch.nn.functional as F

def predict_next_frame(x_tm1, x_t, encoder, decoder, kl_beta):
    # Infer the latent "action" that transformed x_{t-1} into x_t.
    z_mu, z_logvar = encoder(x_tm1, x_t)
    # Reparameterization trick: z = mu + sigma * eps.
    z_t = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)
    x_t_pred = decoder(x_tm1, z_t)         # predict the next frame
    pred_loss = F.mse_loss(x_t_pred, x_t)
    # Closed-form KL divergence to a standard normal prior N(0, I).
    kl_loss = -0.5 * torch.mean(1 + z_logvar - z_mu.pow(2) - z_logvar.exp())
    return pred_loss + kl_beta * kl_loss
```
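The composition function itself can be a small network. A minimal sketch, assuming a Gaussian segment latent $\nu$ and an MLP head (the class name `CompositionMLP` and the hidden size are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class CompositionMLP(nn.Module):
    """Composes two consecutive action latents into one segment latent."""
    def __init__(self, latent_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),  # outputs mu and logvar of nu
        )

    def forward(self, z_a, z_b):
        # nu should have the same effect as applying z_a then z_b in sequence.
        nu_mu, nu_logvar = self.net(torch.cat([z_a, z_b], dim=-1)).chunk(2, dim=-1)
        return nu_mu, nu_logvar
```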
2. Information Bottleneck and Loss Functions
CLASP extends a VAE with additional information bottleneck objectives. The key losses:
- Prediction loss: Standard VAE ELBO where the objective is to reconstruct (predict) the next frame.
- Minimality loss (the $\beta_z$-weighted KL term in $\mathcal{L}_{pred}$ below): Penalize the information $z_t$ carries about the observed frames; keep $z_t$ as minimally informative as possible while still allowing accurate prediction.
- Composability loss ($\mathcal{L}_{comp}$): For sequences, enforce that per-step latents $z_t$ can be composed into a “trajectory” latent $\nu$ via a learned composition function, and that $\nu$ is minimally informative while still reconstructing end states. This forces the latent action structure to support composition.
Practical loss implementation (from the paper’s equations):
$$
\mathcal{L}_{pred}^{\theta,\phi}(\mathbf{x}_{1:T}) = \sum_{t=1}^{T} \left[ \mathbb{E}_{q_\phi(\mathbf{z}_{2:t}\mid\mathbf{x}_{1:t})} \log p_\theta(x_t \mid \mathbf{x}_{1:t-1}, \mathbf{z}_{2:t}) - \beta_z \, D_{KL}\!\left(q_\phi(z_t \mid \mathbf{x}_{t-1:t}) \,\|\, p(z)\right) \right]
$$

$$
\mathcal{L}_{comp}^{\theta,\phi,\zeta}(\mathbf{x}_{1:T}) = \sum_{t=1}^{T/T_C} \left[ \mathbb{E}_{q_{\phi,\zeta}(\nu_{1:t}\mid\mathbf{x}_{1:T})} \log p_\theta(x_{tT_C} \mid \mathbf{x}_{1:(t-1)T_C}, \nu_{1:t}) - \beta_\nu \, D_{KL}\!\left(q_{\phi,\zeta}(\nu_t \mid \mathbf{x}_{(t-1)T_C:tT_C}) \,\|\, p(\nu)\right) \right]
$$

with total loss $\mathcal{L} = \mathcal{L}_{pred} + \mathcal{L}_{comp}$.
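To make the composability term concrete, here is a hedged sketch for a single two-step segment ($T_C = 2$), reusing the imports and reparameterization from the prediction sketch above; `compose` is a `CompositionMLP` instance, and decoding the segment’s end frame directly from $x_0$ and $\nu$ is a simplification of the paper’s segment-level objective:

```python
def composability_loss(x0, x1, x2, encoder, decoder, compose, kl_beta_nu):
    # Per-step action latents for the segment x0 -> x1 -> x2.
    z1_mu, z1_logvar = encoder(x0, x1)
    z2_mu, z2_logvar = encoder(x1, x2)
    z1 = z1_mu + torch.exp(0.5 * z1_logvar) * torch.randn_like(z1_mu)
    z2 = z2_mu + torch.exp(0.5 * z2_logvar) * torch.randn_like(z2_mu)
    # Compose the two step latents into a single segment latent nu.
    nu_mu, nu_logvar = compose(z1, z2)
    nu = nu_mu + torch.exp(0.5 * nu_logvar) * torch.randn_like(nu_mu)
    # nu must carry the whole segment in one decoding step.
    x2_pred = decoder(x0, nu)
    rec = F.mse_loss(x2_pred, x2)
    # Keep nu minimally informative, mirroring the beta_nu KL term above.
    kl = -0.5 * torch.mean(1 + nu_logvar - nu_mu.pow(2) - nu_logvar.exp())
    return rec + kl_beta_nu * kl
```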
3. Semi-Supervised Action Mapping for Downstream Use
After learning a task-agnostic $z$, CLASP can be fine-tuned to recover true actions or to perform action-conditioned prediction with dramatically fewer labels:
- Collect a small set of action-labeled sequences.
- Train two MLPs: one maps actions to latents ($u \to z$), and one maps latents to actions ($z \to u$).
- No gradients are propagated into the main model—only these heads are trained, maintaining the purely unsupervised “core”.
Implementation example:
```python
import torch
import torch.nn.functional as F
from torch.optim import Adam

# Train the action -> latent head; the CLASP encoder stays frozen.
optimizer = Adam(mlp_lat.parameters())
for (x_tm1, x_t, u) in labeled_dataset:
    with torch.no_grad():
        z_t = encode_action_step(x_tm1, x_t)  # fixed, from CLASP encoder
    z_pred = mlp_lat(u)
    loss = F.mse_loss(z_pred, z_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
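The latent-to-action head is trained symmetrically. A sketch, assuming `mlp_act` is the $z \to u$ MLP from the list above and everything else is as in the previous snippet:

```python
# Train the latent -> action head the same way, encoder still frozen.
optimizer_act = Adam(mlp_act.parameters())
for (x_tm1, x_t, u) in labeled_dataset:
    with torch.no_grad():
        z_t = encode_action_step(x_tm1, x_t)
    u_pred = mlp_act(z_t)
    loss = F.mse_loss(u_pred, u)
    optimizer_act.zero_grad()
    loss.backward()
    optimizer_act.step()

# Usage: read an action out of a passively observed transition.
u_hat = mlp_act(encode_action_step(x_prev, x_curr))
```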
4. Applications and Real-World Integration
- Passive Robot Learning: Robots can watch unlabeled videos (e.g., of humans, other robots, or themselves in different embodiments) and build an inventory of actionable latent effects. This is highly data-efficient for environments where collecting action labels is expensive or infeasible.
- Rapid Adaptation: In new domains—new backgrounds, lighting, or even different agent shapes—the latent space remains robust because it is disentangled from static content.
- Planning: Downstream, CLASP supports action-conditioned video prediction and planning (e.g., visual servoing), even approaching supervised baseline performance with very little labeled data (e.g., a mean angle error of 2.9° on the reacher task vs. 2.6° for the fully supervised model); see the sketch below.
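As an illustration of the planning use case, a minimal random-shooting sketch: candidate actions are mapped into the latent space with the trained action-to-latent head and rolled out through the decoder. The names follow the earlier snippets, `sample_action` and `goal_image` are hypothetical, and the pixel-distance cost is an assumption, not the paper’s exact planner:

```python
def plan_one_step(x_curr, goal_image, mlp_lat, decoder, sample_action,
                  n_candidates=256):
    """Random-shooting visual servoing: pick the candidate action whose
    predicted next frame is closest to the goal image."""
    best_u, best_cost = None, float("inf")
    for _ in range(n_candidates):
        u = sample_action()               # draw a candidate action
        with torch.no_grad():
            z = mlp_lat(u)                # action -> latent (trained head)
            x_pred = decoder(x_curr, z)   # action-conditioned prediction
        cost = F.mse_loss(x_pred, goal_image).item()
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u
```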
Empirical results:
| Task | CLASP (semi-supervised) | Supervised Baseline |
|---|---|---|
| Reacher, angle error | 2.9° | 2.6° |
| Push, MAE | 0.025 | 0.024 |
CLASP reaches this performance with orders of magnitude fewer labeled examples.
5. Deployment Considerations and Trade-offs
- Compute needs: Comparable to VAEs plus an extra composition head—efficient on standard GPUs for short video clips/sequences.
- Scale: For higher-dimensional scenes or longer horizons, training remains tractable, but evaluate the latent bottleneck size and the capacity of the composition function.
- Robustness: Strong across backgrounds and agents; the model specifically avoids tying action representations to background/static features.
- Limitations: The representation is only as expressive as the agent’s observed behaviors; very rare or unseen actions require more coverage in the dataset.
Summary Table: CLASP vs. Fully Supervised Approaches
| Criterion | CLASP (Semi-supervised) | Fully Supervised Models |
|---|---|---|
| Labeled action data | Very few sequences needed | All or most data labeled |
| Task/appearance independence | Yes | No |
| Robustness to environment variation | Strong | Often weak |
| Performance (prediction, planning) | Near parity or better | Strong (with labels) |
| Reusability / transfer | High | Low–moderate |
Practical Steps to Apply CLASP
- Collect passive video data of agents performing diverse behaviors.
- Train a stochastic video prediction VAE with the minimality and composability objectives described above.
- (Optional) Label a small subset of sequences with action commands; fit lightweight mapping heads.
- Deploy for downstream applications: Visual servoing, imitation-from-observation, or as a general action space prior for transfer learning or meta-learning.
For robotics, computer vision, or imitation learning systems requiring robust, scalable action representations without burdensome labeling, CLASP provides an efficient, adaptable, and extensible solution with strong empirical backing.