Central Latent Action Spaces (CLAS)

Updated 4 March 2026
  • CLAS are learned, lower-dimensional manifolds that represent complex agent actions using latent variables across domains such as robotics, control, and language.
  • They employ methods like variational autoencoders, contrastive losses, and quantization to encode, decode, and optimize actions in a shared, compact space.
  • CLAS enable efficient policy transfer, improved generalization, and robust continual learning, addressing challenges in multi-agent coordination and dynamic environments.

A Central Latent Action Space (CLAS) is a shared, lower-dimensional, learned space in which complex agent actions (spanning robotics, control, language agents, and video world modeling) are represented as latent variables rather than as task-specific, user-designed, or direct actuator commands. CLAS frameworks encode domain actions via probabilistic or deterministic mappings into this latent space, support decoding or composition back to environment-specific actions, and are increasingly foundational in transfer, generalization, and continual learning across diverse and dynamic settings.

1. Formal Definitions and Core Structure

A CLAS is a learned manifold, typically denoted $\mathcal{Z} \subseteq \mathbb{R}^m$ or realized as a discrete codebook, that acts as an interface between high-level policies and environment-specific control. In the multi-agent setting, CLAS compresses the composite action $\mathbf{u} = (\mathbf{u}_1, \ldots, \mathbf{u}_N)$ of $N$ agents into a central latent $z$, with per-agent decoders $f^{(i)}_{\mathrm{dec}}$ mapping $z$ and local observations back to executable commands. In sequential decision-making for a single agent or dialog system, CLAS posits a latent $z \sim p_\theta(z|c)$, where context $c$ encodes agent state, and subsequent generation or policy learning occurs in this compact latent space (Aljalbout et al., 2022, Zhao et al., 2019).
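The multi-agent structure described above can be illustrated with a minimal sketch. It assumes random linear maps in place of trained networks, and the class name, dimensions, and layout are hypothetical rather than taken from Aljalbout et al. The point is only the data flow: one encoder compresses the joint action into $z$, and each agent's decoder reads $z$ together with that agent's local observation.

```python
import random

def linear(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real system would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

class CentralLatentActionSpace:
    """Toy CLAS: one shared latent z and one linear decoder per agent."""

    def __init__(self, n_agents, action_dim, obs_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.n_agents = n_agents
        # The encoder sees the concatenated joint action u = (u_1, ..., u_N).
        self.enc = random_matrix(latent_dim, n_agents * action_dim, rng)
        # Decoder i sees [z, o_i] and emits agent i's command.
        self.dec = [random_matrix(action_dim, latent_dim + obs_dim, rng)
                    for _ in range(n_agents)]

    def encode(self, joint_action):
        flat = [x for u_i in joint_action for x in u_i]
        return linear(self.enc, flat)

    def decode(self, z, local_obs):
        return [linear(self.dec[i], z + local_obs[i])
                for i in range(self.n_agents)]

# Two 7-DoF agents share a 4-dimensional central latent.
clas = CentralLatentActionSpace(n_agents=2, action_dim=7, obs_dim=3, latent_dim=4)
joint_action = [[0.1] * 7, [-0.2] * 7]
z = clas.encode(joint_action)
commands = clas.decode(z, local_obs=[[0.0] * 3, [1.0] * 3])
```

Here the 14-dimensional joint action is reduced to 4 latent dimensions, which is the space a high-level policy would act in.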

In continual learning, the CLAS is a shared action embedding space ($\mathcal{E} \subset \mathbb{R}^d$) into which all possible environment actions across a sequence of tasks are mapped, allowing a single high-level policy to be learned independently of the currently-valid environment action set (Pan et al., 6 Jun 2025). For control in robotics and world modeling, CLAS typically arises as a latent variable inferred from video or trajectory transitions using inverse dynamics, variational autoencoding, or contrastive quantization (Rybkin et al., 2018, Jiang et al., 10 Feb 2026, Zhang et al., 7 Jan 2026).
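One simple way to picture a policy that is independent of the valid action set is nearest-neighbor decoding in the embedding space. The sketch below is an illustrative assumption, not the decoder used by Pan et al.; the embeddings, action names, and the function `nearest_valid_action` are all hypothetical.

```python
import math

def nearest_valid_action(policy_output, action_embeddings, valid_actions):
    """Return the valid action whose embedding is closest to the policy output."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(valid_actions,
               key=lambda a: dist(policy_output, action_embeddings[a]))

# Hypothetical 2-D embeddings for four environment actions.
action_embeddings = {
    "left": [-1.0, 0.0],
    "right": [1.0, 0.0],
    "up": [0.0, 1.0],
    "down": [0.0, -1.0],
}

# The high-level policy emits a point in the embedding space, not an action id.
policy_output = [0.9, 0.3]

# Task 1 exposes all four actions; task 2 removes "right".
choice_task1 = nearest_valid_action(
    policy_output, action_embeddings, ["left", "right", "up", "down"])
choice_task2 = nearest_valid_action(
    policy_output, action_embeddings, ["left", "up", "down"])
```

The same policy output resolves to `"right"` when that action is available and falls back to `"up"`, the nearest remaining embedding, when it is not, so the policy itself does not need to change when the action set does.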

2. Learning and Regularizing the Central Latent Action Space

Learning a robust and interpretable central latent action representation requires balancing four properties: informativeness, minimality, disentanglement, and composability. Most approaches adopt variational autoencoders (VAEs) or related probabilistic models, maximizing the objective

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - \beta\,\mathrm{KL}[q_\phi(z|x)\|p(z)]$$

with $\beta$ tuning the information bottleneck (a larger $\beta$ yields more minimal, static-content-agnostic latents) (Rybkin et al., 2018, Zhao et al., 2019).
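To make the role of $\beta$ concrete, the following sketch evaluates the objective for one data point under common simplifying assumptions that are not stated in the cited papers: a diagonal Gaussian posterior, a standard normal prior, a unit-variance Gaussian decoder, and a single-sample estimate of the reconstruction term. The function names are illustrative.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def gaussian_log_likelihood(x, x_hat):
    """log N(x; x_hat, I): reconstruction term for a unit-variance decoder."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * sq_err - 0.5 * d * math.log(2.0 * math.pi)

def beta_elbo(x, x_hat, mu, log_var, beta):
    """Single-sample estimate of the beta-weighted ELBO."""
    return (gaussian_log_likelihood(x, x_hat)
            - beta * gaussian_kl_to_standard_normal(mu, log_var))

# Toy values: x_hat stands in for the decoder output at one sampled z.
x, x_hat = [1.0, 0.0], [0.5, 0.0]
mu, log_var = [1.0, 0.0], [0.0, 0.0]

kl = gaussian_kl_to_standard_normal(mu, log_var)
elbo_beta1 = beta_elbo(x, x_hat, mu, log_var, beta=1.0)
elbo_beta4 = beta_elbo(x, x_hat, mu, log_var, beta=4.0)
```

With the reconstruction held fixed, raising $\beta$ from 1 to 4 lowers the objective by three times the KL term, so the optimizer is pushed harder toward posteriors that carry less information about $x$.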

  • Composability Losses: Enforce centrality via composability, requiring that blocks of latent actions compose (via an MLP or even explicit group priors) into trajectory descriptors that encode only sequence effect, not static content. This is achieved through a secondary latent $\nu$ trained similarly to $z$, enforcing group-like composition (Rybkin et al., 2018).
  • Contrastive and Alignment Losses: To ensure cross-modal alignment and semantic effect transfer, methods such as CLAP and Olaf-World employ contrastive losses between features extracted from video, proprioceptive, and robot modalities. Olaf-World’s Seq$\Delta$-REPA loss aligns cumulative latent actions with the net temporal feature change from a frozen representation, optimizing

$$\mathcal{L}_{\mathrm{Seq}\Delta\text{-}\mathrm{REPA}} = 1 - \langle\mathrm{norm}(h_\psi(\bar{z})),\, \mathrm{norm}(\tau_*)\rangle$$

where $\bar{z}$ is the mean latent, $\tau_*$ is the effect direction in a semantic embedding space, and $h_\psi$ is a linear projector (Jiang et al., 10 Feb 2026).
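The alignment loss above is one minus a cosine similarity, which a short sketch makes explicit. It assumes that the effect direction is the difference between frozen-encoder features of the last and first frames (consistent with "net temporal feature change" but not confirmed in detail by the source), and it uses a plain weight matrix for the linear projector. All names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v, eps=1e-12):
    """Scale v to unit length; eps guards against a zero vector."""
    norm = math.sqrt(dot(v, v))
    return [x / max(norm, eps) for x in v]

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def linear(weights, vec):
    return [dot(row, vec) for row in weights]

def seq_delta_repa_loss(latents, projector, feat_first, feat_last):
    """1 - cos(h(z_bar), tau), with tau assumed to be feat_last - feat_first."""
    z_bar = mean_vector(latents)
    tau = [b - a for a, b in zip(feat_first, feat_last)]
    h = linear(projector, z_bar)
    return 1.0 - dot(normalize(h), normalize(tau))

identity = [[1.0, 0.0], [0.0, 1.0]]
latents = [[1.0, 0.0], [3.0, 0.0]]  # per-step latent actions in one clip

loss_aligned = seq_delta_repa_loss(latents, identity, [0.0, 0.0], [0.5, 0.0])
loss_orthogonal = seq_delta_repa_loss(latents, identity, [0.0, 0.0], [0.0, 2.0])
loss_opposed = seq_delta_repa_loss(latents, identity, [0.0, 0.0], [-1.0, 0.0])
```

The loss is 0 when the projected mean latent points along the observed feature change, 1 when the two are orthogonal, and 2 when they point in opposite directions. Because both vectors are normalized, only direction matters, not magnitude.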

3. Architectures and Decoding Mechanics

CLAS implementations vary by domain but typically include:

  • Encoder $f_{\rm enc}$: Maps observed transitions (observations + actions) to $z\in\mathcal{Z}$, instantiated by transformers (for language or video), CNNs+MLPs (for states and images), or recurrent models.
  • Decoders $\{f^{(i)}_{\rm dec}\}$: For multi-agent or multi-task scenarios, separate decoders map $z$ and current agent/local observations to the native actuator or action space (Aljalbout et al., 2022, Pan et al., 6 Jun 2025).
  • Latent Action Policy: High-level policies $\pi_\theta(s) \to \mathcal{Z}$ produced by MLPs, transformers, or small networks, enabling the agent to act in the central latent space, with downstream decoding to environment actions.
  • Merge and Injection Modules: In LLMs, CLAS codes are injected into transformer layers via concatenation and merge-MLPs, allowing the latent action to steer generation or hidden states (Jia et al., 27 Mar 2025).
  • Quantization and Cross-modal Alignment: CLAP-style VQ-VAEs and contrastive modules ensure that video- and action-inferred latents are associated with the same codebook, enabling cross-modal skill transfer (Zhang et al., 7 Jan 2026).
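The shared-codebook idea in the last item can be sketched as a nearest-neighbor lookup. This is a generic vector-quantization step, not CLAP's implementation: it omits the straight-through gradient, the commitment loss, and the contrastive objective, and the codebook and latent values are made up for illustration.

```python
def quantize(z, codebook):
    """Return (index, code) of the codebook entry nearest to z (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return idx, codebook[idx]

# Hypothetical shared codebook with four 2-D latent action codes.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Continuous latents for the same motion, inferred from two modalities.
z_video = [0.9, 0.1]   # e.g. from a video encoder
z_robot = [1.1, -0.2]  # e.g. from a proprioceptive encoder

idx_video, code_video = quantize(z_video, codebook)
idx_robot, code_robot = quantize(z_robot, codebook)
```

Both latents snap to code index 1, so a policy trained on video-derived codes emits symbols that a robot-side decoder already understands. Training objectives such as contrastive alignment exist to make this agreement hold in practice.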

4. Policy Optimization and Generalization

Policies operating in CLAS are trained via RL (or behavior cloning), where only the encoder/policy mapping or prior $p_\psi(z|o)$ is updated, and decoders/scripts are fixed or minimally updated for each new context:

  • RL in Latent Space: SAC, PPO, or other methods optimize returns over sequences of latent actions, with gradient updates confined to latent policy parameters. Policy gradient updates utilize the score-function estimator for discrete/quantized latents and reparameterization for continuous variants (Aljalbout et al., 2022, Zhao et al., 2019, Jia et al., 27 Mar 2025).
  • Continual Learning and Adaptation: In dynamic or non-stationary action spaces, CLAS enables forward transfer and near-zero forgetting. Adaptation is localized to the decoder (e.g., via EWC regularization); the policy in latent space is untouched, ensuring knowledge accumulation across tasks (Pan et al., 6 Jun 2025).
  • Sample Efficiency and Robustness: Across domains, CLAS substantially reduces required dataset size and label supervision. For example, in visual servoing, CLASP achieves comparable performance to fully supervised models with $100\times$ fewer labeled sequences (Rybkin et al., 2018). In multi-robot manipulation, CLAS converges $3\times$ faster than decentralized baselines and is robust to external disturbances (Aljalbout et al., 2022).
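For the discrete case mentioned under RL in Latent Space, the score-function estimator can be written out for a categorical policy over latent codes. The sketch below is a toy bandit, not any cited method: the reward function, learning rate, and three-code latent space are assumptions, and the continuous (reparameterized) variant is not shown. It relies on the identity that the gradient of $\log \pi(k)$ with respect to the logits is the one-hot vector for $k$ minus the softmax probabilities.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_function_grad(logits, sampled_index, ret, baseline=0.0):
    """Single-sample REINFORCE gradient of expected return w.r.t. logits."""
    probs = softmax(logits)
    return [(ret - baseline) * ((1.0 if k == sampled_index else 0.0) - p)
            for k, p in enumerate(probs)]

def toy_reward(code_index):
    """Hypothetical environment: only latent code 2 decodes to a useful action."""
    return 1.0 if code_index == 2 else 0.0

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # latent policy parameters; the decoder stays fixed
learning_rate = 0.5

for _ in range(200):
    probs = softmax(logits)
    k = rng.choices(range(len(logits)), weights=probs)[0]
    grad = score_function_grad(logits, k, toy_reward(k))
    logits = [l + learning_rate * g for l, g in zip(logits, grad)]

final_probs = softmax(logits)
```

Only the logits change during training, which mirrors the description above of gradient updates confined to the latent policy parameters. In practice a learned baseline or an actor-critic method would be used to reduce the variance of this estimator.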

5. Cross-Domain, Multi-Agent, and Multi-Modal Applications

CLAS methodologies are prominent in:

  • Multi-Agent Coordination: In robot manipulation, a low-dimensional latent $z$ mediates synchronized action for multiple agents, overcoming sample complexity and exploration bottlenecks in high DoF systems (Aljalbout et al., 2022).
  • Continual and Lifelong RL: Dynamic tasks with evolving action sets benefit from a stable latent policy, with only decoders requiring adaptation or elasticity (Pan et al., 6 Jun 2025).
  • Vision-Language-Action Models: CLAP leverages contrastive CLAS quantization to unify human video and robot proprioceptive data, enabling instruction following, cross-domain generalization, and robust policy finetuning under catastrophic forgetting constraints (Zhang et al., 7 Jan 2026).
  • Dialog Agents and LLMs: Treating latent action as the dialogue agent's action enables smoother RL training, less exposure bias, and better trade-offs in response diversity and task utility compared to token-level RL (Zhao et al., 2019, Jia et al., 27 Mar 2025).
  • Video World Modeling and Transfer: Olaf-World demonstrates that anchoring CLAS to observable semantic effects allows for effective zero-shot action transfer, robust domain adaptation, and strong performance under severe OOD shifts (Jiang et al., 10 Feb 2026).
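For the continual-learning case, where adaptation is confined to the decoder and regularized with EWC, the penalty takes the standard quadratic form $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_i^*)^2$. The sketch below applies it to a flat list of decoder parameters. The parameter values, Fisher estimates, and function names are illustrative assumptions and do not come from Pan et al.

```python
def ewc_penalty(params, anchor_params, fisher_diag, strength):
    """Quadratic EWC penalty: (strength / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * strength * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher_diag))

def regularized_decoder_loss(task_loss, params, anchor_params, fisher_diag, strength):
    """New-task loss plus a penalty for moving away from the old-task decoder."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, strength)

# Hypothetical decoder with two parameters.
anchor = [0.0, 2.0]   # decoder weights after the previous task
fisher = [4.0, 1.0]   # diagonal Fisher estimate: parameter 0 mattered more
current = [1.0, 2.0]  # candidate weights while adapting to the new task

penalty = ewc_penalty(current, anchor, fisher, strength=2.0)
total = regularized_decoder_loss(0.3, current, anchor, fisher, strength=2.0)
```

Moving the first parameter by 1.0 costs 4.0 here because its Fisher weight is high, while the unchanged second parameter costs nothing. The latent policy does not appear in the penalty at all, reflecting the claim that only the decoder adapts.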

6. Empirical Performance, Ablations, and Best Practices

Quantitative evaluations of CLAS consistently show improvements over baselines in sample efficiency, diversity, generalization, and stability.

  • Coordination/Transfer: CLAS outperforms one-agent and fully decentralized multi-agent RL approaches in both final performance and robustness to disturbances (Aljalbout et al., 2022).
  • Continual RL: CLAS achieves higher mean continual return and forward transfer, with nearly zero forgetting, across MiniGrid, Procgen, and Atari action-set expansion/contraction tasks (Pan et al., 6 Jun 2025).
  • Generalization Metrics:

| Domain | CLAS Method | Key Metric(s) | Baseline(s) | Empirical Result |
|---|---|---|---|---|
| Multi-Robot | (Aljalbout et al., 2022) | Convergence steps, robustness | FULL_DEC, SHARED_Q | CLAS 2.5e5 steps ($\sim 3\times$ faster); up to 100% robust |
| Continual RL | (Pan et al., 6 Jun 2025) | Continual return, forgetting | FT, EWC, CLEAR | CLAS $R=0.9$, $F\approx 0$, highest T |
| World Modeling | (Jiang et al., 10 Feb 2026) | Macro-F1, RPE | AdaWorld | $\sim$30% higher F1; lower RPE at all data levels |
| LLMs | (Jia et al., 27 Mar 2025) | Math500 %, semantic diversity | Token RL | 42.4% vs 38.2% (RL), diversity ratio 1.3x |
| Video to Robot | (Zhang et al., 7 Jan 2026) | Precision, generalization | VLA baselines | Significant outperformance on transfer, OOD |

Ablations confirm that regularization (e.g., via $\Delta$ alignment, KL penalties, or contrastive objectives) and quantization are critical for cross-context centrality and zero-shot adaptation. Discrete latent spaces tend to be more stable in policy RL than unbounded continuous spaces (Zhao et al., 2019).

7. Limitations and Future Directions

CLAS imposes significant structure on agent behavior and enables powerful transfer and generalization, but current limitations include:

Continued development of CLAS is expected to focus on hierarchical structure, cross-modal scaling, policy-adaptive decoders, integration with large-scale self-supervised learning, and rigorous theory for compositional action representations. CLAS has already established itself as a unifying principle for efficient, generalizable, and robust policy learning in complex, real-world domains.
