Central Latent Action Spaces (CLAS)

Updated 4 March 2026

CLAS are learned, lower-dimensional manifolds that represent complex agent actions using latent variables across domains such as robotics, control, and language.
They employ methods like variational autoencoders, contrastive losses, and quantization to encode, decode, and optimize actions in a shared, compact space.
CLAS enable efficient policy transfer, improved generalization, and robust continual learning, addressing challenges in multi-agent coordination and dynamic environments.

A Central Latent Action Space (CLAS) is a shared, lower-dimensional, learned space in which complex agent actions (spanning robotics, control, language agents, and video world modeling) are represented as latent variables rather than as task-specific, user-designed, or direct actuator commands. CLAS frameworks encode domain actions via probabilistic or deterministic mappings into this latent space, support decoding or composition back to environment-specific actions, and are increasingly foundational in transfer, generalization, and continual learning across diverse and dynamic settings.

1. Formal Definitions and Core Structure

A CLAS is a learned manifold, typically denoted $\mathcal{Z} \subseteq \mathbb{R}^m$ or represented as a discrete codebook, that acts as an interface between high-level policies and environment-specific control. In the multi-agent setting, CLAS compresses the composite action $\mathbf{u} = (\mathbf{u}_1, \ldots, \mathbf{u}_N)$ of $N$ agents into a central latent $z$, with per-agent decoders $f^{(i)}_{\mathrm{dec}}$ mapping $z$ and local observations back to executable commands. In sequential decision-making for a single agent or dialog system, CLAS posits a latent $z \sim p_\theta(z|c)$, where the context $c$ encodes the agent state, and subsequent generation or policy learning occurs in this compact latent space (Aljalbout et al., 2022, Zhao et al., 2019).

In continual learning, the CLAS is a shared action embedding space ($\mathcal{E} \subset \mathbb{R}^d$) into which all possible environment actions across a sequence of tasks are mapped, allowing a single high-level policy to be learned independently of the currently valid environment action set (Pan et al., 6 Jun 2025). For control in robotics and world modeling, CLAS typically arises as a latent variable inferred from video or trajectory transitions using inverse dynamics, variational autoencoding, or contrastive quantization (Rybkin et al., 2018, Jiang et al., 10 Feb 2026, Zhang et al., 7 Jan 2026).
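To make the continual-learning picture concrete, the following is a minimal sketch, not the method of Pan et al.: it assumes a hand-written embedding table and a nearest-neighbour decoder restricted to the currently valid actions. The names `ACTION_EMBEDDINGS` and `decode_to_valid_action`, and the action labels themselves, are hypothetical.

```python
import math

# Hypothetical shared embedding table: every action that appears in any task
# gets a fixed point in the embedding space E (d = 2 here for readability).
ACTION_EMBEDDINGS = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "forward": (0.0, 1.0),
    "pickup": (0.0, -1.0),
}


def decode_to_valid_action(e, valid_actions):
    """Return the currently valid action whose embedding is closest to e.

    Nearest-neighbour decoding is an assumption made for this sketch.
    """
    if not valid_actions:
        raise ValueError("valid_actions must not be empty")
    return min(valid_actions, key=lambda a: math.dist(e, ACTION_EMBEDDINGS[a]))


# The high-level policy output is identical in both tasks;
# only the set of valid environment actions changes.
policy_output = (0.9, 0.2)
task_a = ["left", "right", "forward"]
task_b = ["left", "forward", "pickup"]  # "right" is no longer available

print(decode_to_valid_action(policy_output, task_a))  # right
print(decode_to_valid_action(policy_output, task_b))  # forward
```

The point of the sketch is only that the policy output lives in $\mathcal{E}$ and never refers to a specific action set; a learned decoder would replace the lookup.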

2. Learning and Regularizing the Central Latent Action Space

Learning a robust and interpretable central latent action representation requires balancing four properties: informativeness, minimality, disentanglement, and composability. Most approaches adopt variational autoencoders (VAEs) or related probabilistic models:

Latent Variable ELBO: Maximize the $\beta$-weighted evidence lower bound

$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - \beta\,\mathrm{KL}[q_\phi(z|x)\|p(z)]$

where $\beta$ sets the strength of the information bottleneck: a larger $\beta$ yields more minimal latents that ignore static content (Rybkin et al., 2018, Zhao et al., 2019).
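As a worked illustration of the objective above, here is a small sketch that evaluates a single-sample estimate of the $\beta$-weighted ELBO. It assumes a diagonal-Gaussian posterior $q_\phi(z|x)$, a standard-normal prior $p(z)$, and a unit-variance Gaussian decoder; these choices and the function names are assumptions for illustration, not details taken from the cited papers.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def beta_elbo(x, x_recon, mu, log_var, beta=1.0):
    """Single-sample estimate of the beta-weighted ELBO.

    With a unit-variance Gaussian decoder, log p(x|z) equals
    -0.5 * ||x - x_recon||^2 up to an additive constant, which is dropped.
    """
    recon_log_lik = -0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return recon_log_lik - beta * kl


mu, log_var = [1.0, 0.0], [0.0, 0.0]   # posterior parameters for one input
x, x_recon = [1.0, 2.0], [1.0, 1.0]    # input and its reconstruction

print(gaussian_kl_to_standard_normal(mu, log_var))    # 0.5
print(beta_elbo(x, x_recon, mu, log_var, beta=1.0))   # -1.0
print(beta_elbo(x, x_recon, mu, log_var, beta=4.0))   # -2.5
```

Raising `beta` from 1 to 4 leaves the reconstruction term unchanged and quadruples the KL penalty, which is the mechanism by which a larger $\beta$ pushes the latent toward the prior.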

Composability Losses: Enforce centrality via composability, requiring that blocks of latent actions compose (via an MLP or explicit group priors) into trajectory descriptors that encode only the effect of the sequence, not static content. This is achieved through a secondary latent $\nu$, trained in the same way as $z$, that enforces group-like composition (Rybkin et al., 2018).
Contrastive and Alignment Losses: To ensure cross-modal alignment and semantic effect transfer, methods such as CLAP and Olaf-World employ contrastive losses between features extracted from video, proprioceptive, and robot modalities. Olaf-World’s Seq$\Delta$-REPA loss aligns cumulative latent actions with the net temporal feature change from a frozen representation, optimizing

$\mathcal{L}_{\mathrm{Seq}\Delta\text{-}\mathrm{REPA}} = 1 - \langle\mathrm{norm}(h_\psi(\bar{z})),\, \mathrm{norm}(\tau_*)\rangle$

where $\bar{z}$ is the mean latent, $\tau_*$ is the effect direction in a semantic embedding space, and $h_\psi$ is a linear projector (Jiang et al., 10 Feb 2026).
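The loss above is one minus a cosine similarity. A minimal sketch follows, assuming that $\bar{z}$ is the mean of the per-step latents in a clip, that $\tau_*$ is the difference between frozen features at the end and start of the clip, and that $h_\psi$ is a bias-free linear map. The function names and the exact way $\tau_*$ is computed are assumptions for illustration, not the Olaf-World implementation.

```python
import math


def l2_normalize(v, eps=1e-8):
    """Scale v to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def linear_project(W, z):
    """h_psi as a bias-free linear map; W holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]


def seq_delta_repa_loss(latents, W, feat_start, feat_end):
    """1 - cosine(h_psi(z_bar), tau_*) for one clip."""
    m = len(latents[0])
    # z_bar: mean of the per-step latent actions in the clip.
    z_bar = [sum(z[i] for z in latents) / len(latents) for i in range(m)]
    # tau_*: net change of the frozen features over the clip (assumed form).
    tau = [b - a for a, b in zip(feat_start, feat_end)]
    h = l2_normalize(linear_project(W, z_bar))
    t = l2_normalize(tau)
    return 1.0 - sum(a * b for a, b in zip(h, t))


identity = [[1.0, 0.0], [0.0, 1.0]]
latents = [[1.0, 0.0], [1.0, 0.0]]

aligned = seq_delta_repa_loss(latents, identity, [0.0, 0.0], [2.0, 0.0])
opposed = seq_delta_repa_loss(latents, identity, [0.0, 0.0], [-2.0, 0.0])
print(aligned, opposed)  # close to 0 when aligned, close to 2 when opposed
```

Because both vectors are normalized, the loss depends only on direction, so it ranges from 0 (same direction) to 2 (opposite direction).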

Codebook Learning and Quantization: Discretized CLAS variants (e.g., CoLA and CLAP) learn a codebook of latent actions with $N$ entries of embedding dimension $d$ (here $N$ is the codebook size, not the agent count of Section 1). Quantization via Gumbel-Softmax or vector quantization enables more stable and data-efficient RL as well as cross-modal transfer (Jia et al., 27 Mar 2025, Zhang et al., 7 Jan 2026).
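The vector-quantization step can be sketched as a nearest-neighbour lookup into the codebook. The example below shows only the forward pass on a hand-written codebook; the training-time pieces (straight-through gradients, codebook and commitment losses, or Gumbel-Softmax relaxation) are omitted, and the names `quantize` and `quantization_error` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z, codebook):
    """Return (index, code) of the codebook entry nearest to z."""
    index = min(
        range(len(codebook)), key=lambda k: squared_distance(z, codebook[k])
    )
    return index, codebook[index]


def quantization_error(z, codebook):
    """Squared distance between z and its quantized code."""
    _, code = quantize(z, codebook)
    return squared_distance(z, code)


# Toy codebook with 4 entries of dimension 2 (values chosen by hand).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(quantize((0.9, 0.2), codebook)[0])  # 1
print(quantize((0.4, 0.8), codebook)[0])  # 2
```

A policy over such a codebook chooses an index rather than a real-valued vector, which is what makes the discrete variants amenable to categorical policy-gradient methods.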

3. Architectures and Decoding Mechanics

CLAS implementations vary by domain but typically include:

Encoder $f_{\rm enc}$: Maps observed transitions (observations + actions) to $z\in\mathcal{Z}$, instantiated by transformers (for language or video), CNNs+MLPs (for states and images), or recurrent models.
Decoders $\{f^{(i)}_{\rm dec}\}$: For multi-agent or multi-task scenarios, separate decoders map $z$ and the current agent's local observations to the native actuator or action space (Aljalbout et al., 2022, Pan et al., 6 Jun 2025).
Latent Action Policy: A high-level policy $\pi_\theta$ maps the state $s$ to a point in $\mathcal{Z}$ and is parameterized by an MLP, transformer, or other small network, so that the agent acts in the central latent space and its output is decoded downstream into environment actions.
Merge and Injection Modules: In LLMs, CLAS codes are injected into transformer layers via concatenation and merge-MLPs, allowing the latent action to steer generation or hidden states (Jia et al., 27 Mar 2025).
Quantization and Cross-modal Alignment: CLAP-style VQ-VAEs and contrastive modules ensure that video- and action-inferred latents are associated with the same codebook, enabling cross-modal skill transfer (Zhang et al., 7 Jan 2026).
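The interplay between the latent policy and the per-agent decoders listed above can be sketched as follows. This is an illustrative toy with hand-set linear maps, not an architecture from the cited papers; `LinearDecoder`, `act`, and all weights and dimensions are assumptions.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class LinearDecoder:
    """Hypothetical per-agent decoder f_dec^(i): (z, local_obs) -> action."""

    def __init__(self, W_z, W_obs):
        self.W_z = W_z      # maps the central latent to the agent's action space
        self.W_obs = W_obs  # maps the agent's local observation to a correction

    def __call__(self, z, local_obs):
        from_latent = matvec(self.W_z, z)
        from_obs = matvec(self.W_obs, local_obs)
        return [a + b for a, b in zip(from_latent, from_obs)]


def act(latent_policy, decoders, state, local_observations):
    """One control step: a single central latent, decoded once per agent."""
    z = latent_policy(state)
    actions = [dec(z, obs) for dec, obs in zip(decoders, local_observations)]
    return z, actions


# Toy latent policy: keeps the first two state dimensions (3-D state, 2-D latent).
W_pi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
latent_policy = lambda s: matvec(W_pi, s)

decoders = [
    LinearDecoder(W_z=[[1.0, 0.0], [0.0, 1.0]], W_obs=[[0.1], [0.0]]),
    LinearDecoder(W_z=[[-1.0, 0.0], [0.0, 1.0]], W_obs=[[0.0], [0.5]]),
]

z, actions = act(latent_policy, decoders, [0.5, -1.0, 2.0], [[1.0], [2.0]])
print(z)        # central latent shared by both agents
print(actions)  # one action vector per agent
```

The design choice the sketch highlights is that the policy emits one latent per step regardless of the number of agents, while agent-specific detail enters only through the decoders and their local observations.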

4. Policy Optimization and Generalization

Policies operating in a CLAS are trained via RL (or behavior cloning). Typically only the encoder/policy mapping or the prior $p_\psi(z|o)$ is updated, while decoders or scripts are held fixed or minimally updated for each new context:

RL in Latent Space: SAC, PPO, or other methods optimize returns over sequences of latent actions, with gradient updates confined to latent policy parameters. Policy gradient updates utilize the score-function estimator for discrete/quantized latents and reparameterization for continuous variants (Aljalbout et al., 2022, Zhao et al., 2019, Jia et al., 27 Mar 2025).
Continual Learning and Adaptation: In dynamic or non-stationary action spaces, CLAS enables forward transfer and near-zero forgetting. Adaptation is localized to the decoder (e.g., via EWC regularization); the policy in latent space is untouched, ensuring knowledge accumulation across tasks (Pan et al., 6 Jun 2025).
Sample Efficiency and Robustness: Across domains, CLAS substantially reduces required dataset size and label supervision. For example, in visual servoing, CLASP achieves comparable performance to fully supervised models with $100\times$ fewer labeled sequences (Rybkin et al., 2018). In multi-robot manipulation, CLAS converges $3\times$ faster than decentralized baselines and is robust to external disturbances (Aljalbout et al., 2022).
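For the discrete case mentioned in the list above, the score-function (REINFORCE) estimator can be sketched on a toy problem: a softmax policy over three latent codes with a fixed reward per code. The rewards, the learning rate, and the function names are assumptions for illustration, and the loop uses the exact expectation of the estimator instead of sampled codes so that the result is deterministic. The reparameterized update for continuous latents is not shown.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_function_grad(logits, k, reward, baseline=0.0):
    """Single-sample REINFORCE gradient w.r.t. the logits for sampled code k.

    For a softmax policy, grad log pi(k) = onehot(k) - pi; it is scaled by
    the advantage (reward - baseline).
    """
    pi = softmax(logits)
    adv = reward - baseline
    return [((1.0 if j == k else 0.0) - p) * adv for j, p in enumerate(pi)]


def expected_grad(logits, rewards):
    """Exact expectation of the estimator, enumerating all codes k ~ pi."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for k, (p, r) in enumerate(zip(pi, rewards)):
        g = score_function_grad(logits, k, r)
        grad = [a + p * b for a, b in zip(grad, g)]
    return grad


# Toy setup: three latent codes with fixed (hypothetical) rewards.
rewards = [0.1, 1.0, 0.3]
logits = [0.0, 0.0, 0.0]
for _ in range(200):  # plain gradient ascent on the expected reward
    g = expected_grad(logits, rewards)
    logits = [l + 0.5 * gi for l, gi in zip(logits, g)]

probs = softmax(logits)
print(probs)  # most of the probability mass ends up on code 1
```

In practice the expectation is replaced by sampled codes and a learned baseline, which trades the determinism shown here for scalability to large codebooks.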

5. Application Domains

CLAS methodologies are prominent in:

Multi-Agent Coordination: In robot manipulation, a low-dimensional latent $z$ mediates synchronized action for multiple agents, overcoming sample complexity and exploration bottlenecks in high DoF systems (Aljalbout et al., 2022).
Continual and Lifelong RL: Dynamic tasks with evolving action sets benefit from a stable latent policy, with only decoders requiring adaptation or elasticity (Pan et al., 6 Jun 2025).
Vision-Language-Action Models: CLAP leverages contrastive CLAS quantization to unify human video and robot proprioceptive data, enabling instruction following, cross-domain generalization, and robust policy finetuning under catastrophic forgetting constraints (Zhang et al., 7 Jan 2026).
Dialog Agents and LLMs: Treating latent action as the dialogue agent's action enables smoother RL training, less exposure bias, and better trade-offs in response diversity and task utility compared to token-level RL (Zhao et al., 2019, Jia et al., 27 Mar 2025).
Video World Modeling and Transfer: Olaf-World demonstrates that anchoring CLAS to observable semantic effects allows for effective zero-shot action transfer, robust domain adaptation, and strong performance under severe OOD shifts (Jiang et al., 10 Feb 2026).
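The continual-RL entry in the list above says that only decoders need adaptation or elasticity. One standard way to add such elasticity is the elastic weight consolidation (EWC) penalty, sketched below for a flat list of decoder parameters. How the cited work applies EWC is not specified here, so the diagonal Fisher values, `lam`, and the function names are assumptions.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam=1.0):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_i)^2 with a diagonal Fisher.

    anchor_params are the decoder weights after the previous task;
    fisher_diag estimates how important each weight was for that task.
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def decoder_loss(task_loss, params, anchor_params, fisher_diag, lam=1.0):
    """New-task loss plus the EWC term that discourages forgetting."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)


anchor = [0.0, 2.0]    # decoder weights after the previous task
current = [1.0, 2.0]   # candidate weights while adapting to the new task
fisher = [4.0, 100.0]  # hypothetical importance estimates

print(ewc_penalty(current, anchor, fisher, lam=2.0))        # 4.0
print(decoder_loss(0.5, current, anchor, fisher, lam=2.0))  # 4.5
```

Only the first weight moved, so only its Fisher value contributes; the heavily weighted second parameter would be expensive to change, which is how the penalty protects what earlier tasks relied on.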

6. Empirical Performance, Ablations, and Best Practices

Quantitative evaluations in the cited works report improvements over baselines in sample efficiency, diversity, generalization, and stability.

Coordination/Transfer: CLAS outperforms single-agent and fully decentralized multi-agent RL approaches in both final performance and robustness to disturbances (Aljalbout et al., 2022).
Continual RL: CLAS achieves higher mean continual return and forward transfer, with nearly zero forgetting, across MiniGrid, Procgen, and Atari action-set expansion/contraction tasks (Pan et al., 6 Jun 2025).
Generalization Metrics:

| Domain | CLAS Method | Key Metric(s) | Baseline(s) | Empirical Result |
|---|---|---|---|---|
| Multi-Robot | (Aljalbout et al., 2022) | Convergence steps, robustness | FULL_DEC, SHARED_Q | CLAS 2.5e5 steps ($\sim 3\times$ faster); up to 100% robust |
| Continual RL | (Pan et al., 6 Jun 2025) | Continual return, forgetting | FT, EWC, CLEAR | CLAS $R=0.9$, $F\approx 0$, highest T |
| World Modeling | (Jiang et al., 10 Feb 2026) | Macro-F1, RPE | AdaWorld | $\sim$30% higher F1; lower RPE at all data levels |
| LLMs | (Jia et al., 27 Mar 2025) | Math500 %, semantic diversity | Token RL | 42.4% vs 38.2% (RL), diversity ratio $1.3\times$ |
| Video to Robot | (Zhang et al., 7 Jan 2026) | Precision, generalization | VLA baselines | Significant outperformance on transfer, OOD |

Ablations confirm that regularization (e.g., via $\Delta$ alignment, KL penalties, or contrastive objectives) and quantization are critical for cross-context centrality and zero-shot adaptation. Discrete latent spaces tend to be more stable in policy RL than unbounded continuous spaces (Zhao et al., 2019).

7. Limitations and Future Directions

CLAS imposes significant structure on agent behavior and enables powerful transfer and generalization, but current limitations include:

Difficulty scaling to extremely high-DoF or long-horizon compositional domains without hierarchical or explicitly structured composition operators (Rybkin et al., 2018).
Dependence on the expressivity and proper regularization of the encoder-decoder, especially in continually evolving environments (Pan et al., 6 Jun 2025).
Need for improved alignment and disentanglement methods to enable more principled group-theoretic latent action structures (Rybkin et al., 2018, Zhang et al., 7 Jan 2026).
Distinguishing semantic effect alignment from style and environmental confounders, especially in vision-based learning from third-person video (Jiang et al., 10 Feb 2026).

Continued development of CLAS is expected to focus on hierarchical structure, cross-modal scaling, policy-adaptive decoders, integration with large-scale self-supervised learning, and rigorous theory for compositional action representations. CLAS has already established itself as a unifying principle for efficient, generalizable, and robust policy learning in complex, real-world domains.