Contrastive Latent Action Pretraining (CLAP)
- CLAP is a method that employs contrastive learning to extract disentangled, actionable latent representations from high-dimensional video and audio data.
- It separates action-relevant features from visual and acoustic distractors, enabling robust policy transfer in robotics and precise audio query alignment.
- The framework combines multi-modal contrastive losses with reconstruction objectives, improving zero-shot performance in both robotic control and paralinguistic tasks.
Contrastive Latent Action Pretraining (CLAP) refers to a family of methods that leverage contrastive learning to acquire disentangled, actionable latent representations from high-dimensional sequential data, most prominently video and audio, for downstream robotic control or paralinguistic tasks. In robotics, CLAP enables effective policy transfer from large-scale human demonstration videos to embodied robotic agents by constructing action-relevant latent spaces that are physically executable and semantically meaningful. In the audio domain, CLAP aligns acoustic events with language queries to facilitate generalizable, open-vocabulary paralinguistic analysis. The core innovation across domains is the use of contrastive objectives to force the model to learn latents that are robust to nuisance variation and closely related to task-relevant semantic content.
1. Problem Setting and Motivation
In the context of vision-language-action (VLA) models for robotics, annotated real-robot trajectories are scarce and expensive to obtain, limiting the scalability of generalist policies. Human video demonstrations, by contrast, are abundant but lack explicit action supervision. Unsupervised or VQ-VAE-based latent action models trained on such data often capture superficial visual features and suffer from entangled representations that do not transfer well to physical control. CLAP formulations address visual entanglement by learning a latent action space where action semantics are isolated from visual distractors and are directly alignable with robot proprioceptive trajectories (Dai et al., 31 Jan 2026, Zhang et al., 7 Jan 2026).
In the audio domain, standard CLAP models generalize the contrastive vision-language approach to language-audio pairs, extending beyond closed-set classification to allow open-ended, query-based retrieval and analysis. Adapted models for computational paralinguistic tasks (e.g., emotion, speaker state) further demonstrate that careful tailoring of pretraining data and query templates is essential to achieve strong performance in zero-shot, label-efficient settings (Jing et al., 2024).
2. Model Architectures and Latent Structure
Visual-Latent CLAP for Robotics
CLAP models for vision-language-action employ spatial-temporal transformer encoders to process frame pairs or trajectories, producing a high-dimensional latent embedding.
The embedding is split into action-related and visual-related components, each passed through a separate two-layer MLP head to produce an action embedding and a visual embedding. The action embedding is quantized via a learned codebook (VQ-VAE style), yielding discrete latent tokens that serve as "latent actions". The decoder reconstructs future visual observations using only the quantized action codes and the current image, which forces the action codes to encode dynamic (not static) content (Dai et al., 31 Jan 2026).
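The quantization step can be illustrated with a nearest-neighbour codebook lookup, as is usual in VQ-VAE-style models. This is a minimal sketch, not the authors' implementation: the function name `quantize`, the Euclidean distance metric, and the array sizes are assumptions made for illustration.

```python
import numpy as np

def quantize(action_embeddings: np.ndarray, codebook: np.ndarray):
    """Map each continuous action embedding to its nearest codebook entry.

    action_embeddings: (batch, dim) outputs of the action head.
    codebook:          (num_codes, dim) learned code vectors.
    Returns (indices, quantized) with shapes (batch,) and (batch, dim).
    """
    # Squared Euclidean distance between every embedding and every code.
    diffs = action_embeddings[:, None, :] - codebook[None, :, :]
    distances = np.sum(diffs ** 2, axis=-1)      # (batch, num_codes)
    indices = np.argmin(distances, axis=1)       # discrete latent-action tokens
    quantized = codebook[indices]                # code vectors handed to the decoder
    return indices, quantized

# Illustrative sizes only: 8 codes of dimension 4.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
embeddings = codebook[[3, 5]] + 0.01 * rng.normal(size=(2, 4))
tokens, quantized = quantize(embeddings, codebook)
print(tokens)
```

In training, the non-differentiable `argmin` is typically bypassed with a straight-through gradient estimator; that detail is omitted here.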
A separate Act-VAE is trained on robot proprioceptive action trajectories to define a physically grounded latent codebook. Subsequent video-based latents are aligned to this codebook by contrastive learning, constructing a direct mapping from human-demonstrated motion segments to executable robot actions (Zhang et al., 7 Jan 2026).
Audio-Latent CLAP
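CLAP models for audio use a two-tower architecture: an audio encoder and a text encoder, each projecting its modality into a shared d-dimensional latent space via a learned MLP head. The audio embedding and the text embedding are each obtained by applying the corresponding projection head to the encoder output.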
Latent similarity is measured with cosine similarity, and the contrastive InfoNCE loss aligns matching audio-text pairs while repelling mismatches. This architecture permits generalization to arbitrary, open-vocabulary text queries at inference (Jing et al., 2024).
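The two-tower objective and the query-based inference it permits can be sketched as follows. This is a hedged illustration operating on precomputed embeddings: the function names, the fixed temperature of 0.07, and the random inputs are assumptions, and the encoders themselves are not modelled.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """Bidirectional InfoNCE: row i of audio_emb is paired with row i of text_emb."""
    a, t = l2_normalize(audio_emb), l2_normalize(text_emb)
    logits = (a @ t.T) / temperature             # (batch, batch) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (audio_to_text + text_to_audio))

def best_query(audio_emb, query_embs):
    """Zero-shot inference: index of the text query most similar to one audio clip."""
    scores = l2_normalize(query_embs) @ l2_normalize(audio_emb)
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 16))
text = audio + 0.05 * rng.normal(size=(4, 16))   # stand-in for well-aligned pairs
aligned_loss = symmetric_info_nce(audio, text)
shuffled_loss = symmetric_info_nce(audio, np.roll(text, 1, axis=0))
print(aligned_loss < shuffled_loss)
```

Because inference only compares embeddings, the candidate queries passed to `best_query` can be any text, which is what makes open-vocabulary use possible.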
3. Contrastive Objectives and Disentanglement Mechanisms
The key innovation in CLAP is the use of contrastive losses to achieve disentanglement and transferability:
- Action-centric supervised contrastive loss: Forces the action embedding to cluster by action category across diverse visual contexts, using pseudo-labels or codebook indices as anchors. For each anchor, the positive set contains the same-category samples and the negative set contains all others (Dai et al., 31 Jan 2026).
- Vision-centric unsupervised contrastive loss: Separates visual appearance from motion by applying InfoNCE to samples differing only in frame order ("normal" vs. "reversed"), which alters motion but preserves static content.
- Cross-modal dynamics alignment: For robot transfer, CLAP aligns video-inferred latents and robot proprioceptive latents using a SigLIP-style contrastive loss between the video-derived latents and the robot-derived latents, enforcing a one-to-one mapping between human-observable motions and executable actions (Zhang et al., 7 Jan 2026).
- Symmetric InfoNCE for Audio-Language: Applies a bidirectional contrastive loss over pairs of audio and text queries, maximizing the similarity of true pairs and minimizing that of impostors (Jing et al., 2024).
These losses are combined with pixel-level or observation reconstruction losses to preserve dynamic consistency, with loss-weight hyperparameters setting the tradeoff between the two.
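Two of the objectives listed above can be sketched in their commonly used textbook forms. The exact formulations in the cited papers may differ: the supervised contrastive loss below follows the standard form with a denominator over all other samples, and the pairwise sigmoid loss follows the SigLIP formulation. The temperature, scale, and bias values are illustrative assumptions.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Pull same-label embeddings together and push other samples apart.

    Assumes at least two samples and at least one same-label pair.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = np.asarray(labels)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Mask self-similarity so an anchor never counts itself.
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denominator
    positives = (labels[:, None] == labels[None, :]) & not_self
    num_pos = positives.sum(axis=1)
    has_pos = num_pos > 0                        # skip anchors without positives
    summed = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float((-summed[has_pos] / num_pos[has_pos]).mean())

def pairwise_sigmoid_loss(video_latents, robot_latents, scale=10.0, bias=-10.0):
    """SigLIP-style loss: every (i, j) pair is an independent binary decision."""
    v = video_latents / np.linalg.norm(video_latents, axis=1, keepdims=True)
    r = robot_latents / np.linalg.norm(robot_latents, axis=1, keepdims=True)
    n = v.shape[0]
    logits = scale * (v @ r.T) + bias
    signs = 2.0 * np.eye(n) - 1.0                # +1 for matched pairs, -1 otherwise
    # -log(sigmoid(x)) computed stably as logaddexp(0, -x).
    return float(np.logaddexp(0.0, -signs * logits).sum() / n)

rng = np.random.default_rng(0)
centers = np.eye(8)[:2]                          # two synthetic action categories
emb = np.repeat(centers, 3, axis=0) + 0.01 * rng.normal(size=(6, 8))
good = supervised_contrastive_loss(emb, [0, 0, 0, 1, 1, 1])
bad = supervised_contrastive_loss(emb, [0, 1, 0, 1, 0, 1])

video = rng.normal(size=(4, 16))
robot = video + 0.05 * rng.normal(size=(4, 16))
matched = pairwise_sigmoid_loss(video, robot)
mismatched = pairwise_sigmoid_loss(video, np.roll(robot, 1, axis=0))
print(good < bad, matched < mismatched)
```

Unlike softmax-based InfoNCE, the sigmoid variant does not normalise across the batch, so each video-robot pair contributes independently to the loss.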
4. Training Protocols, Data, and Codebook Design
Robotics
- Data sources: Large-scale human demonstration datasets such as Something-Something V2 (approximately 220K clips, 174 categories) and real robot teleoperation (BridgeV2, AgiBot World, Astribot S1 VR) (Dai et al., 31 Jan 2026, Zhang et al., 7 Jan 2026).
- Pseudo-label generation: Actions are coarsely labeled by extracting verbs and directions from dataset annotations or instructions, mapped to 80 to 174 action classes (Dai et al., 31 Jan 2026).
- Pretraining stages:
  - Learn quantized latent actions from human video with contrastive disentanglement.
  - Train an autoregressive VLM (e.g., Large World Model-7B, Qwen3VL-4B) to predict these tokens from the present image and instruction, freezing the vision encoder and fine-tuning only the LLM.
  - Attach a new head to map VLM outputs to real robot joint-space actions, fine-tuned with mean-squared error on a small set (100 to 150) of real demonstrations (Dai et al., 31 Jan 2026, Zhang et al., 7 Jan 2026).
- Codebook design: Act-VAE codebooks are tuned (e.g., 8 for ConLA, 256 for CLAP) for the best rate-distortion tradeoff.
Audio
- Data sources: MSP-Podcast corpus for emotion-rich English speech (~110 h, 9 emotions), paired with query templates generated from categorical labels and acoustic feature descriptors.
- Query generation: Expert-designed templates combine categorical and prosodic terms (e.g., "pitch is high") with random conjunctions to enable fine-grained alignment (Jing et al., 2024).
- Model and optimization: wav2vec2.0-large or CNN14 audio encoder; BERT-base text encoder; two-layer projection MLPs; InfoNCE loss with learnable temperature; Adam optimizer with staged learning rates (Jing et al., 2024).
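The query-generation step can be illustrated with a small template builder. Only the phrase "pitch is high" comes from the description above; the conjunction list, the "speaker is ..." wording, and the function name `build_query` are hypothetical and are not the templates used by Jing et al.

```python
import random

# Hypothetical conjunctions; the actual set used in the paper is not given here.
CONJUNCTIONS = ["and", "while", "and also"]

def build_query(emotion, descriptors, rng):
    """Join a categorical label and prosodic descriptors with random conjunctions.

    emotion:     categorical label, e.g. "angry".
    descriptors: mapping from acoustic feature name to a coarse level.
    rng:         random.Random instance, passed in for reproducibility.
    """
    parts = [f"speaker is {emotion}"]
    parts += [f"{name} is {level}" for name, level in descriptors.items()]
    rng.shuffle(parts)                           # vary the order of the clauses
    query = parts[0]
    for part in parts[1:]:
        query += f" {rng.choice(CONJUNCTIONS)} {part}"
    return query

rng = random.Random(0)
query = build_query("angry", {"pitch": "high", "intensity": "low"}, rng)
print(query)
```

Randomising both clause order and conjunctions yields many distinct text queries per audio clip, which is one plausible way to obtain the query diversity the approach relies on.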
5. Empirical Results and Benchmarks
Robotics
Experiments consistently demonstrate that CLAP-pretrained policies, given only human videos for pretraining, outperform models pretrained on robot trajectories or standard VQ-VAE (LAPA) baselines. Representative results:
- Simulation (WidowX/Bridge tasks): ConLA achieves 64.6% average success, compared to LAPA at 52.1% and ActionVLA (robot pretraining) at 63.5% (Dai et al., 31 Jan 2026).
- Real robot (Franka Panda, tabletop tasks): ConLA (48.2%) outperforms LAPA (32.3%) and ActionVLA (~31%) when fine-tuned with 150 real demonstrations (Dai et al., 31 Jan 2026).
- Large-scale real world (Astribot S1, CLAP-RF head): 61% mean success on bimanual manipulation compared to 54% baseline (Zhang et al., 7 Jan 2026).
- Generalization: CLAP-NTP head achieves 85/80% on OOD pick-and-place and 35% (up from 0%) on unseen bouquet tasks (Zhang et al., 7 Jan 2026).
- Ablations: Removing contrastive losses or human videos significantly degrades OOD and average performance (by 15% and 11.3%, respectively) (Zhang et al., 7 Jan 2026).
- Data scaling: Performance increases with more human video, e.g., ConLA at 10/50/100% data achieves 58.3/60.4/64.6% (Dai et al., 31 Jan 2026).
Audio
- Zero-shot emotion recognition: ParaCLAP outperforms generic CLAP and Pengi baselines across multiple datasets (e.g., +21 points UAR on IEMOCAP) when trained with proper emotion queries (Jing et al., 2024).
- Generalization to mixed/cross-domain tasks: More diverse queries boost performance on tasks with mismatched label sets (e.g., 5-class FAU-Aibo) (Jing et al., 2024).
A sample of results is summarized in the table below:
| Task/System | Baseline UAR | ParaCLAP UAR |
|---|---|---|
| IEMOCAP (4-way) | 0.353–0.345 | 0.567 |
| TESS (7-way) | 0.177–0.232 | 0.484 |
| FAU-Aibo (2-way) | 0.470–0.500 | 0.604 |
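UAR (unweighted average recall) in the table is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. A minimal sketch with made-up labels shows how it differs from plain accuracy on imbalanced data; the function name and the toy data are illustrative only.

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy imbalanced example: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
uar = unweighted_average_recall(y_true, y_pred)
print(accuracy, uar)
```

Here class 0 has recall 1.0 and class 1 has recall 0.5, giving a UAR of 0.75 against an accuracy of 5/6. For a balanced K-way task, chance-level UAR is 1/K, which is the reference point for reading the table.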
6. Limitations and Open Challenges
CLAP, while representing a major advance for label-efficient, cross-modal generalization, has several limitations:
- Robotic grounding: Some real robot data is always necessary for final action mapping; purely video-to-actuator pipelines remain open (Zhang et al., 7 Jan 2026).
- Pseudo-label reliance: Visual CLAP approaches require coarse action pseudo-labels, currently extracted with hand-tuned templates (Dai et al., 31 Jan 2026).
- Semantic ambiguities: Mapping dexterous human hand movements to robot effectors is fundamentally ambiguous, especially for highly dexterous or non-anthropomorphic robots (Zhang et al., 7 Jan 2026).
- Pipeline complexity: Most described architectures are multi-stage; fully end-to-end differentiable CLAP remains an engineering challenge (Zhang et al., 7 Jan 2026).
- Data diversity: Both vision and audio CLAP models can be limited by training set biases, annotation sparsity, and lack of open-vocabulary coverage in paralinguistic descriptors (Jing et al., 2024).
7. Broader Impact and Future Directions
CLAP represents a paradigm shift in leveraging unannotated or weakly annotated internet-scale data for embodied policy learning and fine-grained audio understanding. In robotics, this approach enables state-of-the-art generalist policies with drastically reduced reliance on robot-specific demonstrations, leveraging human priors at scale. In paralinguistics, the method enables zero-shot open-ended language-based labeling, subject to continued advances in query generation and cross-lingual robustness.
Emerging research directions for CLAP include expanding grounding signals (force, touch), tighter integration with LLMs for both query synthesis and embodiment transfer, and building larger, more diverse paired datasets in both video and audio domains (Dai et al., 31 Jan 2026, Zhang et al., 7 Jan 2026, Jing et al., 2024).