Contrastive Latent Action Pretraining (CLAP)

Updated 3 July 2026
  • CLAP is a method that employs contrastive learning to extract disentangled, actionable latent representations from high-dimensional video and audio data.
  • It separates action-relevant features from visual and acoustic distractors, enabling robust policy transfer in robotics and precise audio query alignment.
  • The framework combines multi-modal contrastive losses with reconstruction objectives, improving zero-shot performance in both robotic control and paralinguistic tasks.

Contrastive Latent Action Pretraining (CLAP) refers to a family of methods that leverage contrastive learning to acquire disentangled, actionable latent representations from high-dimensional sequential data—most prominently video and audio—for downstream robotic control or paralinguistic tasks. In robotics, CLAP enables effective policy transfer from large-scale human demonstration videos to embodied robotic agents by constructing action-relevant latent spaces that are physically executable and semantically meaningful. In the audio domain, CLAP aligns acoustic events with language queries to facilitate generalizable, open-vocabulary paralinguistic analysis. The core innovation across domains is the use of contrastive objectives to force the model to learn latents that are robust to nuisance variation and closely related to task-relevant semantic content.

1. Problem Setting and Motivation

In the context of vision-language-action (VLA) models for robotics, annotated real-robot trajectories are scarce and expensive to obtain, limiting the scalability of generalist policies. Human video demonstrations, by contrast, are abundant but lack explicit action supervision. Unsupervised or VQ-VAE-based latent action models trained on such data often capture superficial visual features and suffer from entangled representations that do not transfer well to physical control. CLAP formulations address visual entanglement by learning a latent action space where action semantics are isolated from visual distractors and are directly alignable with robot proprioceptive trajectories (Dai et al., 31 Jan 2026, Zhang et al., 7 Jan 2026).

In the audio domain, standard CLAP models generalize the contrastive vision-language approach to language-audio pairs, extending beyond closed-set classification to allow open-ended, query-based retrieval and analysis. Adapted models for computational paralinguistic tasks (e.g., emotion, speaker state) further demonstrate that careful tailoring of pretraining data and query templates is essential to achieve strong performance in zero-shot, label-efficient settings (Jing et al., 2024).

2. Model Architectures and Latent Structure

Visual-Latent CLAP for Robotics

CLAP models for vision-language-action employ spatial-temporal transformer encoders to process frame pairs or trajectories, producing high-dimensional latent embeddings:

$$Z = I_\phi([O_t, O_{t+k}]) \in \mathbb{R}^d$$

$Z$ is split into action-related ($Z_{a'}$) and visual-related ($Z_{v'}$) components, each passed through a separate two-layer MLP head to produce $Z_a$ (action) and $Z_v$ (visual). The action embedding is quantized via a learned codebook $C$ (VQ-VAE style), yielding discrete latent tokens $Z_{aq}$ corresponding to "latent actions" $z_t$. The decoder reconstructs future visual observations, but does so using only the quantized action codes and the current image, enforcing that $z_t$ must encode dynamic (not static) content (Dai et al., 31 Jan 2026).
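The split-then-quantize step can be illustrated with a minimal sketch. It is not the authors' implementation: the MLP heads are omitted, the split is a plain slice, and the function names, the 4-dimensional latent, and the 3-entry codebook are all assumptions chosen for illustration.

```python
def split_latent(z, action_dim):
    """Split an encoder output into action-related and visual-related parts.

    Assumption: the split is a simple slice; the source only says Z is split.
    """
    return z[:action_dim], z[action_dim:]


def quantize(z_a, codebook):
    """Return (index, code) of the codebook entry nearest to z_a (squared L2)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    idx = min(range(len(codebook)), key=lambda k: sq_dist(z_a, codebook[k]))
    return idx, codebook[idx]


# Toy example: the first 2 of 4 dimensions are treated as action-related.
z = [0.9, 0.1, -0.3, 0.7]
z_a_raw, z_v_raw = split_latent(z, action_dim=2)

# Hypothetical 3-entry codebook standing in for the learned codebook C.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
token, z_aq = quantize(z_a_raw, codebook)
```

Here `token` plays the role of the discrete latent action and `z_aq` the quantized embedding that a decoder would receive alongside the current image.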

A separate Act-VAE is trained on robot proprioceptive action trajectories to define a physically grounded latent codebook. Subsequent video-based latents are aligned to this codebook by contrastive learning, constructing a direct mapping from human-demonstrated motion segments to executable robot actions (Zhang et al., 7 Jan 2026).
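One way to realize this kind of alignment is a pairwise sigmoid loss of the sort used in SigLIP, which Section 3 names as the style of loss involved. The sketch below assumes batches of paired video and robot latents, where row i of one batch matches row i of the other; the function names and the temperature and bias values are illustrative, not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pair_loss(video_latents, robot_latents, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss: matched pairs (i == j) are pulled together,
    every other pair is pushed apart, each treated as a binary decision."""
    v = [normalize(x) for x in video_latents]
    r = [normalize(x) for x in robot_latents]
    n = len(v)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(v[i], r[j]) + bias
            total -= log_sigmoid(label * logit)
    return total / n


video = [[1.0, 0.0], [0.0, 1.0]]
aligned = sigmoid_pair_loss(video, [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pair_loss(video, [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs the correctly paired batch yields a much smaller loss than the swapped one, which is the behavior the alignment objective relies on.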

Audio-Latent CLAP

CLAP models for audio use a two-tower architecture: an audio encoder and a text encoder, each projecting its modality into a shared latent space via a learned MLP head.

Latent similarity is measured with cosine similarity, and the contrastive InfoNCE loss aligns matching audio-text pairs while repelling mismatches. This architecture permits generalization to arbitrary, open-vocabulary text queries at inference (Jing et al., 2024).
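A minimal sketch of the symmetric InfoNCE objective and of zero-shot query matching follows. It assumes the two towers have already produced embeddings, with row i of the audio batch paired with row i of the text batch; the function names and the fixed temperature of 0.07 are assumptions (the source describes a learnable temperature).

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows, i.e. the target is the diagonal."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)


def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """Average of audio-to-text and text-to-audio InfoNCE on cosine similarity."""
    a = [normalize(x) for x in audio_emb]
    t = [normalize(x) for x in text_emb]
    logits = [[dot(ai, tj) / temperature for tj in t] for ai in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


def zero_shot_label(audio_vec, query_embs):
    """Index of the text query whose embedding is most similar to the audio."""
    a = normalize(audio_vec)
    sims = [dot(a, normalize(q)) for q in query_embs]
    return max(range(len(sims)), key=sims.__getitem__)


audio = [[1.0, 0.0], [0.0, 1.0]]
matched = symmetric_info_nce(audio, [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_info_nce(audio, [[0.0, 1.0], [1.0, 0.0]])
best = zero_shot_label([0.9, 0.2], [[1.0, 0.0], [0.0, 1.0]])
```

At inference, `zero_shot_label` is all that is needed: any set of text queries can be embedded and ranked against an audio clip, which is what makes the approach open-vocabulary.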

3. Contrastive Objectives and Disentanglement Mechanisms

The key innovation in CLAP is the use of contrastive losses to achieve disentanglement and transferability:

  • Action-centric supervised contrastive loss: Forces $Z_a$ to cluster by action category across diverse visual contexts, using pseudo-labels or codebook indices as anchors. For each anchor, same-category samples act as positives and all other samples as negatives (Dai et al., 31 Jan 2026).

  • Vision-centric unsupervised contrastive loss: Separates visual appearance from motion by applying InfoNCE to samples differing only in frame order ("normal" vs. "reversed"), which alters motion but preserves static content (Dai et al., 31 Jan 2026).

  • Cross-modal dynamics alignment: For robot transfer, CLAP aligns video-inferred latents with robot proprioceptive latents using a SigLIP-style contrastive loss, enforcing a one-to-one mapping between human-observable motions and executable actions (Zhang et al., 7 Jan 2026).
  • Symmetric InfoNCE for Audio-Language: Applies a bidirectional contrastive loss over pairs of audio and text queries, maximizing the similarity of true pairs and minimizing that of impostors (Jing et al., 2024).
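The action-centric objective in the list above can be sketched with the standard supervised contrastive formulation, in which each anchor is pulled toward samples sharing its label and pushed away from the rest. This is an assumption about the general form, not the exact loss from the paper; the function name, the temperature, and the toy embeddings are illustrative.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss over a batch of labeled embeddings."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Log-sum-exp over every other sample in the batch (the denominator).
        logits = [dot(z[i], z[a]) / temperature for a in range(n) if a != i]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        pos_terms = [dot(z[i], z[p]) / temperature - log_denom for p in positives]
        total -= sum(pos_terms) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Two tight clusters; labels either agree or disagree with the geometry.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_consistent = supervised_contrastive_loss(emb, [0, 0, 1, 1])
loss_inconsistent = supervised_contrastive_loss(emb, [0, 1, 0, 1])
```

The loss is low when samples with the same pseudo-label already sit close together and high otherwise, so minimizing it pushes the action embedding to organize by action category rather than by visual context.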

These losses are always combined with pixel-level or observation reconstruction losses to preserve dynamic consistency, with hyperparameter choices dictating the exact tradeoff.

4. Training Protocols, Data, and Codebook Design

Robotics

  • Data sources: Large-scale human demonstration datasets such as Something-Something V2 (≈220K clips, 174 categories) and real robot teleoperation (BridgeV2, AgiBot World, Astribot S1 VR) (Dai et al., 31 Jan 2026, Zhang et al., 7 Jan 2026).
  • Pseudo-label generation: Actions are coarsely labeled by extracting verbs and directions from dataset annotations or instructions, mapped to 80–174 action classes (Dai et al., 31 Jan 2026).
  • Pretraining stages:
  1. Learn quantized latent actions from human video with contrastive disentanglement.
  2. Train an autoregressive VLM (e.g., Large World Model 7B, Qwen3VL-4B) to predict these tokens from the present image and instruction, freezing the vision encoder and fine-tuning only the LLM.
  3. Attach a new head to map VLM outputs to real robot joint-space actions, fine-tuned with mean-squared error on a small set (100–150) of real demonstrations (Dai et al., 31 Jan 2026, Zhang et al., 7 Jan 2026).
  • Codebook design: Latent-action codebook sizes are tuned (e.g., 8 for ConLA, 256 for CLAP) for the best rate-distortion tradeoff.
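The pseudo-label step in the list above can be sketched as a simple keyword match over instructions. The verb stems, direction words, and label format below are hypothetical; the source only states that verbs and directions are extracted and mapped to action classes.

```python
import re

# Hypothetical vocabularies: stem -> canonical verb, plus direction words.
VERB_STEMS = {
    "push": "push", "pull": "pull", "pick": "pick",
    "plac": "place", "open": "open", "clos": "close",
}
DIRECTIONS = ("left", "right", "up", "down")


def pseudo_label(instruction):
    """Return a coarse 'verb' or 'verb_direction' label, or None if no verb matches."""
    tokens = re.findall(r"[a-z]+", instruction.lower())
    verb = next(
        (name for stem, name in VERB_STEMS.items()
         if any(tok.startswith(stem) for tok in tokens)),
        None,
    )
    if verb is None:
        return None
    direction = next((d for d in DIRECTIONS if d in tokens), None)
    return verb if direction is None else f"{verb}_{direction}"


def build_class_ids(instructions):
    """Map each distinct pseudo-label found in the data to an integer class id."""
    labels = sorted({lab for lab in map(pseudo_label, instructions) if lab})
    return {lab: i for i, lab in enumerate(labels)}


class_ids = build_class_ids([
    "Pushing the cup to the left",
    "Close the drawer",
    "Look at the table",
])
```

Instructions that match no verb are left unlabeled here; how such clips are handled in practice is not specified in the source.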

Audio

  • Data sources: MSP-Podcast corpus for emotion-rich English speech (~110 h, 9 emotions), paired with query templates generated from categorical labels and acoustic feature descriptors.
  • Query generation: Expert-designed templates combine categorical and prosodic terms (e.g., "pitch is high") with random conjunctions to enable fine-grained alignment (Jing et al., 2024).
  • Model and optimization: wav2vec2.0-large or CNN14 audio encoder; BERT-base text encoder; two-layer projection MLPs; InfoNCE loss with learnable temperature; Adam optimizer with staged learning rates (Jing et al., 2024).
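The query-generation idea in the list above can be sketched as template filling with randomly chosen conjunctions. Only the phrase "pitch is high" comes from the source; the emotion template, the conjunction list, and the function name are assumptions made for illustration.

```python
import random


def make_query(emotion, prosody, rng):
    """Build one text query from a categorical label and prosodic descriptors.

    prosody maps a feature name to a level, e.g. {"pitch": "high"}.
    """
    parts = [f"speaker is {emotion}"]
    parts += [f"{feature} is {level}" for feature, level in prosody.items()]
    rng.shuffle(parts)  # vary the order so no single phrasing dominates
    query = parts[0]
    for part in parts[1:]:
        query += rng.choice([" and ", ", ", " while "]) + part
    return query


rng = random.Random(0)  # seeded for reproducibility
query = make_query("angry", {"pitch": "high", "intensity": "low"}, rng)
```

Each training utterance can be paired with several such queries, giving the text encoder varied phrasings of the same underlying label.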

5. Empirical Results and Benchmarks

Robotics

Experiments consistently demonstrate that CLAP-pretrained policies, given only human videos for pretraining, outperform models pretrained on robot trajectories or standard VQ-VAE (LAPA) baselines. Representative results:

  • Simulation (WidowX/Bridge tasks): ConLA achieves 64.6% average success, compared to LAPA at 52.1% and ActionVLA (robot pretraining) at 63.5% (Dai et al., 31 Jan 2026).
  • Real robot (Franka Panda, tabletop tasks): ConLA (48.2%) outperforms LAPA (32.3%) and ActionVLA (~31%) when fine-tuned with 150 real demonstrations (Dai et al., 31 Jan 2026).
  • Large-scale real world (Astribot S1, CLAP-RF head): 61% mean success on bimanual manipulation compared to 54% baseline (Zhang et al., 7 Jan 2026).
  • Generalization: CLAP-NTP head achieves 85/80% on OOD pick-and-place and 35% (up from 0%) on unseen bouquet tasks (Zhang et al., 7 Jan 2026).
  • Ablations: Removing contrastive losses or human videos significantly degrades OOD and average performance (−15% and −11.3%, respectively) (Zhang et al., 7 Jan 2026).
  • Data scaling: Performance increases with more human video, e.g., ConLA at 10/50/100% data achieves 58.3/60.4/64.6% (Dai et al., 31 Jan 2026).

Audio

  • Zero-shot emotion recognition: ParaCLAP outperforms generic CLAP and Pengi baselines across multiple datasets (e.g., +21 points UAR on IEMOCAP) when trained with proper emotion queries (Jing et al., 2024).
  • Generalization to mixed/cross-domain tasks: More diverse queries boost performance on tasks with mismatched label sets (e.g., 5-class FAU-Aibo) (Jing et al., 2024).

A sample of results is summarized in the table below:

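| Dataset (task)   | Baseline UAR (range) | ParaCLAP UAR |
|------------------|----------------------|--------------|
| IEMOCAP (4-way)  | 0.353–0.345          | 0.567        |
| TESS (7-way)     | 0.177–0.232          | 0.484        |
| FAU-Aibo (2-way) | 0.470–0.500          | 0.604        |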
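UAR (unweighted average recall) is the mean of per-class recalls, so every class counts equally regardless of how many samples it has. A minimal sketch, with an illustrative function name and toy labels:

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, computed over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Imbalanced toy set: always predicting the majority class gives 80% accuracy
# but recall 1.0 on "neutral" and 0.0 on "angry".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because a constant predictor scores 1/K on a K-way task, the 4-way, 7-way, and 2-way results in the table have chance levels of 0.25, about 0.143, and 0.5 respectively.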

6. Limitations and Open Challenges

CLAP, while representing a major advance for label-efficient, cross-modal generalization, has several limitations:

  • Robotic grounding: Some real robot data is always necessary for final action mapping; purely video-to-actuator pipelines remain open (Zhang et al., 7 Jan 2026).
  • Pseudo-label reliance: Visual CLAP approaches require coarse action pseudo-labels, currently extracted with hand-tuned templates (Dai et al., 31 Jan 2026).
  • Semantic ambiguities: Mapping dexterous human hand movements to robot effectors is fundamentally ambiguous, especially for highly dexterous or non-anthropomorphic robots (Zhang et al., 7 Jan 2026).
  • Pipeline complexity: Most described architectures are multi-stage; fully end-to-end differentiable CLAP remains an engineering challenge (Zhang et al., 7 Jan 2026).
  • Data diversity: Both vision and audio CLAP models can be limited by training set biases, annotation sparsity, and lack of open-vocabulary coverage in paralinguistic descriptors (Jing et al., 2024).

7. Broader Impact and Future Directions

CLAP represents a paradigm shift in leveraging unannotated or weakly annotated internet-scale data for embodied policy learning and fine-grained audio understanding. In robotics, this approach enables state-of-the-art generalist policies with drastically reduced reliance on robot-specific demonstrations, leveraging human priors at scale. In paralinguistics, the method enables zero-shot open-ended language-based labeling, subject to continued advances in query generation and cross-lingual robustness.

Emerging research directions for CLAP include expanding grounding signals (force, touch), tighter integration with LLMs for both query synthesis and embodiment transfer, and building larger, more diverse paired datasets in both video and audio domains (Dai et al., 31 Jan 2026, Zhang et al., 7 Jan 2026, Jing et al., 2024).
