CLAMP: Contrastive 3D Multi-View Robotic Pretraining

Updated 7 February 2026
  • The paper demonstrates CLAMP's novel integration of 3D multi-view observations, textual context, and robot actions into a unified contrastive pretraining framework.
  • It leverages multi-modal triplets with specialized encoders—including a 3D-aware ViT—to achieve significant improvements over traditional 2D methods.
  • CLAMP’s pipeline, enhanced by diffusion policy pretraining and targeted fine-tuning, results in marked increases in both simulated and real-world robotic manipulation success rates.

Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP) is a pre-training framework designed to address the limitations of 2D visual pretraining in robotic manipulation by explicitly incorporating 3D geometry, multi-view observations, and robot actions into a unified contrastive learning objective. CLAMP leverages multi-modal, multi-view data and contrastive objectives to learn joint representations that are highly aligned with the requirements of complex, precise manipulation tasks in both simulated and real-world settings (Liu et al., 31 Jan 2026).

1. Problem Setup and Foundations

The core objective of CLAMP is to pre-train encoders that map multi-view 3D observations, textual task context, and robot action histories to a shared embedding space, optimizing for high fidelity in geometric and action-relevant properties crucial for manipulation. This contrasts with prior 2D representation learning, which struggles to encode spatial relationships and fine positional cues essential for high-precision tasks.

CLAMP operates on large datasets of expert demonstrations formulated as triplets $\{(I_i, T_i, A_i)\}_{i=1}^N$, where:

  • $I \in \mathbb{R}^{H_I \times W_I \times 4}$: horizontally stacked, four-channel re-rendered views (depth + global $X, Y, Z$ coordinates).
  • $T$: tokenized text string encoding the high-level task description, per-object names and normalized positions, and a discrete task-progress indicator.
  • $A$: windowed sequence of the $H$ most recent joint-space actions.

Three distinct encoders, $f(I)$, $g(T)$, $h(A)$, each project their respective modality into a 768-D $\ell_2$-normalized embedding. The CLAMP contrastive objective constrains pairs and triplets of these modalities to co-locate in embedding space if they originate from the same world-state and diverge otherwise (Liu et al., 31 Jan 2026).
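As a concrete sketch, the three encoders can be stubbed out as linear maps into a shared 768-D unit sphere. The projection weights and per-modality feature dimensions below are placeholders, not from the paper; the real $f$, $g$, $h$ are a 3D-aware ViT, a CLIP-style text Transformer, and an action Transformer.

```python
import numpy as np

EMBED_DIM = 768  # shared embedding size used by CLAMP

def l2_normalize(x, eps=1e-8):
    """Project each row onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)

# Placeholder linear encoders (illustrative feature dims: 1024/512/256)
W_img = rng.standard_normal((1024, EMBED_DIM)) * 0.02
W_txt = rng.standard_normal((512, EMBED_DIM)) * 0.02
W_act = rng.standard_normal((256, EMBED_DIM)) * 0.02

f = lambda I: l2_normalize(I @ W_img)  # image branch
g = lambda T: l2_normalize(T @ W_txt)  # text branch
h = lambda A: l2_normalize(A @ W_act)  # action branch

# A batch of 4 triplets with toy per-modality features
x = f(rng.standard_normal((4, 1024)))
y = g(rng.standard_normal((4, 512)))
z = h(rng.standard_normal((4, 256)))

# All three modalities land in the same 768-D unit-norm space,
# so cosine similarity reduces to a dot product.
assert x.shape == y.shape == z.shape == (4, EMBED_DIM)
```

Because every embedding is unit-norm, the contrastive similarities in the loss below are plain inner products.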

2. Multi-View 3D Re-Rendering and Observation Encoding

The multi-view re-rendering module creates rich 3D composite observations suitable for contrastive pretraining:

  • For each expert episode, RGB-D images with intrinsic and extrinsic calibration are collected from physical (or simulated) cameras.
  • All images are back-projected to a workspace-cropped point cloud. Points from all cameras are merged, then voxelized for efficiency (voxel size 0.001 m, cap at 0.3 million points).
  • Five virtual cameras are synthesized: three fixed (overhead, front-left, back-right) and two dynamic wrist-mounted (at each gripper pose with a small offset).
  • For each virtual view and each projected pixel $u$, four channels are output: $D(u)$ (camera-frame depth) and $X(u), Y(u), Z(u)$ (global coordinates of the observed point).
  • Views are horizontally stacked to a tensor of shape $224 \times 1120 \times 4$ per timestep.
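The back-projection and voxelization steps above can be sketched as follows. The helper names, toy intrinsics, and identity camera pose are assumptions for illustration; the voxel size (1 mm) and point cap (0.3 M) follow the paper.

```python
import numpy as np

def backproject(depth, K, T_world_cam):
    """Lift a depth map to world-frame 3D points using intrinsics K
    and the 4x4 camera-to-world transform T_world_cam."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # camera-frame rays
    pts_cam = rays * depth.reshape(-1, 1)    # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ T_world_cam.T)[:, :3]    # world-frame XYZ

def voxel_downsample(points, voxel=0.001, max_points=300_000):
    """Keep one point per occupied voxel (1 mm voxels, 0.3M-point cap)."""
    idx = np.floor(points / voxel).astype(np.int64)
    _, keep = np.unique(idx, axis=0, return_index=True)
    return points[np.sort(keep)][:max_points]

# Toy example: one 4x4 depth image from an identity-pose camera
K = np.array([[100., 0., 2.], [0., 100., 2.], [0., 0., 1.]])
depth = np.full((4, 4), 0.5)
cloud = voxel_downsample(backproject(depth, K, np.eye(4)))
print(cloud.shape)  # (16, 3): every pixel lands in its own 1 mm voxel
```

In the full pipeline this merge runs over all physical cameras before the five virtual views are rendered from the fused cloud.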

This process exposes the encoder to consistent geometric correspondences across viewpoints, with wrist views providing target-aligned perspectives that are highly beneficial for high-precision manipulation (Liu et al., 31 Jan 2026).

3. Multi-Modal Contrastive Pretraining Objective

CLAMP trains modality encoders using a SigLIP-style, pairwise contrastive objective, optimized jointly across all three modalities:

  • For modalities image ($x_i = f(I_i)$), action ($z_i = h(A_i)$), and text ($y_i = g(T_i)$), with batch size $B$:
  • For each positive pair (e.g., $(x_i, z_i)$), the objective encourages similarity; all other pairs $(x_i, z_k)$, $k \neq i$, are treated as negatives.

The loss for image–action is:

$$L_{\text{ImageAction}} = -\frac{1}{B}\sum_{i=1}^B \sum_{k=1}^B \log \sigma\left[\ell_{ik}\left(t\, x_i^\top z_k - b\right)\right]$$

with $\sigma(s) = 1/(1+e^{-s})$, $\ell_{ik} = +1$ if $i = k$ and $-1$ otherwise; $t$ and $b$ are trainable scale and bias parameters. The overall loss is the average of the image–text, image–action, and text–action losses.
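A minimal NumPy sketch of the pairwise sigmoid loss above, averaged over the three modality pairs. Here $t$ and $b$ are fixed to illustrative constants rather than learned, and the batch is synthetic.

```python
import numpy as np

def siglip_pair_loss(x, z, t=10.0, b=0.0):
    """Pairwise sigmoid loss between two batches of L2-normalized
    embeddings: labels are +1 on the diagonal (matched pairs) and
    -1 everywhere else."""
    B = x.shape[0]
    logits = t * (x @ z.T) - b
    labels = 2.0 * np.eye(B) - 1.0        # +1 if i == k else -1
    # -log sigma(l * s) == log(1 + exp(-l * s))
    return np.sum(np.log1p(np.exp(-labels * logits))) / B

def clamp_loss(x, y, z, t=10.0, b=0.0):
    """Average of image-text, image-action, and text-action terms."""
    return (siglip_pair_loss(x, y, t, b) +
            siglip_pair_loss(x, z, t, b) +
            siglip_pair_loss(y, z, t, b)) / 3.0

rng = np.random.default_rng(0)
e = rng.standard_normal((8, 768))
e /= np.linalg.norm(e, axis=-1, keepdims=True)

aligned = clamp_loss(e, e, e)                # matched triplets
shuffled = clamp_loss(e, e[::-1].copy(), e)  # mismatched image-text pairs
assert aligned < shuffled                    # alignment lowers the loss
```

Unlike a softmax (InfoNCE) objective, each pair contributes an independent binary term, which is what allows the very large batches cited below.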

Three major architectural choices underpin this stage:

  • Image encoder: ViT-B/16 with 3D-aware positional encoding (STRING RPE).
  • Text encoder: CLIP-style Transformer.
  • Action encoder: Transformer over joint-space action histories.

Contrastive pre-training is conducted at scale (batch size 2048, 100K steps, 64 TPUv4 chips), enabling the embedding space to generalize across a diverse set of trajectories and workspace configurations (Liu et al., 31 Jan 2026).

4. Diffusion Policy Pretraining and Integration

In contrast to methods that isolate representation and policy learning, CLAMP pre-trains a policy via a diffusion model (DDPM) in parallel with encoder contrastive training. The policy network conditions on:

  • Frozen outputs of the learned CLAMP encoders for images and actions.
  • ResNet-50 feature maps for each camera.
  • Proprioceptive features.

The denoising score matching objective is:

$$L_{DDPM} = \mathbb{E}_{k,\, \epsilon^k \sim \mathcal{N}(0, I)} \left\| \epsilon^k - \epsilon_\theta(O_t, a^k, k) \right\|^2$$

The DDPM outputs the next horizon of $H = 50$ joint targets. During pre-training, the policy is exposed to the full diversity of the demonstration buffer, resulting in an initialization that demonstrates strong policy transfer and rapid adaptation during fine-tuning (Liu et al., 31 Jan 2026).
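The objective can be illustrated with a toy denoiser. The beta schedule, diffusion step count, and action dimension below are assumptions for the sketch; the real $\epsilon_\theta$ is a network conditioned on the frozen CLAMP embeddings, ResNet-50 features, and proprioception.

```python
import numpy as np

rng = np.random.default_rng(0)
K_STEPS = 100                  # number of diffusion steps (assumed)
H, ACT_DIM = 50, 14            # horizon of 50 joint targets; dim assumed

# A simple linear beta schedule (the paper's exact schedule is not given)
betas = np.linspace(1e-4, 0.02, K_STEPS)
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(obs, noisy_actions, k):
    """Placeholder denoiser that always predicts zero noise."""
    return np.zeros_like(noisy_actions)

def ddpm_loss(obs, actions):
    """One Monte Carlo sample of the L_DDPM objective."""
    k = rng.integers(K_STEPS)                 # random noise level
    eps = rng.standard_normal(actions.shape)  # eps^k ~ N(0, I)
    noisy = (np.sqrt(alpha_bar[k]) * actions
             + np.sqrt(1.0 - alpha_bar[k]) * eps)
    return np.mean((eps - eps_theta(obs, noisy, k)) ** 2)

# For the zero predictor the loss is roughly E[eps^2] = 1
loss = ddpm_loss(obs=None, actions=np.zeros((H, ACT_DIM)))
```

At inference the trained denoiser is applied iteratively, turning Gaussian noise into the next 50-step joint-target chunk.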

5. Fine-Tuning and Downstream Robotic Manipulation

Following pre-training, the policy—comprising the ResNets, modality encoders, and diffusion Transformer—is fine-tuned on a limited number of environment/task demonstrations. The modality-encoder weights are frozen to preserve the invariances learned during contrastive pre-training; fine-tuning adjusts the policy Transformer for specific tasks via the same $L_{DDPM}$ objective.
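The freeze/fine-tune split can be illustrated with a toy parameter store; the module names and plain SGD step below are illustrative, not the paper's optimizer.

```python
import numpy as np

# Toy parameter store: CLAMP modality encoders stay frozen,
# the rest of the policy receives gradient updates.
params = {
    "clamp_image_encoder": np.ones(4),
    "clamp_action_encoder": np.ones(4),
    "diffusion_transformer": np.ones(4),
    "resnet50": np.ones(4),
}
frozen = {"clamp_image_encoder", "clamp_action_encoder"}

def sgd_step(params, grads, lr=0.1):
    """Apply gradients only to trainable (non-frozen) modules."""
    for name, g in grads.items():
        if name not in frozen:
            params[name] = params[name] - lr * g

grads = {name: np.ones(4) for name in params}
sgd_step(params, grads)
print(params["clamp_image_encoder"][0], params["diffusion_transformer"][0])
# prints "1.0 0.9": frozen weight unchanged, trainable weight updated
```

The ablations below show why this matters: letting gradients flow into the encoders erodes the contrastively learned invariances.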

Downstream tasks in both simulation and the real world include:

  • Six simulated manipulation tasks (e.g., Can Opener in Caddy, Drawer Open) with two RGB-D camera views plus proprioception.
  • Five real-world tasks (e.g., Open Drawer, Plate on Rack, Recycle Cans) using four RGB-D cameras (two static, two wrist-mounted).

Evaluation involves repeated trial-based success measurement at each policy checkpoint. Adding CLAMP yields substantial increases in full-task success rates over ablations and prior baselines, in both simulation and the real world (Liu et al., 31 Jan 2026).

Method             Sim Mean Success (%)   Real (full successes / 10)
ALOHA Unleashed    54.0–80.7              0–8
+ CLAMP            92.7–98.0              5–8

6. Architectural Ablations and Analysis

Systematic ablation studies in CLAMP elucidate the contribution of each component:

  • Wrist views: Omission reduces success by up to 28.6% for Mug-on-Plate.
  • Text encoder removal: Minor performance drop, indicating robustness.
  • ViT vs. point cloud encoder (DP3): ViT-B/16 outperforms DP3 by 7–15%, suggesting that 3D-aware ViTs have preferable properties for multi-view alignment.
  • Encoder pretraining: Training from scratch degrades performance 5–10%.
  • Unfreezing encoders during fine-tuning: Marked drop in success (up to 23.3%), confirming the necessity of encoder invariance retention.
  • 16-view dome (oversubscribed view grid): Significantly reduced performance versus the standard five-view configuration (–42% for Can Opener).

Ablation results demonstrate that CLAMP’s performance gains arise from the synergy of tuned multi-view construction, joint cross-modal contrast, and tight integration with diffusion policy pretraining (Liu et al., 31 Jan 2026).

7. Comparative Context and Extensions

Earlier approaches such as CLfD (Correia et al., 2022) and CMC employed multi-view visual contrastive alignment for viewpoint invariance and RL reward shaping, but lacked integration of 3D geometric signals, explicit action-conditioning, and diffusion policy coupling. CLAMP advances the paradigm by:

  • Introducing explicit 3D geometry to the representation (multi-channel global maps).
  • Conditioning the contrastive loss jointly on action and text modalities.
  • Employing dynamic, task-centric wrist-mounted and virtual cameras.

Future directions identified include more tightly integrating action trajectories, extending 3D fusion (e.g., unifying point clouds at inference), or temporally structuring positive pairs for encoding dynamics. This suggests a converging trend toward unified multi-modal, geometry-aware, and action-conditioned pretraining as the backbone for increasingly capable robotic manipulation systems (Liu et al., 31 Jan 2026).