
AnchorDP3: Dual-Arm Robotic Manipulation

Updated 18 March 2026
  • AnchorDP3 is a dual-arm robotic manipulation framework that leverages simulator-supervised affordance segmentation, task-conditioned lightweight encoders, and a sparse keypose diffusion policy to achieve state-of-the-art generalization.
  • It employs a streamlined perception pipeline and keypose planning to reduce the action prediction space, achieving success rates up to 99% in procedurally randomized 3D environments.
  • Empirical benchmarks and ablation studies validate its innovations, demonstrating average success rates of 98.7% and highlighting the benefits of modular task isolation and simulator-guided supervision.

AnchorDP3 is a dual-arm robotic manipulation policy framework that achieves high task and domain generalization through a combination of simulator-supervised affordance segmentation, task-conditioned lightweight encoders, and a sparse, keypose-based diffusion policy for sequential action prediction. Its design is specialized for procedurally randomized 3D environments and multi-task bimanual settings, as demonstrated in the RoboTwin Dual-Arm Collaboration Challenge, where it sets state-of-the-art performance benchmarks in robust, high-speed dual-arm manipulation without human demonstration data (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).

1. Core Architecture and Innovations

AnchorDP3 comprises three technical innovations, each directly contributing to generalization and sample efficiency in complex manipulation scenarios:

  1. Simulator-Supervised Semantic Segmentation: AnchorDP3 exploits the full scene knowledge available in simulation to explicitly segment task-critical objects within point clouds. Simulator-rendered ground-truth depth maps are differenced to yield pixel-accurate binary affordance masks, which are projected onto the 3D point cloud. These masks supervise a lightweight PointNet++-style segmentation network, whose per-point binary affordance prediction is appended to each point's features.
  2. Task-Conditioned Feature Encoders: For each manipulation task, a lightweight encoder (~0.28M parameters) processes the augmented point cloud into a compact fixed-length embedding. Task selection is performed by a language classifier mapping the current instruction to logits $l \in \mathbb{R}^8$, selecting one of eight encoders $E_k$. Each encoder applies per-point MLPs, max-pooling, and further projection to a 192-dimensional task embedding $f_k$. No feature or parameter sharing occurs across tasks at this stage, thus isolating task representations and mitigating negative multi-task transfer.
  3. Affordance-Anchored Keypose Diffusion Policy: Instead of regressing dense action streams, AnchorDP3 predicts a sparse sequence of geometrically meaningful action anchors, or "keyposes," corresponding to manipulation inflection points (pre-grasp, grasp open, grasp closed, pre-place, etc.). A conditional 1D U-Net diffusion model, modulated by the task embedding via FiLM, predicts $H=8$ future anchors $a_{A_{k+i}} \in \mathbb{R}^{32}$, each encoding joint positions, gripper states, and end-effector poses. Only the first anchor in the sequence is executed; the remaining $H-1$ provide supervision during training.
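The routing and encoding described in item 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the input width of 12 (11 point features plus one affordance label), the hidden widths, and the function names `init_encoder`, `encode`, and `route_and_encode` are all assumptions, and the toy layer sizes do not reproduce the ~0.28M parameter count.

```python
import numpy as np

# IN_DIM is an assumption: 11 point features plus 1 appended affordance label.
NUM_TASKS, IN_DIM, EMB_DIM = 8, 12, 192

def init_encoder(rng, in_dim=IN_DIM, hidden=(64, 128), emb_dim=EMB_DIM):
    """Random weights for one task-specific encoder (hypothetical layer widths)."""
    dims = (in_dim, *hidden)
    mlp = [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
           for a, b in zip(dims[:-1], dims[1:])]
    proj = (rng.standard_normal((dims[-1], emb_dim)) * 0.1, np.zeros(emb_dim))
    return {"mlp": mlp, "proj": proj}

def encode(points, enc):
    """Per-point MLP, then max-pool over points, then a linear projection."""
    h = points
    for w, b in enc["mlp"]:
        h = np.maximum(h @ w + b, 0.0)   # same weights applied to every point
    pooled = h.max(axis=0)               # order-invariant global feature
    w, b = enc["proj"]
    return pooled @ w + b                # task embedding f_k

def route_and_encode(task_logits, points, encoders):
    """Hard selection of one encoder E_k from the classifier logits."""
    k = int(np.argmax(task_logits))
    return k, encode(points, encoders[k])

rng = np.random.default_rng(0)
encoders = [init_encoder(rng) for _ in range(NUM_TASKS)]  # no weights shared across tasks
cloud = rng.standard_normal((4096, IN_DIM))               # stand-in for the augmented cloud
logits = np.array([0.1, 0.2, 3.0, -1.0, 0.0, 0.5, 0.3, -0.2])
k, f_k = route_and_encode(logits, cloud, encoders)
```

Because each task owns a separate weight set and only the selected encoder runs, gradients from one task never touch another task's encoder, which is the isolation property the text describes.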

2. Perception and Observation Pipeline

The perception stage ingests four-view RGB-D images, projects these into a 3D point cloud with geometric feature augmentation (normals, curvature), and employs Farthest Point Sampling (FPS) to reduce input dimensionality to $N=4096$ points. The segmentation network (PointNet++ with shared MLP widths $[11 \rightarrow 64 \rightarrow 128 \rightarrow 256]$) consumes the sampled cloud and outputs affordance masks via per-point binary classification. Ground-truth object IDs are used to render occluded voxels, and the mask is computed as $\Delta D(u,v) = D_\text{full}(u,v) - D_\text{occluded}(u,v)$, thresholded for binary labeling.
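The two geometric operations in this paragraph, depth differencing and FPS, are simple enough to sketch directly. The example below is illustrative only: it assumes $D_\text{occluded}$ is a render with the target object hidden, thresholds the absolute difference (the source does not state the sign convention or the threshold), and uses a hypothetical `tau` of 1 mm.

```python
import numpy as np

def affordance_mask(d_full, d_occluded, tau=1e-3):
    """Binary mask where the two depth renders differ by more than tau (metres, assumed)."""
    return np.abs(d_full - d_occluded) > tau

def farthest_point_sampling(xyz, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the already chosen set."""
    chosen = np.empty(n_samples, dtype=np.int64)
    chosen[0] = start
    # dist[j] holds the distance from point j to its nearest chosen point.
    dist = np.linalg.norm(xyz - xyz[start], axis=1)
    for i in range(1, n_samples):
        chosen[i] = int(np.argmax(dist))
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[chosen[i]], axis=1))
    return chosen

# Toy 4x4 depth images: a 2x2 object sits 0.3 m in front of a 1.0 m background.
d_occluded = np.full((4, 4), 1.0)   # render with the target object hidden
d_full = d_occluded.copy()
d_full[1:3, 1:3] = 0.7              # render with the object present
mask = affordance_mask(d_full, d_occluded)

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(500, 3))
idx = farthest_point_sampling(cloud, 64)   # indices of the retained points
```

In the pipeline described above the same sampling would reduce the fused four-view cloud to 4096 points; the 500-to-64 reduction here only keeps the toy example fast.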

Task-conditioning proceeds by passing the affordance-enhanced cloud to the correctly indexed encoder $E_k$, producing embedding $f_k$ that parameterizes the keypose diffusion model.

3. Keypose Diffusion Policy Learning

The action prediction head is a conditional 1D U-Net, taking context features $f_k$, timestep $t$, and noised anchors $x_t$ as input. Diffusion operates over the action anchors with a forward process $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\,x_{t-1}, (1-\alpha_t)I)$, and the reverse process is parameterized as $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$. The model learns to estimate the Gaussian noise $\epsilon$ using the standard denoising loss:

$$L_\text{diff} = \mathbb{E}_{x_0, \epsilon, t} \, \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \; t \mid f_k \right) \right\|^2$$
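A single-sample estimate of this loss can be written directly from the closed-form noising step, using $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$. The sketch below is not the paper's training code: the linear beta schedule, the 100-step horizon, and the stand-in noise predictors (`zero_model`, `oracle`) are assumptions chosen to make the example self-checking; a real model would be the FiLM-conditioned U-Net.

```python
import numpy as np

def make_alpha_bar(num_steps=100, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption); alpha_bar_t is the running product of alpha_s."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, eps, alpha_bar):
    """Closed-form forward noising: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def diffusion_loss(eps_model, x0, f_k, alpha_bar, rng):
    """One Monte Carlo sample of L_diff: squared norm of the noise-prediction error."""
    t = int(rng.integers(len(alpha_bar)))     # uniform timestep
    eps = rng.standard_normal(x0.shape)       # target noise
    x_t = q_sample(x0, t, eps, alpha_bar)
    return float(np.sum((eps - eps_model(x_t, t, f_k)) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((8, 32))             # H=8 anchors, 32 dims each
f_k = np.zeros(192)                           # placeholder task embedding

def zero_model(x_t, t, f_k):
    """Untrained stand-in that always predicts zero noise."""
    return np.zeros_like(x_t)

def oracle(x_t, t, f_k):
    """Cheating stand-in that knows x0 and inverts the noising step exactly."""
    return (x_t - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

loss_zero = diffusion_loss(zero_model, x0, f_k, alpha_bar, rng)
loss_oracle = diffusion_loss(oracle, x0, f_k, alpha_bar, rng)
```

The oracle drives the loss to numerical zero, which confirms that the target of the regression is exactly the noise injected by `q_sample`.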

Full-state supervision is enforced by an additional loss over all predicted anchors:

$$L_\text{state} = \frac{1}{H} \sum_{i=1}^{H} \left\| a_{A_{k+i}}^\text{pred} - a_{A_{k+i}}^\text{GT} \right\|^2$$

The total training objective is $L = L_\text{seg} + L_\text{diff} + \lambda_\text{state} L_\text{state}$ with $\lambda_\text{state}=1$, where $L_\text{seg}$ is the segmentation network's per-point classification loss.
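As a worked check of the state loss and the weighted sum, the following sketch evaluates both on a toy prediction. The numeric values for the segmentation and diffusion terms are placeholders, and the function names `state_loss` and `total_loss` are hypothetical.

```python
import numpy as np

def state_loss(pred, gt):
    """L_state: squared L2 error per anchor, averaged over the H anchors."""
    return float(np.mean(np.sum((pred - gt) ** 2, axis=-1)))

def total_loss(l_seg, l_diff, l_state, lambda_state=1.0):
    """L = L_seg + L_diff + lambda_state * L_state."""
    return l_seg + l_diff + lambda_state * l_state

H, D = 8, 32
gt = np.zeros((H, D))
pred = np.full((H, D), 0.5)        # every coordinate is off by 0.5
l_state = state_loss(pred, gt)     # each anchor contributes 32 * 0.25
total = total_loss(l_seg=0.25, l_diff=1.5, l_state=l_state)  # placeholder l_seg, l_diff
```

With $\lambda_\text{state}=1$ the three terms are simply added, so their relative scale (here the state term dominates) determines which one drives the gradient.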

By planning and executing only at sparse, task-relevant keyposes, this approach drastically reduces the prediction and search space compared to dense 20–25Hz action streams, and aligns with human-inspired planning at motion inflection points (Zhao et al., 24 Jun 2025).
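The resulting control loop is a receding-horizon scheme over keyposes: predict several anchors, execute the first, observe, and predict again. The sketch below shows only that loop structure; `predict_anchors`, `move_to`, and `is_done` are hypothetical callables standing in for the diffusion policy, the low-level controller, and a task-success check, and the one-dimensional "robot" is a toy.

```python
from typing import Callable, List, Sequence

Anchor = List[float]

def run_keypose_episode(
    predict_anchors: Callable[[Sequence[float]], List[Anchor]],
    move_to: Callable[[Anchor], Sequence[float]],
    is_done: Callable[[Sequence[float]], bool],
    obs: Sequence[float],
    max_keyposes: int = 20,
) -> int:
    """Re-plan at every keypose and execute only the first predicted anchor."""
    executed = 0
    while executed < max_keyposes and not is_done(obs):
        anchors = predict_anchors(obs)   # H future anchors from the policy
        obs = move_to(anchors[0])        # controller reaches the anchor, returns new observation
        executed += 1
    return executed

# Toy stand-ins: a 1-D state that must reach GOAL in unit-sized keypose steps.
GOAL = 3.0

def toy_policy(obs):
    return [[obs[0] + 1.0 + i] for i in range(8)]   # 8 anchors ahead

def toy_controller(anchor):
    return list(anchor)                             # perfect tracking

def toy_done(obs):
    return obs[0] >= GOAL

steps = run_keypose_episode(toy_policy, toy_controller, toy_done, obs=[0.0])
```

For a sense of scale, a hypothetical 10-second motion controlled at 20 Hz would require 200 dense action predictions, whereas the loop above makes one policy call per keypose.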

4. Training Regime and Dataset Generation

AnchorDP3 is trained end-to-end on 1 million procedurally generated manipulation episodes. The dataset generation uses two phases:

  • Rollout Phase: The expert oracle is run to generate ground-truth keypose time indices.
  • Render Phase: For each anchor interval, two random views and future anchor labels are rendered.

The procedure exploits extreme domain randomization (objects, lighting, camera poses, table height, and backgrounds) for robustness. DAgger-style off-trajectory samples (10%) are interleaved to foster recovery behaviors. Each trajectory is represented by only a small number of keyframes, boosting diversity per storage unit by 14.5× relative to dense sampling.
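One simple way to realize the 10% interleaving is to replace a fixed fraction of expert states with perturbed copies while keeping the expert's anchor labels. The source does not say how the off-trajectory states are produced, so the uniform jitter, its scale, and the names `mix_off_trajectory` and `jitter` below are assumptions made for illustration.

```python
import random

def mix_off_trajectory(states, perturb, off_fraction=0.10, seed=0):
    """Replace a fixed fraction of expert states with perturbed copies.

    Returns (state, is_off_trajectory) pairs. Labels are assumed to still
    come from the expert, as in DAgger-style relabeling.
    """
    rng = random.Random(seed)
    n_off = round(off_fraction * len(states))
    off_idx = set(rng.sample(range(len(states)), n_off))
    mixed = []
    for i, state in enumerate(states):
        if i in off_idx:
            mixed.append((perturb(state, rng), True))
        else:
            mixed.append((state, False))
    return mixed

def jitter(state, rng, scale=0.02):
    """Hypothetical perturbation: small uniform noise on every coordinate."""
    return [x + rng.uniform(-scale, scale) for x in state]

expert_states = [[0.01 * i, 0.0, 0.3] for i in range(200)]
dataset = mix_off_trajectory(expert_states, jitter)
n_off = sum(flag for _, flag in dataset)   # number of off-trajectory samples
```

Sampling a fixed count rather than flipping a coin per state keeps the off-trajectory share exactly at the requested fraction for every batch of episodes.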

For sim-to-real deployment, the RoboTwin real-to-sim pipeline calibrates camera parameters and randomizes simulation textures, and fine-tuning on small real-world datasets (<5k frames) is reported to be sufficient for real-world transfer (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).

5. Empirical Results and Ablations

AnchorDP3 shows state-of-the-art performance in both the RoboTwin Challenge and controlled benchmarks:

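| Task (success rate, %) | AnchorDP3 | DP3 | equivariantDP | RT-1 |
|---|---|---|---|---|
| pick_place_block | 99.2 | 94.5 | 91.7 | 88.3 |
| place_mouse | 98.9 | 93.4 | 90.1 | 85.2 |
| place_stapler | 98.3 | 92.1 | 89.5 | 83.7 |
| place_bell | 99.1 | 93.8 | 90.8 | 86.0 |
| tool_use | 97.8 | 91.2 | 88.0 | 82.9 |
| assembly | 98.5 | 92.9 | 89.9 | 84.1 |
| average | 98.7 | 92.1 | 89.3 | 85.0 |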

Ablation studies confirm the contribution of each architectural choice. Removal of segmentation reduces average success to 94.5%, absence of task conditioners drops it to 96.2%, and replacing sparse keypose diffusion with dense prediction degrades performance to 93.5%. The design, combining affordance supervision, isolated task encoders, and keypose diffusion, is empirically substantiated as critical for robust policy learning in highly randomized environments (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).
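The size of each ablation effect follows from simple subtraction against the full model's 98.7% average. The short script below does that arithmetic on the figures quoted above; the dictionary keys are descriptive labels chosen here, not names from the source.

```python
FULL = 98.7  # average success (%) of the full model, as reported above

ablations = {
    "no segmentation": 94.5,
    "no task-conditioned encoders": 96.2,
    "dense prediction instead of keyposes": 93.5,
}

# Drop in percentage points relative to the full model, rounded to one decimal.
drops = {name: round(FULL - score, 1) for name, score in ablations.items()}
largest = max(drops, key=drops.get)

for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"{name}: -{drop} points")
```

On these numbers, replacing keypose diffusion with dense prediction costs the most (5.2 points), followed by removing segmentation (4.2) and removing the task-conditioned encoders (2.5).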

6. Application Scope, Limitations, and Future Directions

AnchorDP3 is validated across 12 dual-arm simulation tasks ranging from rigid-body placement and stacking to visuo-tactile deformable sorting. Success rates average 98.7% under heavy visual and physical randomization. It achieves robust dual-arm coordination by synchronizing both arms at each keypose, with subsequent keyposes re-predicted as needed for drift correction.

Notable limitations include reliance on simulator-derived affordance masks (necessitating future work in real-world affordance prediction), lack of tactile feedback integration at inference, and reduced applicability in highly dynamic (non-quasi-static) tasks such as cloth folding. Future avenues include real-time affordance learning, end-to-end tactile-to-action policy optimization, hierarchical chaining of keypose planners for long-horizon tasks, and integration with large-scale vision-language-action models for zero-shot generalization from language instructions (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).

7. Significance in Robotic Manipulation Research

AnchorDP3 demonstrates that sparse, affordance-anchored action planning, combined with diffusion modeling and modular task conditioning, enables high performance in generalizable robotic policies for dual-arm systems. Its principled use of simulator supervision for segmentation, architectural isolation of task features, and focus on key geometric action anchors together represent a validated paradigm for scaling robotic manipulation to diverse, cluttered, and unstructured environments without reliance on dense human demonstration data (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).
