AnchorDP3: Dual-Arm Robotic Manipulation
- AnchorDP3 is a dual-arm robotic manipulation framework that leverages simulator-supervised affordance segmentation, task-conditioned lightweight encoders, and a sparse keypose diffusion policy to achieve state-of-the-art generalization.
- It employs a streamlined perception pipeline and keypose planning to reduce the action prediction space, achieving success rates up to 99% in procedurally randomized 3D environments.
- Empirical benchmarks and ablation studies validate its innovations, demonstrating average success rates of 98.7% and highlighting the benefits of modular task isolation and simulator-guided supervision.
AnchorDP3 is a dual-arm robotic manipulation policy framework that achieves high task and domain generalization through a combination of simulator-supervised affordance segmentation, task-conditioned lightweight encoders, and a sparse, keypose-based diffusion policy for sequential action prediction. Its design is specialized for procedurally randomized 3D environments and multi-task bimanual settings, as demonstrated in the RoboTwin Dual-Arm Collaboration Challenge, where it sets state-of-the-art performance benchmarks in robust, high-speed dual-arm manipulation without human demonstration data (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).
1. Core Architecture and Innovations
AnchorDP3 comprises three technical innovations, each directly contributing to generalization and sample efficiency in complex manipulation scenarios:
- Simulator-Supervised Semantic Segmentation: AnchorDP3 leverages the full knowledge of scene layouts provided by simulation to explicitly segment task-critical objects within point clouds. Simulator-rendered ground-truth depth maps are differenced to yield pixel-accurate binary affordance masks, which are projected onto the 3D point cloud. These masks are input to a lightweight PointNet++-style segmentation network, producing a per-point binary affordance label appended to each point's features.
- Task-Conditioned Feature Encoders: For each manipulation task, a lightweight encoder (~0.28M parameters) processes the augmented point cloud into a compact fixed-length embedding. Task selection is performed by a language classifier that maps the current instruction to task logits, which select one of eight encoders. Each encoder applies per-point MLPs, max-pooling, and a further projection to a 192-dimensional task embedding. No features or parameters are shared across tasks at this stage, which isolates task representations and mitigates negative multi-task transfer.
- Affordance-Anchored Keypose Diffusion Policy: Instead of regressing dense action streams, AnchorDP3 predicts a sparse sequence of geometrically meaningful action anchors, or "keyposes," corresponding to manipulation inflection points (pre-grasp, grasp open, grasp closed, pre-place, etc.). A conditional 1D U-Net diffusion model, modulated by the task embedding via FiLM, predicts a short sequence of future anchors, each encoding joint positions, gripper states, and end-effector poses. Only the first anchor in the sequence is executed; the remaining anchors provide supervision during training.
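The depth-differencing step described in the first item above can be sketched roughly as follows. This is a minimal illustration, assuming the simulator can render the same view with and without the target object; the function names `affordance_mask_from_depth` and `backproject`, the threshold `eps`, and the pinhole intrinsics are hypothetical and not taken from the paper.

```python
import numpy as np

def affordance_mask_from_depth(depth_full, depth_without_obj, eps=1e-3):
    """Binary mask of pixels whose depth changes when the object is removed.

    Hypothetical helper: eps is an assumed threshold in metres.
    """
    return np.abs(depth_without_obj - depth_full) > eps

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth image (H, W) to camera-frame points (H*W, 3), pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Toy scene: flat background at 1.0 m with a 2x2 object at 0.8 m.
depth_bg = np.full((4, 4), 1.0)
depth_full = depth_bg.copy()
depth_full[1:3, 1:3] = 0.8

mask = affordance_mask_from_depth(depth_full, depth_bg)
points = backproject(depth_full, fx=2.0, fy=2.0, cx=1.5, cy=1.5)

# Append the per-pixel label as an extra feature channel on each 3D point.
labelled = np.concatenate(
    [points, mask.reshape(-1, 1).astype(np.float64)], axis=1
)
```

In this sketch the mask is only a supervision target; at inference the label would come from the learned segmentation network rather than from simulator renders.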
2. Perception and Observation Pipeline
The perception stage ingests four-view RGB-D images, projects them into a 3D point cloud with geometric feature augmentation (normals, curvature), and applies Farthest Point Sampling (FPS) to reduce the cloud to a fixed number of points. The segmentation network (PointNet++ with shared per-point MLPs) consumes the sampled cloud and outputs affordance masks via per-point binary classification. Ground-truth object IDs are used to render occluded voxels, and the supervision mask is computed from the difference of simulator-rendered depth maps, thresholded for binary labeling.
Task conditioning proceeds by passing the affordance-enhanced cloud to the encoder selected for the current task, producing the embedding that conditions the keypose diffusion model.
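Farthest Point Sampling itself is a standard greedy procedure, and a minimal NumPy sketch is given here for orientation. The cloud size (2048) and sample count (256) are placeholder values chosen for the example, not figures from the paper.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = np.empty(n_samples, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, n_samples):
        chosen[i] = int(np.argmax(min_dist))
        d = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen

rng = np.random.default_rng(0)
cloud = rng.normal(size=(2048, 3))      # placeholder cloud
idx = farthest_point_sampling(cloud, n_samples=256)
sampled = cloud[idx]                    # (256, 3) subsampled cloud
```

This version costs O(N * n_samples) distance evaluations, which is usually acceptable for clouds of a few thousand points.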
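The routing and encoding steps can be illustrated with a small NumPy sketch. The eight encoders and the 192-dimensional output follow the text; the input feature width, hidden width, random initialization, and all function names are assumptions made for the example, and the sketch does not reproduce the reported parameter count.

```python
import numpy as np

# 8 encoders and a 192-d embedding follow the text; IN_DIM and HIDDEN are assumed.
NUM_TASKS, IN_DIM, HIDDEN, EMBED_DIM = 8, 4, 64, 192

def init_encoder(rng):
    """Randomly initialised weights for one task-specific encoder."""
    return {
        "w1": rng.normal(scale=0.1, size=(IN_DIM, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "w2": rng.normal(scale=0.1, size=(HIDDEN, EMBED_DIM)),
        "b2": np.zeros(EMBED_DIM),
    }

def encode(params, cloud):
    """Per-point MLP -> max-pool over points -> linear projection."""
    h = np.maximum(cloud @ params["w1"] + params["b1"], 0.0)  # (N, HIDDEN)
    pooled = h.max(axis=0)                                    # (HIDDEN,)
    return pooled @ params["w2"] + params["b2"]               # (EMBED_DIM,)

rng = np.random.default_rng(0)
# One independent parameter set per task: nothing is shared across tasks.
encoders = [init_encoder(rng) for _ in range(NUM_TASKS)]

def task_embedding(task_logits, cloud):
    """Route the cloud to the encoder with the highest task logit."""
    k = int(np.argmax(task_logits))
    return k, encode(encoders[k], cloud)

cloud = rng.normal(size=(256, IN_DIM))  # e.g. xyz + affordance label
logits = np.array([0.1, 2.3, -1.0, 0.0, 0.5, 0.2, -0.3, 1.1])
k, z = task_embedding(logits, cloud)
```

Because the pooling is a maximum over points, the embedding does not depend on point order, which is the usual reason for this PointNet-style design.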
3. Keypose Diffusion Policy Learning
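The action prediction head is a conditional 1D U-Net that takes context features $z$, diffusion timestep $t$, and noised anchors $a_t$ as input. Diffusion operates over the clean action anchors $a_0$ with the standard Gaussian forward process $q(a_t \mid a_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\,a_0,\ (1-\bar\alpha_t) I\big)$, and the reverse process is parameterized as $p_\theta(a_{t-1} \mid a_t, z)$. The model learns to estimate the Gaussian noise $\epsilon$ using the standard denoising loss:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{a_0, t, \epsilon}\big[\lVert \epsilon - \epsilon_\theta(a_t, t, z) \rVert^2\big]$$

Full-state supervision is enforced by an additional loss $\mathcal{L}_{\text{state}}$ over all predicted anchors.
The total training objective is the weighted sum $\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda\,\mathcal{L}_{\text{state}}$, where $\lambda$ is a scalar weight.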
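A compact NumPy sketch of this kind of objective, in standard DDPM notation (clean anchors `a0`, noised anchors `a_t`, conditioning `z`, noise `eps`), may help make the pieces concrete. The linear noise schedule, the number of diffusion steps, the weight `lam`, and in particular the form of the full-state term (a squared error on the anchors recovered from the predicted noise) are assumptions for illustration only; the source does not specify them.

```python
import numpy as np

T = 100                                   # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)        # linear schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(a0, t, eps):
    """Forward process: a_t = sqrt(abar_t) * a0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * a0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def predict_a0(a_t, t, eps_hat):
    """Invert the forward process using the predicted noise."""
    return (a_t - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])

def training_loss(eps_model, a0, z, t, eps, lam=1.0):
    """Denoising loss plus an ASSUMED full-state term on the recovered anchors."""
    a_t = q_sample(a0, t, eps)
    eps_hat = eps_model(a_t, t, z)
    l_diff = np.mean((eps - eps_hat) ** 2)
    l_state = np.mean((a0 - predict_a0(a_t, t, eps_hat)) ** 2)  # assumed form
    return l_diff + lam * l_state, l_diff, l_state

rng = np.random.default_rng(0)
a0 = rng.normal(size=(4, 16))             # 4 anchors x 16-d state (placeholder sizes)
eps = rng.normal(size=a0.shape)
z = np.zeros(192)                         # task embedding placeholder

def zero_model(a_t, t, z):
    """Stand-in for the conditional U-Net: always predicts zero noise."""
    return np.zeros_like(a_t)

def oracle_model(a_t, t, z):
    """Stand-in that returns the true noise, for a sanity check."""
    return eps

loss_zero, _, _ = training_loss(zero_model, a0, z, t=50, eps=eps)
loss_oracle, _, _ = training_loss(oracle_model, a0, z, t=50, eps=eps)
```

With a perfect noise predictor both terms vanish, so `loss_oracle` is numerically zero while `loss_zero` is strictly positive.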
By planning and executing only at sparse, task-relevant keyposes, this approach drastically reduces the prediction and search space compared to dense 20–25 Hz action streams, and mirrors the way humans plan around motion inflection points (Zhao et al., 24 Jun 2025).
4. Training Regime and Dataset Generation
AnchorDP3 is trained end-to-end on 1 million procedurally generated manipulation episodes. The dataset generation uses two phases:
- Rollout Phase: The expert oracle is run to generate ground-truth keypose time indices.
- Render Phase: For each anchor interval, two random views and future anchor labels are rendered.
The procedure exploits extreme domain randomization (objects, lighting, camera poses, table height, and backgrounds) for robustness. DAgger-style off-trajectory samples (10%) are interleaved to foster recovery behaviors. Each trajectory is represented by only a small number of keyframes, boosting diversity per storage unit by 14.5× relative to dense sampling.
For sim-to-real deployment, the RoboTwin real-to-sim pipeline calibrates camera parameters and randomizes simulation textures; fine-tuning on a small real-world dataset (<5k frames) is then reported to be sufficient for transfer to the physical robot (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).
5. Empirical Results and Ablations
AnchorDP3 reports state-of-the-art performance in both the RoboTwin Challenge and controlled benchmarks. Success rates (%) by task:
| Task | AnchorDP3 | DP3 | equivariantDP | RT-1 |
|---|---|---|---|---|
| pick_place_block | 99.2 | 94.5 | 91.7 | 88.3 |
| place_mouse | 98.9 | 93.4 | 90.1 | 85.2 |
| place_stapler | 98.3 | 92.1 | 89.5 | 83.7 |
| place_bell | 99.1 | 93.8 | 90.8 | 86.0 |
| tool_use | 97.8 | 91.2 | 88.0 | 82.9 |
| assembly | 98.5 | 92.9 | 89.9 | 84.1 |
| average | 98.7 | 92.1 | 89.3 | 85.0 |
Ablation studies confirm the contribution of each architectural choice. Removal of segmentation reduces average success to 94.5%, absence of task conditioners drops it to 96.2%, and replacing sparse keypose diffusion with dense prediction degrades performance to 93.5%. The design, combining affordance supervision, isolated task encoders, and keypose diffusion, is empirically substantiated as critical for robust policy learning in highly randomized environments (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).
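The relative size of each ablation effect follows directly from the figures quoted above. The short calculation below is only arithmetic on those reported numbers, taking the 98.7% average as the full-model reference.

```python
full = 98.7  # reported average success rate of the full model (%)
ablations = {
    "no segmentation": 94.5,
    "no task conditioning": 96.2,
    "dense prediction": 93.5,
}

# Drop in percentage points relative to the full model.
drops = {name: round(full - score, 1) for name, score in ablations.items()}
largest = max(drops, key=drops.get)
```

On these numbers, replacing keypose diffusion with dense prediction costs the most (5.2 points), followed by removing segmentation (4.2 points) and removing task conditioning (2.5 points).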
6. Application Scope, Limitations, and Future Directions
AnchorDP3 is validated across 12 dual-arm simulation tasks ranging from rigid-body placement and stacking to visuo-tactile deformable sorting. Success rates average 98.7% under heavy visual and physical randomization. It achieves robust dual-arm coordination by synchronizing both arms at each keypose, with subsequent keyposes re-predicted as needed for drift correction.
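The execute-first-anchor-then-re-predict pattern can be sketched as a simple control loop. Everything here is a toy stand-in: `run_keypose_loop`, the 1D state, the waypoint predictor, and the fixed drift are hypothetical and only illustrate the control flow, not the actual policy or robot interface.

```python
import numpy as np

def run_keypose_loop(predict_anchors, execute, observe, max_steps=10):
    """Predict a sequence of anchors, execute only the first, then re-plan."""
    executed = []
    for _ in range(max_steps):
        obs = observe()
        anchors = predict_anchors(obs)   # (K, D) remaining keyposes
        if len(anchors) == 0:            # nothing left to do
            break
        execute(anchors[0])              # only the first anchor is executed
        executed.append(anchors[0])
    return executed

# Toy 1D "robot": it reaches each target with a small constant drift.
state = {"pos": 0.0}
waypoints = np.array([1.0, 2.0, 3.0])

def observe():
    return state["pos"]

def predict_anchors(obs):
    # Stand-in policy: keep the waypoints that are still ahead of the robot.
    return waypoints[waypoints > obs + 0.5].reshape(-1, 1)

def execute(anchor):
    state["pos"] = float(anchor[0]) + 0.1  # simulated execution drift

executed = run_keypose_loop(predict_anchors, execute, observe)
```

Because the anchors are re-predicted from a fresh observation after every execution, the drift added in `execute` does not accumulate across keyposes. In a bimanual setting each anchor would hold the targets for both arms so that they are commanded together.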
Notable limitations include reliance on simulator-derived affordance masks (necessitating future work in real-world affordance prediction), lack of tactile feedback integration at inference, and reduced applicability in highly dynamic (non-quasi-static) tasks such as cloth folding. Future avenues include real-time affordance learning, end-to-end tactile-to-action policy optimization, hierarchical chaining of keypose planners for long-horizon tasks, and integration with large-scale vision-language-action models for zero-shot generalization from language instructions (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).
7. Significance in Robotic Manipulation Research
AnchorDP3 demonstrates that sparse, affordance-anchored action planning, combined with diffusion modeling and modular task conditioning, enables high performance in generalizable robotic policies for dual-arm systems. Its principled use of simulator supervision for segmentation, architectural isolation of task features, and focus on key geometric action anchors together represent a validated paradigm for scaling robotic manipulation to diverse, cluttered, and unstructured environments without reliance on dense human demonstration data (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).