- The paper presents a modular system that decouples semantic understanding from action prediction using VLM-generated code to address ambiguous robotic instructions.
- It demonstrates that attention-conditioned diffusion policies achieve over 90% success in multi-object, contact-rich scenarios, far exceeding end-to-end approaches.
- The framework leverages interpretable 3D attention maps from perception APIs, enabling scalable and data-efficient robotic manipulation under uncertainty.
CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity
The CodeDiffuser framework addresses a persistent issue in language-conditioned robotic manipulation: ambiguous or under-specified natural language task instructions. The work presents a modular system that leverages vision-language models (VLMs) to generate interpretable, executable code, which interfaces with perception APIs to produce 3D attention maps. These maps condition a downstream attention-enhanced diffusion policy, enabling it to resolve task ambiguities during robotic manipulation, especially in multi-object, contact-rich, and instruction-ambiguous environments.
Technical Contributions
The central innovation is the decoupling of high-level semantic understanding from low-level action prediction through explicit, code-generated attention. The approach is both modular and interpretable: VLMs translate input instructions and visual observations into code, which, when executed using a suite of perception APIs, outputs 3D representations of task-relevant areas. This separation allows for systematic reasoning about ambiguous directives without the collapse in performance typically observed in fully end-to-end policies as ambiguity increases.
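This separation can be made concrete with a short orchestration sketch. The code below is illustrative only: the interface names (`vlm.generate_code`, `perception_api.get_point_cloud`, `policy.predict`, and the expectation that the generated code leaves an `attention_map` variable in its namespace) are assumptions for exposition, not the paper's actual APIs.

```python
def run_codediffuser_step(instruction, rgbd_obs, vlm, perception_api, policy):
    """One control step: high-level semantics (VLM + generated code) are kept
    separate from low-level action prediction (diffusion policy)."""
    # 1. The VLM translates the instruction and observation into executable code.
    generated_code = vlm.generate_code(instruction, rgbd_obs)

    # 2. Running that code against the perception APIs yields a 3D attention map
    #    (per-point weights over the scene point cloud).
    namespace = {"api": perception_api, "obs": rgbd_obs}
    exec(generated_code, namespace)
    attention_map = namespace["attention_map"]

    # 3. The attention-conditioned diffusion policy predicts actions from the
    #    point cloud plus the attention map.
    point_cloud = perception_api.get_point_cloud(rgbd_obs)
    return policy.predict(point_cloud, attention_map)
```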
Core system components:
- Code Generation with VLMs: Few-shot VLM prompting translates instructions and observations into API-invoking code. The APIs expose functions for object detection, semantic name resolution, and spatial relation reasoning within 3D point clouds, using DINOv2 and SAM for feature extraction and segmentation (a sketch of such generated code follows this list).
- 3D Attention Map Computation: The generated code calls perception APIs that yield 3D attention maps; these maps flag relevant spatial regions or object instances per task context by leveraging fused semantic features across multiple views.
- Attention-Conditioned Diffusion Policy: The low-level visuomotor controller is a denoising diffusion probabilistic model (DDPM) trained on 3D point clouds together with their associated attention maps, outputting continuous 6-DoF end-effector trajectories. PointNet++ serves as the architectural backbone for fusing geometric and attention features (a simplified conditioning sketch also appears below).
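To make the first two components concrete, here is a hedged sketch of what a VLM-generated snippet might look like when resolving an ambiguous reference such as "the mug closest to the kettle." All API names (`detect`, `get_point_cloud`, `distance`, the `point_indices` attribute) are hypothetical stand-ins for the paper's perception functions.

```python
import numpy as np

def resolve_and_attend(api, obs):
    """Hypothetical VLM-generated snippet for 'pick up the mug closest to the kettle'."""
    point_cloud = api.get_point_cloud(obs)                 # (N, 3) scene points

    mugs = api.detect(obs, "mug")                          # candidate mug instances
    kettle = api.detect(obs, "kettle")[0]                  # assume a single kettle

    # Spatial-relation reasoning disambiguates the reference "closest to the kettle".
    target_mug = min(mugs, key=lambda m: api.distance(m, kettle))

    # Convert the selected instance into a per-point 3D attention map in [0, 1].
    # (In the actual system the map could be soft, derived from fused multi-view
    # semantic features, rather than a hard instance mask.)
    attention_map = np.zeros(len(point_cloud))
    attention_map[target_mug.point_indices] = 1.0
    return point_cloud, attention_map
```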
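The third component can be summarized with a reduced conditioning sketch, assuming the attention value enters the policy as an extra per-point input channel. This is a simplified PointNet-style stand-in for the paper's PointNet++ backbone (which also uses a residual connection); the dimensions, timestep embedding, and action parameterization are placeholder assumptions.

```python
import torch
import torch.nn as nn

class AttentionConditionedPolicy(nn.Module):
    """Reduced stand-in for an attention-conditioned diffusion policy:
    per-point attention is concatenated to xyz coordinates before encoding."""

    def __init__(self, action_dim: int = 7, hidden: int = 256):
        super().__init__()
        self.point_encoder = nn.Sequential(
            nn.Linear(3 + 1, 128), nn.ReLU(),   # xyz + attention channel
            nn.Linear(128, hidden), nn.ReLU(),
        )
        self.denoiser = nn.Sequential(
            nn.Linear(hidden + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),      # predicted noise on the action
        )

    def forward(self, points, attention, noisy_action, timestep):
        # points: (B, N, 3); attention: (B, N); noisy_action: (B, action_dim); timestep: (B,)
        x = torch.cat([points, attention.unsqueeze(-1)], dim=-1)   # (B, N, 4)
        scene_feat = self.point_encoder(x).max(dim=1).values       # global max-pool -> (B, hidden)
        t = timestep.float().unsqueeze(-1) / 1000.0                # crude timestep embedding
        return self.denoiser(torch.cat([scene_feat, noisy_action, t], dim=-1))
```

At inference time, a standard DDPM sampler would iterate such a denoiser from Gaussian noise toward a clean action while the attention-weighted scene encoding stays fixed across steps.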
Empirical Analysis
Extensive experimentation yielded several key empirical findings:
- Baseline Limitations: Simulated and real-world benchmarks revealed that state-of-the-art imitation learning algorithms (ACT and several variants of Diffusion Policy) suffer substantial drops in success rate as instruction ambiguity or multi-modality increases. Merely scaling the number of demonstrations proved insufficient to overcome this barrier.
- Attention Map Validity: The VLM-generated 3D attention maps showed robust alignment with both explicit and ambiguous task instructions across diverse tasks. Quantitative evaluation in benchmarks showed >95% success in matching ground-truth task-relevant regions.
- Policy Conditioning Robustness: Training diffusion policies on attention-conditioned state representations enabled high success rates (>90%) even in highly ambiguous or unseen scenarios, both in simulation and the real world. The inclusion of 3D (as opposed to 2D) attention and a residual connection in PointNet++ further improved performance.
- System-wide Impact: Full-pipeline experiments, from language input to action output, demonstrated a substantial performance margin over language-conditioned end-to-end baselines (e.g., 86.5% vs. 5.5% on ambiguous simulation tasks). The use of VLM-generated code as an explicit intermediate representation was critical to this advantage.
A detailed failure mode analysis indicated that remaining shortcomings are predominantly concentrated in the execution phase rather than in code generation or perception, supporting the stability and reliability of the modular API/code approach.
Practical and Theoretical Implications
On the practical front, CodeDiffuser enables language-directed robotic systems to handle ambiguity robustly without scaling demonstration data or requiring laborious manual engineering of multimodal decision branches. The API-based intermediate provides fine-grained interpretability and extensibility: developers can readily augment, debug, or audit the VLM-generated code, or customize the perception APIs for additional domains.
Theoretically, the work supports a modularist position in robot policy design: explicit task semantics, when mapped by VLM-generated code to structured spatial attention, yield more data-efficient and resilient control under task ambiguity compared to monolithic end-to-end models. The delineation of high-level semantic parsing and low-level control provides a template for future architectures addressing multi-modal or under-specified input domains.
Future Directions
Key avenues for development include:
- Upgrading to More Capable VLMs/VFMs: System performance is currently upper-bounded by the capabilities of the underlying vision-language models and visual foundation models (VFMs); as these advance, so will the reliability and complexity of instruction following achievable with minimal additional engineering.
- Expanding Beyond Object-level APIs: Current APIs struggle to generalize to manipulation of deformable or amorphous regions; developing perception APIs for such contexts is essential for broader deployment.
- Reducing Annotation Overhead: Manual annotation of object reference features remains a bottleneck for large-scale deployment. Automating or crowdsourcing this process could enable wider applicability across domains and task families.
Conclusion
CodeDiffuser marks a significant advance in modular, VLM-driven robotic instruction following. Its grounding in interpretable, executable intermediate representations enables robust handling of instruction ambiguity—a property not achieved by prior end-to-end language-conditioned policies. While dependence on foundation model maturity and annotation requirements remain open challenges, the architecture lays groundwork for scalable, transparent, and flexible language-driven robotics, suggesting fertile ground for future research at the intersection of code generation, visual grounding, and multimodal embodied policy learning.