CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity (2506.16652v1)

Published 19 Jun 2025 in cs.RO, cs.CV, cs.LG, and cs.SE

Abstract: Natural language instructions for robotic manipulation tasks often exhibit ambiguity and vagueness. For instance, the instruction "Hang a mug on the mug tree" may involve multiple valid actions if there are several mugs and branches to choose from. Existing language-conditioned policies typically rely on end-to-end models that jointly handle high-level semantic understanding and low-level action generation, which can result in suboptimal performance due to their lack of modularity and interpretability. To address these challenges, we introduce a novel robotic manipulation framework that can accomplish tasks specified by potentially ambiguous natural language. This framework employs a Vision-Language Model (VLM) to interpret abstract concepts in natural language instructions and generates task-specific code - an interpretable and executable intermediate representation. The generated code interfaces with the perception module to produce 3D attention maps that highlight task-relevant regions by integrating spatial and semantic information, effectively resolving ambiguities in instructions. Through extensive experiments, we identify key limitations of current imitation learning methods, such as poor adaptation to language and environmental variations. We show that our approach excels across challenging manipulation tasks involving language ambiguity, contact-rich manipulation, and multi-object interactions.

Summary

  • The paper presents a modular system that decouples semantic understanding from action prediction using VLM-generated code to address ambiguous robotic instructions.
  • It demonstrates that attention-conditioned diffusion policies achieve over 90% success in multi-object, contact-rich scenarios where end-to-end approaches degrade sharply.
  • The framework leverages interpretable 3D attention maps from perception APIs, enabling scalable and data-efficient robotic manipulation under uncertainty.

CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity

The CodeDiffuser framework addresses a persistent issue in language-conditioned robotic manipulation: the challenge posed by ambiguous or under-specified natural language task instructions. The work presents a modular system that leverages vision-language models (VLMs) to generate interpretable, executable code, which interfaces with perception APIs to produce 3D attention maps. These maps condition a downstream attention-enhanced diffusion policy, helping it resolve task ambiguities during robotic manipulation, especially in multi-object, contact-rich, and instruction-ambiguous environments.
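
To make the intermediate representation concrete, the following is a minimal, hypothetical sketch of what VLM-generated task code might look like for the abstract's "Hang a mug on the mug tree" example. All perception API names here (get_point_cloud, detect_objects, is_occupied, distance, compute_attention_map) are illustrative assumptions rather than the paper's actual interface.

```python
# Hypothetical VLM-generated task code (API names are assumptions).
# It resolves the ambiguous instruction by committing to one valid
# (mug, branch) pair and returning a 3D attention map over the scene.

def task_program(perception):
    # Fused multi-view scene point cloud, shape (N, 3).
    points = perception.get_point_cloud()

    # Open-vocabulary detection, e.g. backed by DINOv2/SAM features.
    mugs = perception.detect_objects("mug")
    branches = perception.detect_objects("mug tree branch")

    # Ambiguity resolution: any free branch and any mug is valid;
    # pick the closest pair as a simple, interpretable heuristic.
    free_branches = [b for b in branches if not perception.is_occupied(b)]
    mug, branch = min(
        ((m, b) for m in mugs for b in free_branches),
        key=lambda pair: perception.distance(pair[0], pair[1]),
    )

    # Per-point attention highlighting the chosen mug and branch; the
    # downstream diffusion policy is conditioned on this map.
    return perception.compute_attention_map(points, targets=[mug, branch])
```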

Technical Contributions

The central innovation is the decoupling of high-level semantic understanding from low-level action prediction through explicit, code-generated attention. The approach is both modular and interpretable: VLMs translate input instructions and visual observations into code, which, when executed using a suite of perception APIs, outputs 3D representations of task-relevant areas. This separation allows for systematic reasoning about ambiguous directives without the collapse in performance typically observed in fully end-to-end policies as ambiguity increases.
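
Conceptually, the separation can be sketched as a two-stage pipeline in which raw language never reaches the low-level policy; only the attention map does. The sketch below uses placeholder objects and method names assumed purely for illustration.

```python
# Hedged sketch of the decoupled pipeline; all names are placeholders.

def run_task(instruction, observation, vlm, perception, policy):
    # High-level semantics: the VLM writes a task program which, when
    # executed against the perception API, yields a 3D attention map.
    task_program = vlm.generate_code(instruction, observation)
    attention = task_program(perception)

    # Low-level control: the diffusion policy never sees raw language;
    # it is conditioned only on scene geometry plus the attention map.
    return policy.predict_actions(observation.point_cloud, attention)
```

The core components that realize each stage are summarized below.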

Core system components:

  • Code Generation with VLMs: Few-shot VLM prompting is used to translate instructions and observations into API-invoking code. The APIs expose functions for object detection, semantic name resolution, and spatial relation reasoning within 3D point clouds, utilizing DINOv2 and SAM for feature extraction and segmentation.
  • 3D Attention Map Computation: The generated code calls perception APIs that yield 3D attention maps; these maps flag relevant spatial regions or object instances per task context by leveraging fused semantic features across multiple views.
  • Attention-conditioned Diffusion Policy: The low-level visuomotor control utilizes a DDPM trained on both 3D point clouds and their associated attention maps to output continuous 6D trajectories. PointNet++ serves as the architectural backbone for fusing geometry and attention features (a minimal sketch of this conditioning follows the list).
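
As a rough illustration of the conditioning in the last bullet, the sketch below appends the per-point attention value as an extra feature channel before the point encoder and uses the resulting embedding to condition a standard DDPM training step. The encoder and noise predictor are placeholders standing in for the paper's PointNet++-based architecture; this is an assumption-laden sketch, not the released implementation.

```python
import torch
import torch.nn.functional as F


def build_policy_observation(points, attention):
    """Fuse geometry and attention into the policy's per-point input.

    points:    (N, 3) scene point cloud
    attention: (N,)   3D attention map aligned with the points
    returns:   (N, 4) tensor: xyz plus an attention feature channel
    """
    return torch.cat([points, attention.unsqueeze(-1)], dim=-1)


def diffusion_training_step(encoder, noise_predictor, points, attention,
                            action, alpha_bar):
    """One DDPM training step for an attention-conditioned policy.

    encoder:         placeholder for a PointNet++-style backbone that maps
                     the (N, 4) observation to a global conditioning vector
    noise_predictor: predicts the noise added to the action trajectory,
                     given the noisy trajectory, timestep, and conditioning
    action:          (T, A) demonstrated action trajectory
    alpha_bar:       (K,) cumulative noise schedule (standard DDPM notation)
    """
    obs = build_policy_observation(points, attention)
    cond = encoder(obs)                                    # scene embedding
    t = torch.randint(len(alpha_bar), (1,))                # random timestep
    noise = torch.randn_like(action)
    noisy = alpha_bar[t].sqrt() * action + (1 - alpha_bar[t]).sqrt() * noise
    return F.mse_loss(noise_predictor(noisy, t, cond), noise)
```

The paper's ablations report that 3D (rather than 2D) attention and a residual connection in the point encoder further improve performance; the sketch abstracts those details away.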

Empirical Analysis

Extensive experimentation yielded several key empirical findings:

  • Baseline Limitations: Simulated and real-world benchmarks revealed that state-of-the-art imitation learning algorithms, including ACT and several variants of Diffusion Policy, experience substantial success-rate declines as instruction ambiguity or multi-modality increases. Merely scaling the number of demonstrations proved insufficient to overcome this barrier.
  • Attention Map Validity: The VLM-generated 3D attention maps showed robust alignment with both explicit and ambiguous task instructions across diverse tasks. Quantitative evaluation in benchmarks showed >95% success in matching ground-truth task-relevant regions.
  • Policy Conditioning Robustness: Training diffusion policies on attention-conditioned state representations enabled high success rates (>90%) even in highly ambiguous or unseen scenarios, both in simulation and the real world. The inclusion of 3D (as opposed to 2D) attention and a residual connection in PointNet++ further improved performance.
  • System-wide Impact: Full-pipeline experiments, from language input to action output, demonstrated a substantial performance margin over language-conditioned end-to-end baselines (e.g., 86.5% vs. 5.5% in ambiguous simulation tasks). The use of VLM-generated code as an explicit intermediate representation was critical for this advancement.

A detailed failure mode analysis indicated that remaining shortcomings are predominantly concentrated in the execution phase rather than in code generation or perception, supporting the stability and reliability of the modular API/code approach.

Implications

Practical and Theoretical

On the practical front, CodeDiffuser enables language-directed robotic systems to robustly handle ambiguity without scaling up demonstration data or requiring laborious manual engineering for multimodal decision branches. The API-based intermediate provides fine-grained interpretability and extensibility: developers can readily augment, debug, or audit the VLM-generated task code, or customize the perception APIs for additional domains.
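
As one illustration of that extensibility (a hypothetical sketch; the paper does not prescribe this interface), a domain-specific perception primitive could be exposed to the code-generating VLM simply by registering another function and documenting it in the prompt:

```python
# Hypothetical registry for perception primitives; the names here are
# assumptions for illustration, not part of the paper's released interface.
PERCEPTION_API = {}

def register(name):
    def wrap(fn):
        PERCEPTION_API[name] = fn   # also surfaced to the VLM via prompt docs
        return fn
    return wrap

@register("detect_handle")
def detect_handle(perception, object_name):
    """Return the point indices of an object's handle region."""
    obj = perception.detect_objects(object_name)[0]
    return perception.query_part(obj, "handle")   # assumed part-query helper
```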

Theoretically, the work supports a modularist position in robot policy design: explicit task semantics, when mapped by VLM-generated code to structured spatial attention, yield more data-efficient and resilient control under task ambiguity compared to monolithic end-to-end models. The delineation of high-level semantic parsing and low-level control provides a template for future architectures addressing multi-modal or under-specified input domains.

Future Directions

Key avenues for development include:

  • Upgrading to More Capable VLMs/VFMs: System performance is currently upper-bounded by foundation model capabilities; as VLMs and VFMs advance, so too will both the reliability and complexity of instruction-following achievable with minimal engineering.
  • Expanding Beyond Object-level APIs: Current APIs struggle to generalize to manipulation of deformable or amorphous regions; developing perception APIs for such contexts is essential for broader deployment.
  • Reducing Annotation Overhead: Manual annotation of object reference features remains a chokepoint for large-scale deployment. Automating or crowdsourcing this process could enable wider applicability across domains or task families.

Conclusion

CodeDiffuser marks a significant advance in modular, VLM-driven robotic instruction following. Its grounding in interpretable, executable intermediate representations enables robust handling of instruction ambiguity—a property not achieved by prior end-to-end language-conditioned policies. While dependence on foundation model maturity and annotation requirements remain open challenges, the architecture lays groundwork for scalable, transparent, and flexible language-driven robotics, suggesting fertile ground for future research at the intersection of code generation, visual grounding, and multimodal embodied policy learning.