GraspMAS: Zero-Shot Multi-Agent Grasping
- GraspMAS is a zero-shot multi-agent framework that uses Planner, Coder, and Observer to interpret natural language commands and predict grasp poses.
- It employs a closed-loop protocol in which LLMs and visual-linguistic tools iteratively refine generated code using observer feedback for robust grasp detection.
- Benchmark evaluations show GraspMAS outperforms existing methods by delivering higher success rates and greater adaptability in complex, dense environments.
GraspMAS is a zero-shot, multi-agent framework for language-driven grasp detection, enabling robotic manipulators to interpret free-form natural language commands and execute targeted grasps in real-world, densely cluttered visual environments. At its core, GraspMAS orchestrates three specialized agents—Planner, Coder, and Observer—in a closed-loop protocol, leveraging LLMs and pretrained visual-linguistic tools to achieve robust, adaptable, and training-free grasp pose prediction, formalized as a five-parameter output $g = (x, y, w, h, \theta)$, given an RGB image $I$ and a textual instruction $T$ (Nguyen et al., 23 Jun 2025).
1. System Architecture and Agent Roles
GraspMAS decomposes the grasp detection pipeline into three intercommunicating agents:
Planner: An LLM-based agent that receives $(I, T)$ and optional feedback $h$, outputting a symbolic plan $p$. The Planner resolves semantic and spatial ambiguities in $T$, defines high-level strategies, and delegates subtasks to specialized tools.
Coder: A code-generation agent that translates the Planner's plan $p$ and associated low-level APIs into executable Python code. The code instantiates and invokes tools from the toolset $\mathcal{F}$ (e.g., find, masks, compute_depth, grasp_detection; see the interface sketch below), processes $I$, and yields intermediate artifacts $r$.
Observer: A multimodal LLM that critiques the Coder's outputs. The Observer inspects the artifacts $r$ and execution logs $e$, verifies grasp feasibility, workspace constraints, and detection integrity, and produces structured textual feedback $h$ (e.g., "object too far", "mask not found", "grasp collides") that is routed back to the Planner for iterative refinement.
This closed-loop design persists over one or more cycles until a convergence or termination condition is met, upon which the grasp pose $g^\ast$ is output.
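For concreteness, a minimal sketch of the tool interfaces the Coder invokes is given below. The tool names (find, masks, compute_depth, grasp_detection) come from the paper; the signatures, the Grasp dataclass, and the return conventions are illustrative assumptions rather than the actual GraspMAS API.

```python
# Illustrative tool stubs only; names follow the paper, signatures are assumptions.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Grasp:
    x: float      # center x (pixels)
    y: float      # center y (pixels)
    w: float      # rectangle width (gripper opening)
    h: float      # rectangle height (jaw size)
    theta: float  # in-plane rotation
    score: float  # detector confidence

def find(image: np.ndarray, query: str) -> List[np.ndarray]:
    """Open-vocabulary detection: crops of regions matching the text query."""
    ...

def masks(image: np.ndarray, query: str) -> List[np.ndarray]:
    """Binary segmentation masks for objects matching the query."""
    ...

def compute_depth(image: np.ndarray) -> np.ndarray:
    """Monocular depth estimate, usable for reachability and collision checks."""
    ...

def grasp_detection(image: np.ndarray, mask: np.ndarray) -> List[Grasp]:
    """Ranked five-parameter grasp candidates restricted to the given mask."""
    ...
```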
2. Formal Problem Formulation
Given an RGB scene $I$ and a natural-language command $T$, the objective is to output a grasp $g = (x, y, w, h, \theta)$ such that:
- The manipulated object(s) match the textual referent in $T$.
- The pose satisfies geometric grasp stability criteria, as per the detection tool.
- The solution resides within the robot’s reachable workspace.
The zero-shot setting precludes task-specific finetuning; all agent operations depend on pretrained models and prompt engineering.
Plans are formalized as tool-call sequences $p = \langle f_1(a_1), \dots, f_k(a_k) \rangle$ with $f_i \in \mathcal{F}$, where the arguments $a_i$ are drawn from $T$ and from prior results. The system performs iterative plan refinement via $p_{t+1} = \mathrm{Planner}(I, T, h_t)$, with the iterative loop terminated by explicit "Solution found" feedback or a maximum iteration threshold $N$.
Grasp selection seeks to maximize a zero-shot grasp score

$$g^\ast = \arg\max_{g}\; s_{\mathrm{lang}}(g)\, s_{\mathrm{grasp}}(g)\, s_{\mathrm{reach}}(g),$$

where:
- $s_{\mathrm{lang}}(g)$: alignment of the object crop with the language query,
- $s_{\mathrm{grasp}}(g)$: confidence output of grasp_detection,
- $s_{\mathrm{reach}}(g) \in \{0, 1\}$: binary workspace reachability.
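As a hedged illustration of this selection rule, the sketch below multiplies the three terms over a candidate set; clip_similarity and in_workspace are hypothetical helpers standing in for the language-alignment and reachability checks, and the exact combination used by GraspMAS may differ.

```python
import numpy as np

def zero_shot_score(grasp, crop, query, clip_similarity, in_workspace):
    """Combine language alignment, detector confidence, and a reachability gate."""
    s_lang = clip_similarity(crop, query)           # image-text alignment in [0, 1]
    s_grasp = grasp.score                           # confidence from grasp_detection
    s_reach = 1.0 if in_workspace(grasp) else 0.0   # binary workspace check
    return s_lang * s_grasp * s_reach

def select_grasp(candidates, crops, query, clip_similarity, in_workspace):
    """Return the candidate maximizing the zero-shot grasp score."""
    scores = [zero_shot_score(g, c, query, clip_similarity, in_workspace)
              for g, c in zip(candidates, crops)]
    return candidates[int(np.argmax(scores))]
```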
3. Agent Algorithms and Coordination Protocol
Each agent executes a distinct role through prompt-structured interaction or code execution:
Planner: Constructs its prompt from a role description, tool API summaries, and in-context examples. Upon receiving $(I, T, h)$, it requests the LLM to generate a stepwise plan via tool calls.
Coder: Builds a Python code template, populates it with an instantiation of each planned tool call $f_i(a_i)$, and executes the code in a managed environment. Errors are captured as logs $e$, which supplement the Observer's input.
Observer: Receives visual and log outputs, assembles a multimodal prompt, and utilizes an LLM to appraise grasp validity, collision, and perceptual failures.
Coordination is governed by a loop:
```python
h = ""                          # Observer feedback, empty on the first pass
for t in range(1, N + 1):       # at most N refinement iterations
    p = Planner(I, T, h)        # symbolic plan from image, instruction, feedback
    r, e = Coder(p, I)          # execute generated code: results r, error logs e
    h = Observer(r, e)          # multimodal critique as textual feedback
    if "Solution found" in h:
        break
g_final = extract_grasp(r)
```
4. Inference Pipeline and Zero-Shot Operation
The GraspMAS pipeline operates without any task-specific gradient-based optimization, relying solely on pre-established LLM and tool capabilities. Input images are resized to a canonical resolution, and language commands undergo minimal normalization.
Zero-shot inference proceeds as follows:
- Prepare the Planner’s system prompt, providing tool API docs and 2 in-context plan code examples.
- Invoke the Planner for the initial plan.
- Iterate Coder and Observer per the closed-loop protocol until convergence.
All learning in GraspMAS consists of LLM prompt engineering and tool calls; no model parameters are updated during deployment.
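As a rough sketch of that prompt-engineering step, the snippet below assembles a Planner system prompt from a role description, tool API summaries, and two in-context examples; every string and the build_planner_prompt helper are hypothetical and only mirror the structure described above.

```python
# Hypothetical prompt assembly; the actual GraspMAS prompt wording is not reproduced.
ROLE = "You are a Planner that decomposes grasping instructions into tool calls."
TOOL_DOCS = "\n".join([
    "find(image, query) -> crops of regions matching the text query",
    "masks(image, query) -> segmentation masks for matching objects",
    "compute_depth(image) -> per-pixel depth estimate",
    "grasp_detection(image, mask) -> ranked five-parameter grasp candidates",
])
IN_CONTEXT_EXAMPLES = ["<example plan/code 1>", "<example plan/code 2>"]  # placeholders

def build_planner_prompt(instruction: str, feedback: str = "") -> str:
    """Concatenate role, tool docs, examples, the command, and any Observer feedback."""
    parts = [ROLE, "Available tools:", TOOL_DOCS, *IN_CONTEXT_EXAMPLES,
             f"Instruction: {instruction}"]
    if feedback:
        parts.append(f"Observer feedback: {feedback}")
    return "\n\n".join(parts)
```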
5. Experimental Evaluation
GraspMAS is evaluated on two large-scale benchmarks:
- GraspAnything++: 1 million images and 3 million objects with language queries; evaluations focus on the official 100,000-query test split.
- OCID-VLG: 1,763 highly cluttered tabletop scenes with 17,700 test query–image pairs.
Success Rate (SR)—the primary metric—requires that for a predicted grasp $g$ and ground truth $g^\ast$: $\mathrm{IoU}(g, g^\ast) > 0.25$ and $|\theta - \theta^\ast| < 30^\circ$, where $\mathrm{IoU}$ is the Jaccard index between the rotated grasp rectangles.
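A minimal sketch of how this rectangle metric can be checked is given below, using shapely for rotated-rectangle IoU; the (x, y, w, h, θ) ordering and the degree convention for θ are assumptions made for illustration.

```python
from shapely.geometry import Polygon
from shapely import affinity

def grasp_polygon(x, y, w, h, theta):
    """A w-by-h rectangle centered at (x, y), rotated by theta degrees."""
    rect = Polygon([(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)])
    rect = affinity.rotate(rect, theta, origin=(0, 0))
    return affinity.translate(rect, x, y)

def is_success(pred, gt, iou_thresh=0.25, angle_thresh=30.0):
    """Rectangle metric: IoU above threshold and small orientation difference."""
    p, q = grasp_polygon(*pred), grasp_polygon(*gt)
    iou = p.intersection(q).area / p.union(q).area
    # Grasp rectangles are symmetric under 180-degree rotation, so wrap the gap to [0, 90].
    d_theta = abs(pred[4] - gt[4]) % 180.0
    d_theta = min(d_theta, 180.0 - d_theta)
    return iou > iou_thresh and d_theta < angle_thresh
```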
Comparative performance:
| Method | OCID-VLG SR | GraspAny++ SR | Inference (s) |
|---|---|---|---|
| OWLv2 + RAGT (E2E) | 0.22 | 0.24 | 0.37 |
| GroundingDINO + RAGT (E2E) | 0.27 | 0.33 | 0.32 |
| QWEN2-VL + RAGT (E2E) | 0.41 | 0.48 | 1.13 |
| OWG (Compositional) | 0.53 | 0.42 | 3.35 |
| ViperGPT (Compositional) | 0.44 | 0.57 | 1.04 |
| GraspMAS | 0.62 | 0.68 | 2.12 |
Qualitative capabilities: The framework handles complex language and visual reasoning, such as selecting the "second bottle from the left" or the object satisfying "used for cutting," leveraging Planner strategies like spatial sorting and property verification through tool calls, and Observer confirmation of affordances.
6. Ablation and Sensitivity Analyses
Ablation studies clarify the function of each agent:
- Omitting the Observer loop (Planner→Coder only) lowers OCID-VLG SR from 0.62 to 0.49.
- Disabling Planner refinement (i.e., single pass) yields SR .
- Removing Coder error feedback (allowing code execution failures) yields SR .
The system’s performance degrades as scene clutter increases (SR drops by 12% as object count increases from 3 to 15). Functional queries (e.g., "to cut," "to write") are better resolved when Planner invokes semantic LLM queries, with SR improving by +7% when this tool is enabled (Nguyen et al., 23 Jun 2025).
7. Context and Implications
GraspMAS demonstrates that zero-shot, closed-loop multi-agent coordination leveraging LLM planning, code generation, multimodal feedback, and pretrained visual-language tool APIs achieves state-of-the-art results in language-guided robotic grasp detection, surpassing both compositional and end-to-end baselines. The architecture's modularity supports incremental tool expansion and prompt adaptation, mitigating the brittleness and domain-adaptation costs of traditional, retraining-based methods. A plausible implication is the broader potential for LLM-driven, feedback-intensive agents in robotics beyond grasping, wherever language-to-perception-to-action tasks require complex hierarchical reasoning and robust failure handling (Nguyen et al., 23 Jun 2025).