GraspMAS: Zero-Shot Multi-Agent Grasping

Updated 2 December 2025
  • GraspMAS is a zero-shot multi-agent framework that uses Planner, Coder, and Observer to interpret natural language commands and predict grasp poses.
  • It employs a closed-loop protocol with LLMs and visual-linguistic tools to iteratively refine code and feedback for robust grasp detection.
  • Benchmark evaluations show GraspMAS outperforms existing methods by delivering higher success rates and greater adaptability in complex, dense environments.

GraspMAS is a zero-shot, multi-agent system framework for language-driven grasp detection, enabling robotic manipulators to interpret free-form natural language commands and execute targeted grasps in real-world, densely cluttered visual environments. At its core, GraspMAS orchestrates three specialized agents—Planner, Coder, and Observer—in a closed-loop protocol, leveraging LLMs and pretrained visual-linguistic tools to achieve robust, adaptable, and training-free grasp pose prediction, formalized as a five-parameter output $g = (x, y, w, h, \theta)$, given RGB image input $I$ and textual instruction $T$ (Nguyen et al., 23 Jun 2025).
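
As a minimal, illustrative encoding of this five-parameter output (the class and field names are assumptions for illustration, not taken from the paper):

from dataclasses import dataclass

@dataclass
class Grasp:
    """Five-parameter planar grasp g = (x, y, w, h, theta)."""
    x: float      # grasp-rectangle center, image x-coordinate
    y: float      # grasp-rectangle center, image y-coordinate
    w: float      # rectangle width (gripper opening)
    h: float      # rectangle height (gripper jaw size)
    theta: float  # in-plane rotation angle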

1. System Architecture and Agent Roles

GraspMAS decomposes the grasp detection pipeline into three intercommunicating agents:

Planner: An LLM-based agent that receives $(I, T)$ and optional feedback $h$, outputting a symbolic plan $p = [\text{tool\_call}_1, \ldots, \text{tool\_call}_k]$. The Planner resolves semantic and spatial ambiguities in $T$, defines high-level strategies, and delegates subtasks to specialized tools.

Coder: A code-generation agent that translates the Planner’s plan and associated low-level APIs into executable Python code $C(p)$. The code instantiates and invokes tools from the set $\mathcal{A}$ (e.g., find, masks, compute_depth, grasp_detection), processes $I$, and yields intermediate artifacts $r = \{\text{cropped images, depth maps, bounding boxes, grasp poses, error logs}\}$.

Observer: A multi-modal LLM that critiques the Coder’s outputs. The Observer inspects $r$, verifies grasp feasibility, workspace constraints, and detection integrity, and produces structured textual feedback $h$ (e.g., "object too far", "mask not found", "grasp collides") that is routed back to the Planner for iterative refinement.

This closed-loop design persists over one or more cycles until a convergence or termination condition is met, upon which the grasp pose $g_\text{final}$ is output.
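
The interfaces below sketch one possible way to type these three roles in Python; the class names and parameter names are illustrative assumptions that mirror the description above, not the paper’s API.

from typing import Any, Protocol

class PlannerAgent(Protocol):
    def __call__(self, image: Any, instruction: str, feedback: str) -> list[dict]:
        """Return a symbolic plan: an ordered list of tool calls with arguments."""
        ...

class CoderAgent(Protocol):
    def __call__(self, plan: list[dict], image: Any) -> tuple[dict, str]:
        """Generate and execute code for the plan; return artifacts r and an error log e."""
        ...

class ObserverAgent(Protocol):
    def __call__(self, artifacts: dict, error_log: str) -> str:
        """Critique results; return textual feedback h (e.g., 'grasp collides', 'Solution found')."""
        ...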

2. Formal Problem Formulation

Given $I \in \mathbb{R}^{H \times W \times 3}$ (RGB scene) and $T$ (natural-language command), the objective is to output a grasp $g = (x, y, w, h, \theta)$ such that:

  • The manipulated object(s) match the textual referent in $T$.
  • The pose satisfies geometric grasp stability criteria, as per the detection tool.
  • The solution resides within the robot’s reachable workspace.

The zero-shot setting precludes task-specific finetuning; all agent operations depend on pretrained models and prompt engineering.

Plans are formalized as sequences: $p = [(a_1, \phi_1), (a_2, \phi_2), \ldots, (a_k, \phi_k)], \quad a_i \in \mathcal{A}$, where $\phi_i$ are arguments drawn from $T$ and prior results. The system performs iterative plan refinement via $p \leftarrow \text{Plan}(I, T, h)$, with the iterative loop terminated by explicit feedback or a maximum iteration threshold $N$.
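
For concreteness, such a plan might be encoded as an ordered list of (tool, arguments) pairs; the tool names come from the set $\mathcal{A}$ described above, while the argument keys below are assumptions for illustration only.

# One possible concrete encoding of p = [(a_1, phi_1), ..., (a_k, phi_k)].
plan = [
    ("find",            {"query": "second bottle from the left"}),
    ("masks",           {"target": "detected box 1"}),
    ("compute_depth",   {"region": "object mask"}),
    ("grasp_detection", {"input": "masked crop + depth"}),
]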

Grasp selection seeks to maximize a zero-shot grasp score: $g^* = \arg\max_g S(g \mid I, T), \quad S(g \mid I, T) = S_\text{lang}(g, T) \cdot S_\text{geom}(g, I) \cdot S_\text{ws}(g)$, where:

  • $S_\text{lang}$: alignment of the crop with the language query,
  • $S_\text{geom}$: output of grasp_detection,
  • $S_\text{ws}$: binary workspace reachability.
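
A minimal sketch of this selection rule, assuming the three component scores are supplied as callables by the underlying tools (the function names are illustrative, not the paper’s API):

def grasp_score(g, image, text, s_lang, s_geom, in_workspace):
    """S(g | I, T) = S_lang(g, T) * S_geom(g, I) * S_ws(g).

    s_lang and s_geom are scoring callables backed by the pretrained
    visual-linguistic and grasp-detection tools; in_workspace is a binary
    reachability predicate. All three are placeholders for illustration.
    """
    s_ws = 1.0 if in_workspace(g) else 0.0
    return s_lang(g, text) * s_geom(g, image) * s_ws

def select_grasp(candidates, image, text, s_lang, s_geom, in_workspace):
    """g* = argmax_g S(g | I, T) over a set of candidate grasps."""
    return max(candidates,
               key=lambda g: grasp_score(g, image, text, s_lang, s_geom, in_workspace))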

3. Agent Algorithms and Coordination Protocol

Each agent executes a distinct role through prompt-structured interaction or code execution:

Planner: Constructs its prompt from a role description, tool API summaries, and in-context examples. Upon receiving $(I, T, h)$, it requests the LLM to generate a stepwise plan via tool calls.

Coder: Builds a Python code template, populates it with instantiations of each $a_i \in \mathcal{A}$ using the planned arguments, and executes the code in a managed environment. Errors are captured as logs, which supplement the Observer’s input.
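
For illustration, the executable code emitted by the Coder for a plan like the one sketched in Section 2 might resemble the following; the tool names are those listed above, while their exact signatures are assumptions.

def coder_program(image, find, masks, compute_depth, grasp_detection):
    """Hypothetical Coder output for 'grasp the second bottle from the left'.
    The tools are injected by the managed execution environment; the call
    signatures shown here are illustrative assumptions."""
    boxes = find(image, "bottle")                    # detect candidate bottles
    boxes = sorted(boxes, key=lambda b: b.x_center)  # spatial sort, left to right
    target = boxes[1]                                # "second from the left"
    mask = masks(image, target)                      # segment the selected object
    depth = compute_depth(image, mask)               # depth for the masked region
    return grasp_detection(image, mask, depth)       # predict (x, y, w, h, theta)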

Observer: Receives visual and log outputs, assembles a multimodal prompt, and utilizes an LLM to appraise grasp validity, collision, and perceptual failures.

Coordination is governed by a loop:

h = ""
for t in range(1, N+1):
    p = Planner(I, T, h)
    r, e = Coder(p, I)
    h = Observer(r, e)
    if h indicates "Solution found": break
g_final = extract_grasp(r)
The agents communicate until Observer feedback signals convergence or a maximum number of attempts is reached.

4. Inference Pipeline and Zero-Shot Operation

The GraspMAS pipeline operates without any task-specific gradient-based optimization, relying solely on pre-established LLM and tool capabilities. Input images are resized to a canonical resolution (e.g., $640 \times 640$), and language commands undergo minimal normalization.

Zero-shot inference proceeds as follows:

  1. Prepare the Planner’s system prompt, providing tool API docs and two in-context $(I, T) \rightarrow \text{plan} \rightarrow \text{code} \rightarrow g$ examples.
  2. Invoke the Planner for the initial plan.
  3. Iterate Coder and Observer per the closed-loop protocol until convergence.

All learning in GraspMAS consists of LLM prompt engineering and tool calls; no model parameters are updated during deployment.
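
A sketch of step 1 above, assuming a plain-text prompt template; the wording, the TOOL_API_DOCS content, the tool signatures it lists, and the build_planner_prompt helper are illustrative assumptions rather than the paper’s actual prompt.

TOOL_API_DOCS = """\
find(image, query) -> boxes             # open-vocabulary object grounding (assumed signature)
masks(image, box) -> mask               # instance segmentation of a detected box
compute_depth(image, mask) -> depth     # depth estimate for the masked region
grasp_detection(image, mask, depth) -> (x, y, w, h, theta)
"""

IN_CONTEXT_EXAMPLES = [
    "Example 1: (I, T) -> plan -> code -> g ...",  # two worked examples, abridged here
    "Example 2: (I, T) -> plan -> code -> g ...",
]

def build_planner_prompt(instruction: str, feedback: str = "") -> str:
    """Compose the Planner prompt from its role, tool docs, examples, and feedback."""
    return (
        "You are the Planner of a multi-agent grasp-detection system.\n"
        f"Available tools:\n{TOOL_API_DOCS}\n"
        "Worked examples:\n" + "\n".join(IN_CONTEXT_EXAMPLES) + "\n"
        f"Instruction: {instruction}\n"
        f"Observer feedback: {feedback or 'none'}\n"
        "Produce a step-by-step plan of tool calls."
    )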

5. Experimental Evaluation

GraspMAS is evaluated on two large-scale benchmarks:

  • GraspAnything++: 1 million images and 3 million objects with language queries; evaluations focus on the official 100,000-query test split.
  • OCID-VLG: 1,763 highly cluttered tabletops with 17,700 test query–image pairs.

Success Rate (SR)—the primary metric—requires that for a predicted grasp $\hat{g}$ and ground truth $g$: $\text{IoU} = \frac{\text{area}(\hat{g} \cap g)}{\text{area}(\hat{g} \cup g)} \geq 0.25, \quad \Delta\theta = |\hat{\theta} - \theta| \leq 30^\circ$, where $\text{SR} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}[\text{success}_i]$.
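
A minimal sketch of this metric, assuming grasps are given as (x, y, w, h, theta) tuples with theta in radians and using shapely for the rotated-rectangle overlap (the library choice and helper names are not from the paper):

import math
from shapely.geometry import Polygon  # assumed dependency for rotated-rectangle IoU

def rect_polygon(x, y, w, h, theta):
    """Corner polygon of a rotated grasp rectangle (x, y, w, h, theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return Polygon([(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in offsets])

def is_success(pred, gt, iou_thresh=0.25, angle_thresh_deg=30.0):
    """Success criterion: IoU >= 0.25 and |theta_hat - theta| <= 30 degrees."""
    p, g = rect_polygon(*pred), rect_polygon(*gt)
    iou = p.intersection(g).area / p.union(g).area
    d_theta = abs(math.degrees(pred[4] - gt[4]))  # benchmarks may additionally fold this modulo 180
    return iou >= iou_thresh and d_theta <= angle_thresh_deg

def success_rate(predictions, ground_truths):
    """SR = (1/N) * sum_i 1[success_i] over the test set."""
    wins = sum(is_success(p, g) for p, g in zip(predictions, ground_truths))
    return wins / len(predictions)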

Comparative performance:

Method                         OCID-VLG SR    GraspAnything++ SR    Inference (s)
OWLv2 + RAGT (E2E)             0.22           0.24                  0.37
GroundingDINO + RAGT (E2E)     0.27           0.33                  0.32
QWEN2-VL + RAGT (E2E)          0.41           0.48                  1.13
OWG (Compositional)            0.53           0.42                  3.35
ViperGPT (Compositional)       0.44           0.57                  1.04
GraspMAS                       0.62           0.68                  2.12

Qualitative capabilities: The framework handles complex language and visual reasoning, such as selecting the "second bottle from the left" or the object satisfying "used for cutting," leveraging Planner strategies like spatial sorting and property verification through tool calls, and Observer confirmation of affordances.

6. Ablation and Sensitivity Analyses

Ablation studies clarify the function of each agent:

  • Omitting the Observer loop (Planner→Coder only) lowers OCID-VLG SR from 0.62 to 0.49.
  • Disabling Planner refinement (i.e., single pass) yields SR = 0.53.
  • Removing Coder error feedback (allowing code execution failures) yields SR = 0.57.

The system’s performance degrades as scene clutter increases (SR drops by 12% as object count increases from 3 to 15). Functional queries (e.g., "to cut," "to write") are better resolved when Planner invokes semantic LLM queries, with SR improving by +7% when this tool is enabled (Nguyen et al., 23 Jun 2025).

7. Context and Implications

GraspMAS demonstrates that zero-shot, closed-loop multi-agent coordination leveraging LLM planning, code generation, multimodal feedback, and pretrained visual-language tool APIs achieves state-of-the-art results for language-guided robotic grasp detection, surpassing both compositional and end-to-end baselines. The architecture’s modularity supports incremental tool expansion and prompt adaptation, mitigating the brittleness and domain-adaptation costs of traditional, retraining-based methods. A plausible implication is the broader potential for LLM-driven, feedback-intensive agents in robotics beyond grasping, wherever language-to-perception-to-action tasks require complex hierarchical reasoning and robust failure handling (Nguyen et al., 23 Jun 2025).
