VLAD-Grasp: Zero-Shot Robotic Grasping

Updated 15 November 2025
  • The paper introduces VLAD-Grasp, a zero-shot vision–language framework that detects antipodal grasps by synthesizing a geometric goal image and aligning 3D point clouds.
  • It uses a three-stage prompting pipeline with a pretrained VLM to generate a rod-impaled image, then lifts the representation to 3D via depth prediction and segmentation.
  • The method achieves competitive benchmark performance and robust real-world generalization, despite inference costs and occasional generation pitfalls.

VLAD-Grasp is a vision–language-model-assisted zero-shot approach for detecting antipodal robotic grasps without task-specific training or grasp dataset curation. The core innovation is to leverage large vision–language models (VLMs) to synthesize a geometric “goal image” in which a straight rod cleanly “impales” the target object; the rod’s axis encodes an executable antipodal grasp. VLAD-Grasp then lifts the synthesized representation to 3D using depth and segmentation, and solves for a grasp pose in the observed scene by point cloud alignment. This paradigm achieves performance competitive with or superior to state-of-the-art supervised methods on standard benchmarks and generalizes to novel objects in real-world trials, demonstrating the power of pretrained vision–language priors in robotic manipulation.

1. Vision–Language-Driven Grasp Synthesis

VLAD-Grasp recasts grasp pose generation as an image editing problem to be solved by a pretrained VLM, specifically GPT-5 with image-conditioned generation capability. The procedure comprises a three-stage prompting pipeline:

  • Constraint Prompt ($T_0^g$): The original RGB crop of the object ($I_S$) is provided with a prompt detailing the physical constraints (e.g., “The gripper is a two-finger parallel jaw, maximum opening 8 cm; the rod must be visible end-to-end; no other part of the scene should be altered.”).
  • Reasoning Prompt ($T_1^g$): The model introspects over $T_0^g$ and $I_S$, outputting a single-sentence instruction to produce an image of a straight rod passing exactly through two antipodal object contact points.
  • Generation Prompt ($T_2^g$ and $T_c^g$): The initial crop, geometric prompt, and further textual constraints (e.g., preserve texture and lighting, do not affect the background, insert only a thin rod) are concatenated, along with an inpainting mask $M_S^b$ that only allows editing of the object region. The VLM’s multimodal generation API, $p_\theta(I_G \mid I_S, T_2^g, T_c^g, M_S^b)$, produces a $512 \times 512$ image $I_G$ in which a slender rod visually “impales” the object.

This process encodes the desired grasp as an explicit geometric intervention, allowing the pose to be inferred by subsequent geometric computation rather than learned from prior annotated grasp data.
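The pipeline can be sketched as follows. This is a minimal illustration only: `vlm_chat` and `vlm_edit_image` are hypothetical wrappers around a multimodal chat/image-editing API, and the prompt strings paraphrase the constraints above rather than reproduce the paper's exact prompts.

```python
# Minimal sketch of the three-stage prompting pipeline (Section 1).
# `vlm_chat` and `vlm_edit_image` are hypothetical callables (assumptions),
# not the paper's actual interface.

def synthesize_goal_image(vlm_chat, vlm_edit_image, object_crop, object_mask):
    """Return a 512x512 goal image I_G with a thin rod 'impaling' the object."""
    # Stage 1: constraint prompt T_0^g -- physical constraints of the gripper.
    t0 = ("The gripper is a two-finger parallel jaw, maximum opening 8 cm; "
          "the rod must be visible end-to-end; "
          "no other part of the scene should be altered.")

    # Stage 2: reasoning prompt T_1^g -- the VLM introspects over T_0^g and I_S
    # and emits a single-sentence editing instruction.
    t1 = vlm_chat(
        image=object_crop,
        prompt=t0 + "\nIn one sentence, describe an image edit that inserts a "
                    "straight rod passing exactly through two antipodal "
                    "contact points of the object.",
    )

    # Stage 3: generation prompts T_2^g + T_c^g -- image-conditioned generation,
    # restricted to the object region by the inpainting mask M_S^b.
    t_c = ("Preserve texture, lighting, and background; "
           "insert only a thin straight rod.")
    goal_image = vlm_edit_image(
        image=object_crop,
        mask=object_mask,          # editable region M_S^b
        prompt=t1 + "\n" + t_c,
        size=(512, 512),
    )
    return goal_image
```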

2. 3D Lifting via Depth and Segmentation

The generated goal image $I_G$ is lifted to 3D using two off-the-shelf computer vision models, neither of which is trained or fine-tuned for this task:

  • Monocular Depth Prediction: ML-Depth-Pro predicts a dense metric depth map $D_G$ from $I_G$ using a ViT-based architecture. The original network was trained on real and synthetic data with an $\ell_1$ loss ($L_{depth} = \sum_{u,v} |D_G(u,v) - \hat{D}(u,v)|$), but no retraining or adaptation is performed.
  • Instance Segmentation: The Segment Anything Model (SAM) produces binary masks for the object ($M_G^o$) and the rod ($M_G^r$), using canonical prompt-based segmentation.

Both models are used strictly as pretrained components, introducing no new loss, weights, or adaptations in VLAD-Grasp.
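As an illustration of how these two pretrained components could be wired together, the sketch below uses Apple's `depth_pro` package and the `segment_anything` library. The point-prompt locations for the object and rod are assumptions (the text says only that prompt-based segmentation is used), and the API usage follows the public repositories and may differ slightly by release.

```python
import numpy as np
import depth_pro                                    # Apple's ML-Depth-Pro package
from segment_anything import sam_model_registry, SamPredictor

def lift_goal_image(goal_image_rgb, object_point_xy, rod_point_xy, sam_checkpoint):
    """Lift the generated goal image I_G to a depth map D_G and masks M_G^o, M_G^r.

    goal_image_rgb: HxWx3 uint8 RGB array; object_point_xy / rod_point_xy are
    (x, y) pixel prompts for SAM (how the prompts are chosen is an assumption).
    """
    # --- monocular metric depth, pretrained weights only, no adaptation ---
    depth_model, transform = depth_pro.create_model_and_transforms()
    depth_model.eval()
    prediction = depth_model.infer(transform(goal_image_rgb))
    depth_g = prediction["depth"].detach().cpu().numpy()   # dense metric depth D_G

    # --- prompt-based instance segmentation with the Segment Anything Model ---
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(goal_image_rgb)

    def best_mask(point_xy):
        masks, scores, _ = predictor.predict(
            point_coords=np.array([point_xy]), point_labels=np.array([1]))
        return masks[np.argmax(scores)]                     # best-scoring binary mask

    mask_obj = best_mask(object_point_xy)                   # object mask M_G^o
    mask_rod = best_mask(rod_point_xy)                      # rod mask M_G^r
    return depth_g, mask_obj, mask_rod
```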

3. Point Cloud Construction and Alignment

To enable grasp planning, the masks and predicted depth maps are converted to 3D point clouds via calibrated pinhole back-projection:

  • Back-Projection: For each foreground pixel $(u_i, v_i)$ with $M^o(u_i, v_i) = 1$, a 3D point is computed as

$$p_i = K^{-1} [u_i, v_i, 1]^\top \cdot D_G(u_i, v_i)$$

where $K$ denotes the $3 \times 3$ camera intrinsics matrix. Two point clouds are created:
  • $P_G^o \subset \mathbb{R}^3$: the generated object point cloud.
  • $P_S^o \subset \mathbb{R}^3$: the observed scene object point cloud, built identically from the real RGB-D frame and mask $M_S^o$.

The rod point cloud $P_G^r$ is similarly computed from $(D_G, M_G^r)$.
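A minimal NumPy sketch of this back-projection, assuming a standard pinhole intrinsics matrix `K` and the masks and depth maps defined above:

```python
import numpy as np

def backproject(depth, mask, K):
    """Back-project masked pixels to 3D: p_i = K^{-1} [u_i, v_i, 1]^T * D(u_i, v_i)."""
    v_idx, u_idx = np.nonzero(mask)                  # pixels with M(u, v) = 1
    d = depth[v_idx, u_idx]                          # metric depths at those pixels
    pix_h = np.stack([u_idx, v_idx, np.ones_like(u_idx)], axis=0).astype(np.float64)
    rays = np.linalg.inv(K) @ pix_h                  # unscaled camera rays, shape (3, N)
    return (rays * d).T                              # (N, 3) point cloud in camera frame

# P_G^o and P_G^r come from the generated image; P_S^o from the observed RGB-D frame:
# P_G_o = backproject(depth_g, mask_obj, K)
# P_G_r = backproject(depth_g, mask_rod, K)
# P_S_o = backproject(depth_scene, mask_scene_obj, K)
```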

Alignment between the generated and real object point clouds is achieved in two steps:

  • PCA-Based Orientation Alignment: Both point clouds are centered and their covariance matrices $\Sigma$ decomposed. For each sign-flip or axis-swap triplet $(i,j,k) \in \{\pm 1\}^3$, the candidate rotation matrix is formed as

$$R^{i,j,k} = [v_1^S, v_2^S, v_3^S] \cdot \operatorname{diag}\left(i\sqrt{\lambda_1^G/\lambda_1^S}, \; j\sqrt{\lambda_2^G/\lambda_2^S}, \; k\sqrt{\lambda_3^G/\lambda_3^S}\right) \cdot [v_1^G, v_2^G, v_3^G]^{-1}$$

  • Chamfer Distance Refinement: The optimal rotation is selected by minimizing the Chamfer distance,

$$\operatorname{CD}(P, Q) = \sum_{p\in P} \min_{q\in Q} \|p-q\|^2 + \sum_{q\in Q} \min_{p\in P} \|q-p\|^2$$

over candidate sign patterns. The resulting rigid-body transform $T_{S \gets G}$, parameterized as

$$T_{S \gets G} = T_{trans}(\mu_S) \cdot \begin{bmatrix} R^* & 0 \\ 0 & 1 \end{bmatrix} \cdot T_{trans}(-\mu_G)$$

(with $R^*$ the optimal rotation and $\mu_S, \mu_G$ the centroids), aligns the rod and object geometry for grasp pose recovery.
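The two alignment steps can be sketched as follows. This follows the $R^{i,j,k}$ formula above literally (including the $\sqrt{\lambda^G/\lambda^S}$ scale ratios and the eight sign patterns; any axis-swap candidates the paper may use are omitted), uses SciPy KD-trees for the Chamfer terms, and is an illustrative reimplementation rather than the authors' code.

```python
import itertools
import numpy as np
from scipy.spatial import cKDTree

def chamfer(P, Q):
    """Symmetric squared Chamfer distance between two (N, 3) point clouds."""
    d_pq, _ = cKDTree(Q).query(P)       # nearest neighbour in Q for each p in P
    d_qp, _ = cKDTree(P).query(Q)       # nearest neighbour in P for each q in Q
    return np.sum(d_pq ** 2) + np.sum(d_qp ** 2)

def align_pca_chamfer(P_G, P_S):
    """Estimate T_{S<-G} mapping the generated object cloud onto the scene cloud."""
    mu_G, mu_S = P_G.mean(axis=0), P_S.mean(axis=0)
    # Eigen-decomposition of the centered covariance matrices (ascending eigenvalues).
    lam_G, V_G = np.linalg.eigh(np.cov((P_G - mu_G).T))
    lam_S, V_S = np.linalg.eigh(np.cov((P_S - mu_S).T))

    best_T, best_cd = None, np.inf
    for signs in itertools.product([1.0, -1.0], repeat=3):   # (i, j, k) in {+-1}^3
        # Candidate R^{i,j,k} as in the formula above.
        D = np.diag(np.array(signs) * np.sqrt(lam_G / lam_S))
        R = V_S @ D @ np.linalg.inv(V_G)
        # Compose T_trans(mu_S) . [R 0; 0 1] . T_trans(-mu_G) as a 4x4 transform.
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = mu_S - R @ mu_G
        aligned = (R @ (P_G - mu_G).T).T + mu_S
        cd = chamfer(aligned, P_S)      # keep the candidate with the lowest Chamfer distance
        if cd < best_cd:
            best_T, best_cd = T, cd
    return best_T

# Usage: T_SG = align_pca_chamfer(P_G_o, P_S_o)
# Rod in the scene frame: P_S_r = (T_SG[:3, :3] @ P_G_r.T).T + T_SG[:3, 3]
```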

4. Grasp Pose Localization

Following point cloud alignment, the rod is transformed into the scene’s coordinate frame: $P_S^r = T_{S\gets G} \cdot P_G^r$. Projecting $P_S^r$ into the (scene) image yields a mask $M_S^r$, within which the antipodal contacts are inferred and mapped to a standard 5D grasp parameterization $(x, y, \theta, w, h)$:

  • The end points of the largest gap in the rod’s mask define the antipodal contact pixels $(u_1, v_1)$ and $(u_2, v_2)$.
  • The grasp axis $\theta$ is given by $\arctan((v_2 - v_1)/(u_2 - u_1))$.
  • The pixel-wise gap length, mapped through depth and $K^{-1}$, yields the grasp width $w$.
  • The grasp “center” is back-projected to 3D:

$$p_c = K^{-1}[u_c, v_c, 1]^\top d_c$$

  • The approach vector is fixed as $a = (0, 0, -1)$; the finger axis is normalized from $K^{-1}[\cos\theta, \sin\theta, 0]^\top$, and the lateral axis is given by $a \times f$.

These axes define the $3 \times 3$ orientation for the gripper's camera-frame pose, with $p_c$ as translation.
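A sketch of this final pose recovery, given the two contact pixels extracted from $M_S^r$: the rounding of pixel indices, the use of `arctan2`, and the column ordering of the rotation matrix are illustrative choices under stated assumptions, not details taken from the paper.

```python
import numpy as np

def grasp_pose_from_contacts(contact_1, contact_2, depth_scene, K):
    """Map two antipodal contact pixels to a camera-frame grasp pose.

    contact_1, contact_2: (u, v) pixel coordinates of the rod-gap end points.
    """
    (u1, v1), (u2, v2) = contact_1, contact_2
    uc, vc = (u1 + u2) / 2.0, (v1 + v2) / 2.0        # grasp center pixel (u_c, v_c)
    theta = np.arctan2(v2 - v1, u2 - u1)             # arctan((v2 - v1)/(u2 - u1)), robust branch

    K_inv = np.linalg.inv(K)
    d_at = lambda u, v: depth_scene[int(round(v)), int(round(u))]
    # Back-project contacts and center; grasp width w is the metric contact gap.
    p1 = K_inv @ np.array([u1, v1, 1.0]) * d_at(u1, v1)
    p2 = K_inv @ np.array([u2, v2, 1.0]) * d_at(u2, v2)
    w = float(np.linalg.norm(p1 - p2))
    p_c = K_inv @ np.array([uc, vc, 1.0]) * d_at(uc, vc)   # p_c = K^{-1}[u_c, v_c, 1]^T d_c

    # Orientation: fixed top-down approach a, finger axis f from theta, lateral a x f.
    a = np.array([0.0, 0.0, -1.0])
    f = K_inv @ np.array([np.cos(theta), np.sin(theta), 0.0])
    f = f / np.linalg.norm(f)
    lateral = np.cross(a, f)
    R = np.column_stack([f, lateral, a])             # approximately orthonormal 3x3 orientation
    return p_c, R, theta, w
```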

5. Benchmarking and Experimental Evaluation

VLAD-Grasp’s performance was quantitatively evaluated on the Cornell and Jacquard benchmarks using the standard 25% IoU metric. Table 1 summarizes grasp success rates as mean $\pm$ standard deviation:

Method                    Cornell (%)      Jacquard (%)
GR-ConvNet                72.14 ± 41.19    59.62 ± 46.13
GG-CNN                    74.28 ± 40.30    71.48 ± 42.76
SE-ResUNet                86.07 ± 28.65    88.14 ± 30.43
GraspSAM                  67.50 ± 44.19    73.71 ± 40.95
LGD w/ Query              37.98 ± 44.81    24.40 ± 41.39
LGD w/o Query             32.26 ± 42.47    24.40 ± 41.39
VLAD-Grasp (zero-shot)    91.43 ± 28.00    85.43 ± 36.15

Despite the absence of task-specific training, VLAD-Grasp achieves the highest mean success rate on Cornell and the second-highest on Jacquard. For real-world zero-shot trials on a Franka Emika Panda with a 2-finger parallel jaw gripper and wrist-mounted Orbbec Femto Mega RGB-D camera, the protocol entailed 5 grasps per object across 10 previously unseen household objects. Success, defined as the object remaining in-gripper after a 5 s vertical lift, was achieved in 88% of 50 trials. Failure modes included VLM rod hallucination (4%), depth-segmentation misalignment (5%), and challenges with translucent/specular objects (3%).
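For reference, a minimal sketch of the 25% IoU rectangle check underlying the benchmark numbers above; `shapely` is just one convenient way to intersect rotated rectangles, and common benchmark implementations additionally require the predicted orientation to lie within 30° of the ground truth, which is omitted here since the text cites only the IoU threshold.

```python
import numpy as np
from shapely.geometry import Polygon     # used only to intersect rotated rectangles

def rect_corners(x, y, theta, w, h):
    """Corner coordinates of a grasp rectangle (x, y, theta, w, h)."""
    local = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return local @ R.T + np.array([x, y])

def grasp_iou(pred, gt):
    """Intersection-over-union between two (x, y, theta, w, h) grasp rectangles."""
    a, b = Polygon(rect_corners(*pred)), Polygon(rect_corners(*gt))
    inter = a.intersection(b).area
    return inter / (a.area + b.area - inter + 1e-9)

def is_success(pred, gt_rects, iou_threshold=0.25):
    """A prediction counts as correct if it overlaps any ground-truth rectangle by >= 25% IoU."""
    return any(grasp_iou(pred, gt) >= iou_threshold for gt in gt_rects)
```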

6. Advantages, Limitations, and Future Directions

VLAD-Grasp introduces a training-free, data-agnostic grasp detection pipeline that harnesses geometric and commonsense priors within large vision–language models. The key advantages are:

  • Zero-shot Transfer: Handles novel objects and environments without re-training or dataset curation.
  • Explicit Geometric Constraint: The use of a virtual rod as a grasp proxy directly encodes an interpretable, antipodal grasp hypothesis.
  • Competitive Performance: Exceeds or matches most supervised baselines under identical metrics and transfers directly to physical hardware.

Limitations of the approach include:

  • Generation Failures: Occasional misinterpretations of object geometry or prompt by the VLM.
  • Dependency on Third-Party Perception: Off-the-shelf depth and segmentation are susceptible to noise, particularly with transparent or reflective materials.
  • Inference Cost: The use of large generative models incurs higher latency (seconds per sample) compared to optimized CNN-based detectors.

Future work is focused on integrating lightweight, high-throughput multimodal generators with improved prompt adherence, employing multi-view input or temporal consistency for more robust perception, and developing prompt-tuning or synthetic “rod-impaled” supervision to further stabilize grasp proposals.

VLAD-Grasp demonstrates that a vision–language prompting and 3D geometric alignment pipeline enables high-performance, zero-shot antipodal grasp detection in both benchmark and real-world robotic settings, obviating the need for task-specific data or retraining.
