VLAD-Grasp: Zero-Shot Robotic Grasping

Updated 15 November 2025
  • The paper introduces VLAD-Grasp, a zero-shot vision–language framework that detects antipodal grasps by synthesizing a geometric goal image and aligning 3D point clouds.
  • It uses a three-stage prompting pipeline with a pretrained VLM to generate a rod-impaled image, then lifts the representation to 3D via depth prediction and segmentation.
  • The method achieves competitive benchmark performance and robust real-world generalization, despite inference costs and occasional generation pitfalls.

VLAD-Grasp is a vision–language-model-assisted zero-shot approach for detecting antipodal robotic grasps without task-specific training or grasp dataset curation. The core innovation is to leverage large vision–language models (VLMs) to synthesize a geometric “goal image” in which a straight rod cleanly “impales” the target object; the rod’s axis encodes an executable antipodal grasp. VLAD-Grasp then lifts the synthesized representation to 3D using depth and segmentation, and solves for a grasp pose in the observed scene by point cloud alignment. This paradigm achieves performance competitive with or superior to state-of-the-art supervised methods on standard benchmarks and generalizes to novel objects in real-world trials, demonstrating the power of pretrained vision–language priors in robotic manipulation.

1. Vision–Language-Driven Grasp Synthesis

VLAD-Grasp recasts grasp pose generation as an image editing problem to be solved by a pretrained VLM, specifically GPT-5 with image-conditioned generation capability. The procedure comprises a three-stage prompting pipeline:

  • Constraint Prompt ($T_0^g$): The original RGB crop of the object ($I_S$) is provided with a prompt detailing the physical constraints (e.g., “The gripper is a two-finger parallel jaw, maximum opening 8 cm; the rod must be visible end-to-end; no other part of the scene should be altered.”).
  • Reasoning Prompt ($T_1^g$): The model introspects over $T_0^g$ and $I_S$, outputting a single-sentence instruction to produce an image of a straight rod passing exactly through two antipodal object contact points.
  • Generation Prompt ($T_2^g$ and $T_c^g$): The initial crop, geometric prompt, and further textual constraints (e.g., preserve texture and lighting, do not affect the background, insert only a thin rod) are concatenated, along with an inpainting mask $M_S^b$ that only allows editing of the object region. The VLM’s multimodal generation API, $p_\theta(I_G \mid I_S, T_2^g, T_c^g, M_S^b)$, produces a $512 \times 512$ image $I_G$ in which a slender rod visually “impales” the object.

This process encodes the desired grasp as an explicit geometric intervention, allowing the pose to be inferred by subsequent geometric computation rather than learned from prior annotated grasp data.
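The pipeline can be sketched as follows. This is a minimal illustration only: `vlm_chat` and `vlm_edit_image` are hypothetical wrappers around a multimodal chat/image-editing API, and the prompt strings paraphrase the constraints above rather than reproduce the paper's exact prompts.

```python
# Minimal sketch of the three-stage prompting pipeline (Section 1).
# `vlm_chat` and `vlm_edit_image` are hypothetical callables (assumptions),
# not the paper's actual interface.

def synthesize_goal_image(vlm_chat, vlm_edit_image, object_crop, object_mask):
    """Return a 512x512 goal image I_G with a thin rod 'impaling' the object."""
    # Stage 1: constraint prompt T_0^g -- physical constraints of the gripper.
    t0 = ("The gripper is a two-finger parallel jaw, maximum opening 8 cm; "
          "the rod must be visible end-to-end; "
          "no other part of the scene should be altered.")

    # Stage 2: reasoning prompt T_1^g -- the VLM introspects over T_0^g and I_S
    # and emits a single-sentence editing instruction.
    t1 = vlm_chat(
        image=object_crop,
        prompt=t0 + "\nIn one sentence, describe an image edit that inserts a "
                    "straight rod passing exactly through two antipodal "
                    "contact points of the object.",
    )

    # Stage 3: generation prompts T_2^g + T_c^g -- image-conditioned generation,
    # restricted to the object region by the inpainting mask M_S^b.
    t_c = ("Preserve texture, lighting, and background; "
           "insert only a thin straight rod.")
    goal_image = vlm_edit_image(
        image=object_crop,
        mask=object_mask,          # editable region M_S^b
        prompt=t1 + "\n" + t_c,
        size=(512, 512),
    )
    return goal_image
```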

2. 3D Lifting via Depth and Segmentation

The generated goal image $I_G$ is lifted to 3D using two off-the-shelf computer vision models, neither of which is trained or fine-tuned for this task:

  • Monocular Depth Prediction: ML-Depth-Pro predicts a dense metric depth map $D_G$ from $I_G$ using a ViT-based architecture. The original network was trained on real and synthetic data with an $\ell_1$ loss ($L_{depth} = \sum_{u,v} |D_G(u,v) - \hat{D}(u,v)|$), but no retraining or adaptation is performed.
  • Instance Segmentation: The Segment Anything Model (SAM) produces binary masks for the object ($M_G^o$) and the rod ($M_G^r$), using canonical prompt-based segmentation.

Both models are used strictly as pretrained components, introducing no new loss, weights, or adaptations in VLAD-Grasp.
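As an illustration of how these two pretrained components could be wired together, the sketch below uses Apple's `depth_pro` package and the `segment_anything` library. The point-prompt locations for the object and rod are assumptions (the text says only that prompt-based segmentation is used), and the API usage follows the public repositories and may differ slightly by release.

```python
import numpy as np
import depth_pro                                    # Apple's ML-Depth-Pro package
from segment_anything import sam_model_registry, SamPredictor

def lift_goal_image(goal_image_rgb, object_point_xy, rod_point_xy, sam_checkpoint):
    """Lift the generated goal image I_G to a depth map D_G and masks M_G^o, M_G^r.

    goal_image_rgb: HxWx3 uint8 RGB array; object_point_xy / rod_point_xy are
    (x, y) pixel prompts for SAM (how the prompts are chosen is an assumption).
    """
    # --- monocular metric depth, pretrained weights only, no adaptation ---
    depth_model, transform = depth_pro.create_model_and_transforms()
    depth_model.eval()
    prediction = depth_model.infer(transform(goal_image_rgb))
    depth_g = prediction["depth"].detach().cpu().numpy()   # dense metric depth D_G

    # --- prompt-based instance segmentation with the Segment Anything Model ---
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(goal_image_rgb)

    def best_mask(point_xy):
        masks, scores, _ = predictor.predict(
            point_coords=np.array([point_xy]), point_labels=np.array([1]))
        return masks[np.argmax(scores)]                     # best-scoring binary mask

    mask_obj = best_mask(object_point_xy)                   # object mask M_G^o
    mask_rod = best_mask(rod_point_xy)                      # rod mask M_G^r
    return depth_g, mask_obj, mask_rod
```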

3. Point Cloud Construction and Alignment

To enable grasp planning, the masks and predicted depth maps are converted to 3D point clouds via calibrated pinhole back-projection:

  • Back-Projection: For each foreground pixel $(u_i, v_i)$ with $M^o(u_i, v_i) = 1$, a 3D point is computed as

$$p_i = K^{-1} [u_i, v_i, 1]^\top \cdot D_G(u_i, v_i)$$

where $K$ denotes the $3 \times 3$ camera intrinsics matrix. Two point clouds are created:
  • $P_G^o \subset \mathbb{R}^3$: the generated object point cloud.
  • $P_S^o \subset \mathbb{R}^3$: the observed scene object point cloud, built identically from the real RGB-D frame and mask $M_S^o$.

The rod point cloud $P_G^r$ is similarly computed from $(D_G, M_G^r)$.
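A minimal NumPy sketch of this back-projection, assuming a standard pinhole intrinsics matrix `K` and the masks and depth maps defined above:

```python
import numpy as np

def backproject(depth, mask, K):
    """Back-project masked pixels to 3D: p_i = K^{-1} [u_i, v_i, 1]^T * D(u_i, v_i)."""
    v_idx, u_idx = np.nonzero(mask)                  # pixels with M(u, v) = 1
    d = depth[v_idx, u_idx]                          # metric depths at those pixels
    pix_h = np.stack([u_idx, v_idx, np.ones_like(u_idx)], axis=0).astype(np.float64)
    rays = np.linalg.inv(K) @ pix_h                  # unscaled camera rays, shape (3, N)
    return (rays * d).T                              # (N, 3) point cloud in camera frame

# P_G^o and P_G^r come from the generated image; P_S^o from the observed RGB-D frame:
# P_G_o = backproject(depth_g, mask_obj, K)
# P_G_r = backproject(depth_g, mask_rod, K)
# P_S_o = backproject(depth_scene, mask_scene_obj, K)
```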

Alignment between the generated and real object point clouds is achieved in two steps:

  • PCA-Based Orientation Alignment: Both point clouds are centered and their covariance matrices $\Sigma$ decomposed. For each sign-flip or axis-swap triplet $(i,j,k) \in \{\pm 1\}^3$, the candidate rotation matrix is formed as

$$R^{i,j,k} = [v_1^S, v_2^S, v_3^S] \cdot \operatorname{diag}\left(i\sqrt{\lambda_1^G/\lambda_1^S}, \; j\sqrt{\lambda_2^G/\lambda_2^S}, \; k\sqrt{\lambda_3^G/\lambda_3^S}\right) \cdot [v_1^G, v_2^G, v_3^G]^{-1}$$

  • Chamfer Distance Refinement: The optimal rotation is selected by minimizing the Chamfer distance,

$$\operatorname{CD}(P, Q) = \sum_{p\in P} \min_{q\in Q} \|p-q\|^2 + \sum_{q\in Q} \min_{p\in P} \|q-p\|^2$$

over candidate sign patterns. The resulting rigid-body transform $T_{S \gets G}$, parameterized as

$$T_{S \gets G} = T_{trans}(\mu_S) \cdot \begin{bmatrix} R^* & 0 \\ 0 & 1 \end{bmatrix} \cdot T_{trans}(-\mu_G)$$

(with $R^*$ the optimal rotation and $\mu_S, \mu_G$ the centroids), aligns the rod and object geometry for grasp pose recovery.
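The two alignment steps can be sketched as follows. This follows the $R^{i,j,k}$ formula above literally (including the $\sqrt{\lambda^G/\lambda^S}$ scale ratios and the eight sign patterns; any axis-swap candidates the paper may use are omitted), uses SciPy KD-trees for the Chamfer terms, and is an illustrative reimplementation rather than the authors' code.

```python
import itertools
import numpy as np
from scipy.spatial import cKDTree

def chamfer(P, Q):
    """Symmetric squared Chamfer distance between two (N, 3) point clouds."""
    d_pq, _ = cKDTree(Q).query(P)       # nearest neighbour in Q for each p in P
    d_qp, _ = cKDTree(P).query(Q)       # nearest neighbour in P for each q in Q
    return np.sum(d_pq ** 2) + np.sum(d_qp ** 2)

def align_pca_chamfer(P_G, P_S):
    """Estimate T_{S<-G} mapping the generated object cloud onto the scene cloud."""
    mu_G, mu_S = P_G.mean(axis=0), P_S.mean(axis=0)
    # Eigen-decomposition of the centered covariance matrices (ascending eigenvalues).
    lam_G, V_G = np.linalg.eigh(np.cov((P_G - mu_G).T))
    lam_S, V_S = np.linalg.eigh(np.cov((P_S - mu_S).T))

    best_T, best_cd = None, np.inf
    for signs in itertools.product([1.0, -1.0], repeat=3):   # (i, j, k) in {+-1}^3
        # Candidate R^{i,j,k} as in the formula above.
        D = np.diag(np.array(signs) * np.sqrt(lam_G / lam_S))
        R = V_S @ D @ np.linalg.inv(V_G)
        # Compose T_trans(mu_S) . [R 0; 0 1] . T_trans(-mu_G) as a 4x4 transform.
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = mu_S - R @ mu_G
        aligned = (R @ (P_G - mu_G).T).T + mu_S
        cd = chamfer(aligned, P_S)      # keep the candidate with the lowest Chamfer distance
        if cd < best_cd:
            best_T, best_cd = T, cd
    return best_T

# Usage: T_SG = align_pca_chamfer(P_G_o, P_S_o)
# Rod in the scene frame: P_S_r = (T_SG[:3, :3] @ P_G_r.T).T + T_SG[:3, 3]
```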

4. Grasp Pose Localization

Following point cloud alignment, the rod is transformed into the scene’s coordinate frame: $P_S^r = T_{S\gets G} \cdot P_G^r$. Projecting $P_S^r$ into the (scene) image yields a mask $M_S^r$, within which the antipodal contacts are inferred and mapped to a standard 5D grasp parameterization $(x, y, \theta, w, h)$:

  • The end points of the largest gap in the rod’s mask define the antipodal contact pixels $(u_1, v_1)$ and $(u_2, v_2)$.
  • The grasp axis $\theta$ is given by $\arctan((v_2 - v_1)/(u_2 - u_1))$.
  • The pixel-wise gap length, mapped through depth and $K^{-1}$, yields the grasp width $w$.
  • The grasp “center” is back-projected to 3D:

$$p_c = K^{-1}[u_c, v_c, 1]^\top d_c$$

  • The approach vector is fixed as $a = (0, 0, -1)$; the finger axis is normalized from $K^{-1}[\cos\theta, \sin\theta, 0]^\top$, and the lateral axis is given by $a \times f$.

These axes define the $3 \times 3$ orientation for the gripper's camera-frame pose, with $p_c$ as translation.
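A sketch of this final pose recovery, given the two contact pixels extracted from $M_S^r$: the rounding of pixel indices, the use of `arctan2`, and the column ordering of the rotation matrix are illustrative choices under stated assumptions, not details taken from the paper.

```python
import numpy as np

def grasp_pose_from_contacts(contact_1, contact_2, depth_scene, K):
    """Map two antipodal contact pixels to a camera-frame grasp pose.

    contact_1, contact_2: (u, v) pixel coordinates of the rod-gap end points.
    """
    (u1, v1), (u2, v2) = contact_1, contact_2
    uc, vc = (u1 + u2) / 2.0, (v1 + v2) / 2.0        # grasp center pixel (u_c, v_c)
    theta = np.arctan2(v2 - v1, u2 - u1)             # arctan((v2 - v1)/(u2 - u1)), robust branch

    K_inv = np.linalg.inv(K)
    d_at = lambda u, v: depth_scene[int(round(v)), int(round(u))]
    # Back-project contacts and center; grasp width w is the metric contact gap.
    p1 = K_inv @ np.array([u1, v1, 1.0]) * d_at(u1, v1)
    p2 = K_inv @ np.array([u2, v2, 1.0]) * d_at(u2, v2)
    w = float(np.linalg.norm(p1 - p2))
    p_c = K_inv @ np.array([uc, vc, 1.0]) * d_at(uc, vc)   # p_c = K^{-1}[u_c, v_c, 1]^T d_c

    # Orientation: fixed top-down approach a, finger axis f from theta, lateral a x f.
    a = np.array([0.0, 0.0, -1.0])
    f = K_inv @ np.array([np.cos(theta), np.sin(theta), 0.0])
    f = f / np.linalg.norm(f)
    lateral = np.cross(a, f)
    R = np.column_stack([f, lateral, a])             # approximately orthonormal 3x3 orientation
    return p_c, R, theta, w
```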

5. Benchmarking and Experimental Evaluation

VLAD-Grasp’s performance was quantitatively evaluated on the Cornell and Jacquard benchmarks using the standard 25% IoU metric. Table 1 summarizes grasp success rates as mean $\pm$ standard deviation:

Method                    Cornell (%)      Jacquard (%)
GR-ConvNet                72.14 ± 41.19    59.62 ± 46.13
GG-CNN                    74.28 ± 40.30    71.48 ± 42.76
SE-ResUNet                86.07 ± 28.65    88.14 ± 30.43
GraspSAM                  67.50 ± 44.19    73.71 ± 40.95
LGD w/ Query              37.98 ± 44.81    24.40 ± 41.39
LGD w/o Query             32.26 ± 42.47    24.40 ± 41.39
VLAD-Grasp (zero-shot)    91.43 ± 28.00    85.43 ± 36.15

Despite the absence of task-specific training, VLAD-Grasp achieves the highest mean success rate on Cornell and the second-highest on Jacquard. For real-world zero-shot trials on a Franka Emika Panda with a 2-finger parallel jaw gripper and wrist-mounted Orbbec Femto Mega RGB-D camera, the protocol entailed 5 grasps per object across 10 previously unseen household objects. Success, defined as the object remaining in-gripper after a 5 s vertical lift, was achieved in 88% of 50 trials. Failure modes included VLM rod hallucination (4%), depth-segmentation misalignment (5%), and challenges with translucent/specular objects (3%).
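For reference, a minimal sketch of the 25% IoU rectangle check underlying the benchmark numbers above; `shapely` is just one convenient way to intersect rotated rectangles, and common benchmark implementations additionally require the predicted orientation to lie within 30° of the ground truth, which is omitted here since the text cites only the IoU threshold.

```python
import numpy as np
from shapely.geometry import Polygon     # used only to intersect rotated rectangles

def rect_corners(x, y, theta, w, h):
    """Corner coordinates of a grasp rectangle (x, y, theta, w, h)."""
    local = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return local @ R.T + np.array([x, y])

def grasp_iou(pred, gt):
    """Intersection-over-union between two (x, y, theta, w, h) grasp rectangles."""
    a, b = Polygon(rect_corners(*pred)), Polygon(rect_corners(*gt))
    inter = a.intersection(b).area
    return inter / (a.area + b.area - inter + 1e-9)

def is_success(pred, gt_rects, iou_threshold=0.25):
    """A prediction counts as correct if it overlaps any ground-truth rectangle by >= 25% IoU."""
    return any(grasp_iou(pred, gt) >= iou_threshold for gt in gt_rects)
```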

6. Advantages, Limitations, and Future Directions

VLAD-Grasp introduces a training-free, data-agnostic grasp detection pipeline that harnesses geometric and commonsense priors within large vision–language models. The key advantages are:

  • Zero-shot Transfer: Handles novel objects and environments without re-training or dataset curation.
  • Explicit Geometric Constraint: The use of a virtual rod as a grasp proxy directly encodes an interpretable, antipodal grasp hypothesis.
  • Competitive Performance: Exceeds or matches most supervised baselines under identical metrics and transfers directly to physical hardware.

Limitations of the approach include:

  • Generation Failures: Occasional misinterpretations of object geometry or prompt by the VLM.
  • Dependency on Third-Party Perception: Off-the-shelf depth and segmentation are susceptible to noise, particularly with transparent or reflective materials.
  • Inference Cost: The use of large generative models incurs higher latency (seconds per sample) compared to optimized CNN-based detectors.

Future work is focused on integrating lightweight, high-throughput multimodal generators with improved prompt adherence, employing multi-view input or temporal consistency for more robust perception, and developing prompt-tuning or synthetic “rod-impaled” supervision to further stabilize grasp proposals.

VLAD-Grasp demonstrates that a vision–language prompting and 3D geometric alignment pipeline enables high-performance, zero-shot antipodal grasp detection in both benchmark and real-world robotic settings, obviating the need for task-specific data or retraining.
