Volumetric Grasping Network (VGN)
- The paper presents a novel end-to-end framework that predicts grasp quality, orientation, and width directly from volumetric TSDF inputs.
- It employs a fully convolutional encoder-decoder architecture with trilinear upsampling to efficiently capture spatial features for 6 DOF grasp synthesis.
- It achieves near-real-time performance (∼10 ms inference on a GPU) and demonstrates robust sim-to-real transfer, clearing 92% of objects in real cluttered tabletop scenes.
The Volumetric Grasping Network (VGN) is an end-to-end, real-time, six degree-of-freedom (6 DOF) grasp synthesis network designed to predict robust robotic grasps from 3D scene volumes. VGN operates directly on a full-scene volumetric representation constructed from depth sensing, outputting at each voxel a grasp success probability (termed "quality"), a gripper orientation (as a unit quaternion), and an associated gripper opening width. The method achieves collision-aware grasping in cluttered environments without relying on explicit collision checking. VGN demonstrates strong sim-to-real transfer, high throughput (∼10 ms inference on commodity GPUs), and real-world efficacy, successfully clearing 92% of objects in tabletop clutter on a physical Franka Panda system with a wrist-mounted camera.
1. Volumetric Input Representation
VGN utilizes a Truncated Signed Distance Function (TSDF) to encode the geometry of the robot workspace. The workspace is a cubic volume of side length $l$ (typically 30 cm), discretized into an $N \times N \times N$ voxel grid with $N = 40$, yielding a voxel size of $v = l/N = 7.5$ mm. TSDF values are integrated from depth images collected from a wrist-mounted depth camera at randomized spherical viewpoints using standard volumetric fusion techniques (e.g., Open3D).
Mathematically, for a spatial location $\mathbf{x} \in \mathbb{R}^3$ with signed distance to the nearest surface $d(\mathbf{x})$, the TSDF is defined as
$$\Phi(\mathbf{x}) = \operatorname{clamp}\!\left(\frac{d(\mathbf{x})}{\delta},\, -1,\, 1\right),$$
where $\delta$ is a truncation distance parameter, and discretization over the voxel grid yields the input volume $V \in \mathbb{R}^{N \times N \times N}$.
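For concreteness, a minimal NumPy sketch of the truncation step (assuming a precomputed signed-distance grid `sdf`; in the actual pipeline the TSDF is built incrementally from depth images via volumetric fusion, e.g., with Open3D):

```python
import numpy as np

def truncate_sdf(sdf: np.ndarray, trunc_dist: float) -> np.ndarray:
    """Scale a signed-distance grid by the truncation distance and clamp to
    [-1, 1], mirroring the TSDF definition above."""
    return np.clip(sdf / trunc_dist, -1.0, 1.0)

# Hypothetical example: a 40^3 grid of signed distances (metres) for a 30 cm cube.
N, length = 40, 0.3
sdf = np.random.uniform(-0.05, 0.05, size=(N, N, N))  # placeholder distances
V = truncate_sdf(sdf, trunc_dist=4 * length / N)       # truncate at a few voxel widths
```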
2. Network Architecture
The VGN architecture is a fully convolutional 3D network that maps the input TSDF volume $V$ to dense per-voxel output maps $(Q, R, W)$, composed of:
- Encoder ("perception stem"):
- Three 3D convolutional layers with stride 2 and output channels 16 → 32 → 64, each followed by a ReLU activation, producing a 64-channel feature volume at 1/8 of the input spatial resolution.
- Decoder with Upsampling:
- Three blocks, each consisting of Conv3D (padding 1, fixed channel size), followed by ReLU and 2× trilinear upsampling.
- Output Heads (all convolutions):
- Grasp Quality Head: Outputs a sigmoid-activated scalar per voxel (quality map $Q \in [0,1]^{N \times N \times N}$), representing grasp success probability.
- Gripper Orientation Head: Outputs a 4-D vector per voxel (orientation map $R$), normalized to a unit quaternion.
- Gripper Width Head: Outputs a scalar value per voxel (width map $W$), linearly encoded.
All layers are convolutional; no fully-connected layers are employed. Strided convolutions reduce spatial dimensionality for computational efficiency, and upsampling is performed via trilinear interpolation.
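As an illustration, here is a minimal PyTorch sketch of this layout; channel widths, layer counts, and the trilinear upsampling follow the description above, while kernel sizes, padding, and the 1×1×1 head convolutions are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VGNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: three stride-2 convolutions, 16 -> 32 -> 64 channels.
        self.enc = nn.Sequential(
            nn.Conv3d(1, 16, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: three Conv3d + ReLU blocks, each followed by 2x trilinear upsampling.
        self.dec = nn.ModuleList([nn.Conv3d(64, 64, 3, padding=1) for _ in range(3)])
        # Output heads (1x1x1 convolutions assumed here).
        self.quality_head = nn.Conv3d(64, 1, 1)
        self.rotation_head = nn.Conv3d(64, 4, 1)
        self.width_head = nn.Conv3d(64, 1, 1)

    def forward(self, tsdf):                          # tsdf: (B, 1, N, N, N)
        x = self.enc(tsdf)                            # (B, 64, N/8, N/8, N/8)
        for conv in self.dec:
            x = F.interpolate(F.relu(conv(x)), scale_factor=2,
                              mode="trilinear", align_corners=False)
        q = torch.sigmoid(self.quality_head(x))       # grasp quality in [0, 1]
        r = F.normalize(self.rotation_head(x), dim=1)  # unit quaternions per voxel
        w = self.width_head(x)                         # gripper width (linear)
        return q, r, w
```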
3. Grasp Parameterization and Loss Functions
Each grasp candidate is defined as $g = (\mathbf{t}, \mathbf{r}, w)$ with label $y$, where:
- $\mathbf{t} \in \mathbb{R}^3$: Gripper position,
- $\mathbf{r} \in \mathbb{S}^3$: Gripper orientation (unit quaternion),
- $w \in \mathbb{R}$: Gripper opening width,
- $y \in \{0, 1\}$: Ground-truth grasp success.
The multi-task loss per ground-truth voxel is
$$\mathcal{L} = \mathcal{L}_q(\hat{q}, y) + y \left( \mathcal{L}_r(\hat{\mathbf{r}}, \mathbf{r}) + \mathcal{L}_w(\hat{w}, w) \right),$$
so the orientation and width terms are supervised only at positive (successful) grasps, with:
- Quality loss ($\mathcal{L}_q$): Binary cross entropy between the predicted quality $\hat{q}$ and the label $y$,
- Quaternion loss ($\mathcal{L}_r$): Symmetry-aware, accounting for the gripper's 180° wrist symmetry:
$$\mathcal{L}_r(\hat{\mathbf{r}}, \mathbf{r}) = \min\big(1 - |\hat{\mathbf{r}} \cdot \mathbf{r}|,\; 1 - |\hat{\mathbf{r}} \cdot \mathbf{r}_\pi|\big),$$
where $\mathbf{r}_\pi$ is $\mathbf{r}$ rotated by 180° about the gripper's wrist axis,
- Width loss ($\mathcal{L}_w$): Squared error, $\mathcal{L}_w = (\hat{w} - w)^2$.
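A minimal PyTorch sketch of this multi-task loss, assuming the per-voxel predictions and labels have already been gathered into flat batches (tensor shapes, names, and the mean reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def vgn_loss(q_pred, r_pred, w_pred, y, r_gt, r_gt_flipped, w_gt):
    """Multi-task grasp loss over a batch of supervised voxels.

    q_pred: (B,) predicted quality in [0, 1]; r_pred, r_gt, r_gt_flipped: (B, 4)
    unit quaternions (r_gt_flipped is r_gt rotated 180 degrees about the wrist
    axis); w_pred, w_gt: (B,) widths; y: (B,) float binary success labels.
    """
    # Quality: binary cross entropy against the success label.
    loss_q = F.binary_cross_entropy(q_pred, y)
    # Orientation: symmetry-aware quaternion distance, taking the closer of
    # the two equivalent ground-truth orientations.
    d1 = 1.0 - torch.abs((r_pred * r_gt).sum(dim=1))
    d2 = 1.0 - torch.abs((r_pred * r_gt_flipped).sum(dim=1))
    loss_r = torch.min(d1, d2)
    # Width: squared error.
    loss_w = (w_pred - w_gt) ** 2
    # Orientation and width are only supervised at positive grasps.
    return loss_q + (y * (loss_r + loss_w)).mean()
```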
4. Training Procedure
Training data are generated in PyBullet-based simulation. Two scene types are used ("pile" and "packed"), with a varying number of objects per scene and object meshes drawn from a set of 303 CAD models. TSDF volumes are reconstructed from synthetic depth views. Per scene, 120 grasp candidates are sampled by selecting surface points and their normals and testing 6 discrete gripper rotations; success or failure is labeled via simulated grasp execution, and the grasp width and orientation are recorded.
The final dataset comprises approximately 2 million labeled grasps, with balanced coverage over grasp types (including both top-down and side grasps). Data augmentation applies random multiples of 90° rotations about the gravity axis and small vertical offsets, as sketched below. The network is trained with the Adam optimizer (fixed learning rate, batch size 32, 10 epochs).
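A minimal sketch of the rotation-about-gravity augmentation, applied jointly to the TSDF volume and the grasp annotations (axis conventions, the workspace-centre pivot, and the helper names are assumptions):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def augment_scene(tsdf, positions, quats, workspace_size=0.3):
    """Rotate a TSDF volume and its grasp annotations by a random multiple of
    90 degrees about the vertical (gravity) axis.

    tsdf: (N, N, N) array indexed (x, y, z); positions: (M, 3) in metres;
    quats: (M, 4) unit quaternions in scipy's (x, y, z, w) convention.
    """
    k = np.random.randint(4)                      # 0, 90, 180, or 270 degrees
    tsdf_aug = np.rot90(tsdf, k=k, axes=(0, 1))   # rotate the grid about z
    rz = R.from_euler("z", 90 * k, degrees=True)
    centre = np.array([workspace_size / 2, workspace_size / 2, 0.0])
    positions_aug = rz.apply(positions - centre) + centre
    quats_aug = (rz * R.from_quat(quats)).as_quat()
    return tsdf_aug, positions_aug, quats_aug
```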
5. Inference Pipeline
The inference procedure comprises the following steps:
```
Input: TSDF volume V (N^3) from live depth integration

(Q, R, W) = VGN(V)                  # Q: quality map, R: quaternion map, W: width map
Q = conv3d(Q, Gaussian3×3×3)        # smooth the quality map
mask = (V > finger_depth)           # remove grasps too close to surfaces
Q = Q * mask
candidates = {(i, Q[i]) | Q[i] > ε}
peaks = non_max_suppression(candidates)
for each voxel index i in peaks:
    t_i = voxel_to_world(i)
    r_i = R[i]                      # unit quaternion
    w_i = W[i] * v                  # convert width back to metres (v = voxel size)
    score = Q[i]
    add grasp (t_i, r_i, w_i, score)
return sorted grasp list
```
No explicit collision checker is invoked; the TSDF representation encodes the geometry, enabling VGN to learn collision-avoiding grasps implicitly. This implicit reasoning about free space is a distinctive property. The inference time per forward pass is about 10 ms on a GTX 1080 Ti GPU.
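The smoothing, masking, and peak-selection steps can be sketched with SciPy as follows; the threshold value, the surface-mask rule (mirroring the pseudocode's `V > finger_depth`), and the maximum-filter-based non-maximum suppression are illustrative assumptions, not the authors' exact post-processing:

```python
import numpy as np
from scipy import ndimage

def select_grasps(quality, tsdf, threshold=0.90, finger_depth=0.05,
                  voxel_size=0.0075):
    """Smooth the quality map, mask voxels too close to surfaces, and keep
    local maxima above the threshold. Returns voxel indices, world positions,
    and scores, sorted by descending quality."""
    q = ndimage.gaussian_filter(quality, sigma=1.0)     # smooth quality map
    q = np.where(tsdf > finger_depth, q, 0.0)           # surface mask (per pseudocode)
    # Non-maximum suppression: keep voxels that are the maximum of their
    # local 3x3x3 neighbourhood and exceed the quality threshold.
    local_max = ndimage.maximum_filter(q, size=3)
    peaks = np.argwhere((q == local_max) & (q > threshold))
    positions = (peaks + 0.5) * voxel_size              # voxel centre -> metres
    scores = q[tuple(peaks.T)]
    order = np.argsort(-scores)
    return peaks[order], positions[order], scores[order]
```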
6. Empirical Performance
Performance is evaluated both in simulation (PyBullet) and on real robotic hardware (Franka Panda with wrist-mounted RealSense D435, CPU-only inference).
Simulation Results (each cell: grasp success rate % / % of objects cleared; m = number of objects per scene):

| Method | Blocks | Pile | Packed | Blocks (m=10) | Pile (m=10) |
|---|---|---|---|---|---|
| GPD | 88.6 / 39.4 | 59.9 / 26.1 | 73.7 / 72.8 | 87.7 / 24.8 | 63.1 / 17.0 |
| VGN (ε=0.95) | 89.5 / 85.9 | 65.4 / 41.6 | 91.5 / 79.0 | 85.3 / 66.7 | 59.4 / 25.1 |
| VGN (ε=0.90) | 87.6 / 90.1 | 62.3 / 46.4 | 87.6 / 80.4 | 82.5 / 77.6 | 59.3 / 34.6 |
| VGN (ε=0.80) | 85.8 / 89.5 | 59.8 / 51.1 | 84.0 / 79.9 | 78.4 / 69.0 | 52.8 / 30.1 |
- VGN consistently clears more objects than GPD, particularly in "packed" scenes.
- The choice of the quality threshold ε allows trading off between individual grasp success rate and overall scene clearance; ε=0.90 balances precision and recall.
- Inference time is ∼10 ms (VGN, on GPU) vs. ∼1.2 s (GPD).
Real-World Results:
- In 10 table-clearing rounds (6 objects per round, 68 total grasps): 80% grasp success rate, 92% of objects cleared.
- Failure cases primarily involve low-friction cylindrical objects and collisions resulting from an insufficiently conservative geometry encoding.
- CPU-only inference takes ∼1.25 s per frame; this is expected to drop to the ∼10 ms GPU timing when a GPU is available.
7. Limitations and Prospective Extensions
Failure scenarios include very thin or low-friction objects, which expose sim-to-real mismatch and lead to overconfident grasp proposals, as well as scenes with heavy occlusion or extreme clutter that exceed the diversity seen during training. Potential improvements outlined include adversarial physics (randomized friction), domain randomization, few-shot fine-tuning on real data, and extending the core model to closed-loop dynamic grasping (leveraging the ∼10 ms inference for real-time feedback).
Extensions to multi-fingered hands could be realized by replacing the width head with a more expressive hand-pose head; multi-agent or multi-view scenarios would require coordinated TSDF fusion from multiple sensors or robots for larger workspaces.
VGN provides a geometry-aware, fully convolutional framework for 6 DOF grasp detection in clutter—achieving near-real-time performance and robust sim-to-real transfer without explicit collision checking.