Volumetric Grasping Network (VGN)
- The paper presents a novel end-to-end framework that predicts grasp quality, orientation, and width directly from volumetric TSDF inputs.
- It employs a fully convolutional encoder-decoder architecture with trilinear upsampling to efficiently capture spatial features for 6 DOF grasp synthesis.
- It achieves near-real-time performance (∼10 ms inference on a GPU) and demonstrates robust sim-to-real transfer, clearing 92% of objects in real cluttered tabletop scenes.
The Volumetric Grasping Network (VGN) is an end-to-end, real-time, six degree-of-freedom (6 DOF) grasp synthesis network designed to predict robust robotic grasps from 3D scene volumes. VGN operates directly on a full-scene volumetric representation constructed from depth sensing, outputting at each voxel a grasp success probability (termed "quality"), a gripper orientation (as a unit quaternion), and an associated gripper opening width. The method achieves collision-aware grasping in cluttered environments without relying on explicit collision checking. VGN demonstrates strong sim-to-real transfer, high throughput (∼10 ms inference on commodity GPUs), and real-world efficacy, successfully clearing 92% of objects in tabletop clutter on a physical Franka Panda system with a wrist-mounted camera.
1. Volumetric Input Representation
VGN utilizes a Truncated Signed Distance Function (TSDF) to encode the geometry of the robot workspace. The workspace is a cubic volume of side length $l$ (typically 30 cm), discretized into an $N \times N \times N$ voxel grid with $N = 40$, yielding a voxel size of $v = l/N = 7.5$ mm. TSDF values are integrated from depth images collected from a wrist-mounted depth camera at randomized spherical viewpoints using standard volumetric fusion techniques (e.g., Open3D).
Mathematically, for a spatial location $\mathbf{x} \in \mathbb{R}^3$ with signed distance to the nearest surface $d(\mathbf{x})$, the TSDF is defined as
$$\Phi(\mathbf{x}) = \operatorname{clamp}\!\left(\frac{d(\mathbf{x})}{\delta},\, -1,\, 1\right),$$
where $\delta$ is a truncation distance parameter, and discretization over the voxel grid yields the input volume $V \in \mathbb{R}^{N \times N \times N}$.
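For concreteness, a minimal NumPy sketch of the truncation step (assuming a precomputed signed-distance grid `sdf`; in the actual pipeline the TSDF is built incrementally from depth images via volumetric fusion, e.g., with Open3D):

```python
import numpy as np

def truncate_sdf(sdf: np.ndarray, trunc_dist: float) -> np.ndarray:
    """Scale a signed-distance grid by the truncation distance and clamp to
    [-1, 1], mirroring the TSDF definition above."""
    return np.clip(sdf / trunc_dist, -1.0, 1.0)

# Hypothetical example: a 40^3 grid of signed distances (metres) for a 30 cm cube.
N, length = 40, 0.3
sdf = np.random.uniform(-0.05, 0.05, size=(N, N, N))  # placeholder distances
V = truncate_sdf(sdf, trunc_dist=4 * length / N)       # truncate at a few voxel widths
```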
2. Network Architecture
The VGN architecture is a fully convolutional 3D network that maps the input TSDF volume $V$ to dense per-voxel output maps $(Q, R, W)$, composed of:
- Encoder ("perception stem"):
- Three 3D convolutional layers with stride 2 and output channels 16 → 32 → 64, each followed by a ReLU activation, producing a 64-channel feature volume at 1/8 of the input spatial resolution.
- Decoder with Upsampling:
- Three blocks, each consisting of Conv3D (padding 1, fixed channel size), followed by ReLU and 2× trilinear upsampling.
- Output Heads (all convolutions):
- Grasp Quality Head: Outputs a sigmoid-activated scalar per voxel (quality map $Q \in [0,1]^{N \times N \times N}$), representing grasp success probability.
- Gripper Orientation Head: Outputs a 4-D vector per voxel (orientation map $R$), normalized to a unit quaternion.
- Gripper Width Head: Outputs a scalar value per voxel (width map $W$), linearly encoded.
All layers are convolutional; no fully-connected layers are employed. Strided convolutions reduce spatial dimensionality for computational efficiency, and upsampling is performed via trilinear interpolation.
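As an illustration, here is a minimal PyTorch sketch of this layout; channel widths, layer counts, and the trilinear upsampling follow the description above, while kernel sizes, padding, and the 1×1×1 head convolutions are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VGNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: three stride-2 convolutions, 16 -> 32 -> 64 channels.
        self.enc = nn.Sequential(
            nn.Conv3d(1, 16, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: three Conv3d + ReLU blocks, each followed by 2x trilinear upsampling.
        self.dec = nn.ModuleList([nn.Conv3d(64, 64, 3, padding=1) for _ in range(3)])
        # Output heads (1x1x1 convolutions assumed here).
        self.quality_head = nn.Conv3d(64, 1, 1)
        self.rotation_head = nn.Conv3d(64, 4, 1)
        self.width_head = nn.Conv3d(64, 1, 1)

    def forward(self, tsdf):                          # tsdf: (B, 1, N, N, N)
        x = self.enc(tsdf)                            # (B, 64, N/8, N/8, N/8)
        for conv in self.dec:
            x = F.interpolate(F.relu(conv(x)), scale_factor=2,
                              mode="trilinear", align_corners=False)
        q = torch.sigmoid(self.quality_head(x))       # grasp quality in [0, 1]
        r = F.normalize(self.rotation_head(x), dim=1)  # unit quaternions per voxel
        w = self.width_head(x)                         # gripper width (linear)
        return q, r, w
```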
3. Grasp Parameterization and Loss Functions
Each grasp candidate is defined as $g = (\mathbf{t}, \mathbf{r}, w)$ with label $y$, where:
- $\mathbf{t} \in \mathbb{R}^3$: Gripper position,
- $\mathbf{r} \in \mathbb{S}^3$: Gripper orientation (unit quaternion),
- $w \in \mathbb{R}$: Gripper opening width,
- $y \in \{0, 1\}$: Ground-truth grasp success.
The multi-task loss per ground-truth voxel is
$$\mathcal{L} = \mathcal{L}_q(\hat{q}, y) + y \left( \mathcal{L}_r(\hat{\mathbf{r}}, \mathbf{r}) + \mathcal{L}_w(\hat{w}, w) \right),$$
so the orientation and width terms are supervised only at positive (successful) grasps, with:
- Quality loss ($\mathcal{L}_q$): Binary cross entropy between the predicted quality $\hat{q}$ and the label $y$,
- Quaternion loss ($\mathcal{L}_r$): Symmetry-aware, accounting for the gripper's 180° wrist symmetry:
$$\mathcal{L}_r(\hat{\mathbf{r}}, \mathbf{r}) = \min\big(1 - |\hat{\mathbf{r}} \cdot \mathbf{r}|,\; 1 - |\hat{\mathbf{r}} \cdot \mathbf{r}_\pi|\big),$$
where $\mathbf{r}_\pi$ is $\mathbf{r}$ rotated by 180° about the gripper's wrist axis,
- Width loss ($\mathcal{L}_w$): Squared error, $\mathcal{L}_w = (\hat{w} - w)^2$.
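A minimal PyTorch sketch of this multi-task loss, assuming the per-voxel predictions and labels have already been gathered into flat batches (tensor shapes, names, and the mean reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def vgn_loss(q_pred, r_pred, w_pred, y, r_gt, r_gt_flipped, w_gt):
    """Multi-task grasp loss over a batch of supervised voxels.

    q_pred: (B,) predicted quality in [0, 1]; r_pred, r_gt, r_gt_flipped: (B, 4)
    unit quaternions (r_gt_flipped is r_gt rotated 180 degrees about the wrist
    axis); w_pred, w_gt: (B,) widths; y: (B,) float binary success labels.
    """
    # Quality: binary cross entropy against the success label.
    loss_q = F.binary_cross_entropy(q_pred, y)
    # Orientation: symmetry-aware quaternion distance, taking the closer of
    # the two equivalent ground-truth orientations.
    d1 = 1.0 - torch.abs((r_pred * r_gt).sum(dim=1))
    d2 = 1.0 - torch.abs((r_pred * r_gt_flipped).sum(dim=1))
    loss_r = torch.min(d1, d2)
    # Width: squared error.
    loss_w = (w_pred - w_gt) ** 2
    # Orientation and width are only supervised at positive grasps.
    return loss_q + (y * (loss_r + loss_w)).mean()
```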
4. Training Procedure
Training data are generated in PyBullet-based simulation. Two scene types are used ("pile" and "packed"), with a varying number of objects per scene and object meshes drawn from a set of 303 CAD models. TSDF volumes are reconstructed from synthetic depth views. Per scene, 120 grasp candidates are sampled by selecting surface points and their normals and testing 6 discrete gripper rotations; success or failure is labeled via simulated grasp execution, and the grasp width and orientation are recorded.
The final dataset comprises approximately 2 million labeled grasps, with balanced coverage over grasp types (including both top-down and side grasps). Data augmentation applies random multiples of 90° rotations about the gravity axis and small vertical offsets, as sketched below. The network is trained with the Adam optimizer (fixed learning rate, batch size 32, 10 epochs).
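A minimal sketch of the rotation-about-gravity augmentation, applied jointly to the TSDF volume and the grasp annotations (axis conventions, the workspace-centre pivot, and the helper names are assumptions):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def augment_scene(tsdf, positions, quats, workspace_size=0.3):
    """Rotate a TSDF volume and its grasp annotations by a random multiple of
    90 degrees about the vertical (gravity) axis.

    tsdf: (N, N, N) array indexed (x, y, z); positions: (M, 3) in metres;
    quats: (M, 4) unit quaternions in scipy's (x, y, z, w) convention.
    """
    k = np.random.randint(4)                      # 0, 90, 180, or 270 degrees
    tsdf_aug = np.rot90(tsdf, k=k, axes=(0, 1))   # rotate the grid about z
    rz = R.from_euler("z", 90 * k, degrees=True)
    centre = np.array([workspace_size / 2, workspace_size / 2, 0.0])
    positions_aug = rz.apply(positions - centre) + centre
    quats_aug = (rz * R.from_quat(quats)).as_quat()
    return tsdf_aug, positions_aug, quats_aug
```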
5. Inference Pipeline
The inference procedure comprises the following steps:
```
Input: TSDF volume V (N^3) from live depth integration

(Q, R, W) = VGN(V)                  # Q: quality map, R: quaternion map, W: width map
Q = conv3d(Q, Gaussian3×3×3)        # smooth the quality map
mask = (V > finger_depth)           # remove grasps too close to surfaces
Q = Q * mask
candidates = {(i, Q[i]) | Q[i] > ε}
peaks = non_max_suppression(candidates)
for each voxel index i in peaks:
    t_i = voxel_to_world(i)
    r_i = R[i]                      # unit quaternion
    w_i = W[i] * v                  # convert width back to metres (v = voxel size)
    score = Q[i]
    add grasp (t_i, r_i, w_i, score)
return sorted grasp list
```
No explicit collision checker is invoked; the TSDF representation encodes the geometry, enabling VGN to learn collision-avoiding grasps implicitly. This implicit reasoning about free space is a distinctive property. The inference time per forward pass is about 10 ms on a GTX 1080 Ti GPU.
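The smoothing, masking, and peak-selection steps can be sketched with SciPy as follows; the threshold value, the surface-mask rule (mirroring the pseudocode's `V > finger_depth`), and the maximum-filter-based non-maximum suppression are illustrative assumptions, not the authors' exact post-processing:

```python
import numpy as np
from scipy import ndimage

def select_grasps(quality, tsdf, threshold=0.90, finger_depth=0.05,
                  voxel_size=0.0075):
    """Smooth the quality map, mask voxels too close to surfaces, and keep
    local maxima above the threshold. Returns voxel indices, world positions,
    and scores, sorted by descending quality."""
    q = ndimage.gaussian_filter(quality, sigma=1.0)     # smooth quality map
    q = np.where(tsdf > finger_depth, q, 0.0)           # surface mask (per pseudocode)
    # Non-maximum suppression: keep voxels that are the maximum of their
    # local 3x3x3 neighbourhood and exceed the quality threshold.
    local_max = ndimage.maximum_filter(q, size=3)
    peaks = np.argwhere((q == local_max) & (q > threshold))
    positions = (peaks + 0.5) * voxel_size              # voxel centre -> metres
    scores = q[tuple(peaks.T)]
    order = np.argsort(-scores)
    return peaks[order], positions[order], scores[order]
```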
6. Empirical Performance
Performance is evaluated both in simulation (PyBullet) and on real robotic hardware (Franka Panda with wrist-mounted RealSense D435, CPU-only inference).
Simulation Results (each cell: grasp success rate % / % of objects cleared; m = number of objects per scene):

| Method | Blocks | Pile | Packed | Blocks (m=10) | Pile (m=10) |
|---|---|---|---|---|---|
| GPD | 88.6 / 39.4 | 59.9 / 26.1 | 73.7 / 72.8 | 87.7 / 24.8 | 63.1 / 17.0 |
| VGN (ε=0.95) | 89.5 / 85.9 | 65.4 / 41.6 | 91.5 / 79.0 | 85.3 / 66.7 | 59.4 / 25.1 |
| VGN (ε=0.90) | 87.6 / 90.1 | 62.3 / 46.4 | 87.6 / 80.4 | 82.5 / 77.6 | 59.3 / 34.6 |
| VGN (ε=0.80) | 85.8 / 89.5 | 59.8 / 51.1 | 84.0 / 79.9 | 78.4 / 69.0 | 52.8 / 30.1 |
- VGN consistently clears more objects than GPD, particularly in "packed" scenes.
- The choice of the quality threshold ε allows trading off between individual grasp success rate and overall scene clearance; ε=0.90 balances precision and recall.
- Inference time is ∼10 ms (VGN, on GPU) vs. ∼1.2 s (GPD).
Real-World Results:
- In 10 table-clearing rounds (6 objects per round, 68 total grasps): 80% grasp success rate, 92% of objects cleared.
- Failure cases primarily involve low-friction cylindrical objects and collisions resulting from an insufficiently conservative geometry encoding.
- CPU-only inference takes ∼1.25 s per frame; this is expected to drop to the ∼10 ms GPU timing when a GPU is available.
7. Limitations and Prospective Extensions
Failure scenarios include very thin or low-friction objects, which expose sim-to-real mismatch and lead to overconfident grasp proposals, as well as scenes with heavy occlusion or extreme clutter that exceed the diversity seen during training. Potential improvements outlined include adversarial physics (randomized friction), domain randomization, few-shot fine-tuning on real data, and extending the core model to closed-loop dynamic grasping (leveraging the ∼10 ms inference for real-time feedback).
Extensions to multi-fingered hands could be realized by replacing the width head with a more expressive hand-pose head; multi-agent or multi-view scenarios would require coordinated TSDF fusion from multiple sensors or robots for larger workspaces.
VGN provides a geometry-aware, fully convolutional framework for 6 DOF grasp detection in clutter—achieving near-real-time performance and robust sim-to-real transfer without explicit collision checking.