Dex-Net 2.0 Dataset for Robotic Grasping
- Dex-Net 2.0 is a large-scale synthetic dataset featuring 6.7 million datapoints with rendered depth images, grasp parameters, and analytic quality metrics for robust robotic grasp planning.
- It integrates 1,500 3D object models with detailed noise simulation and parallel-jaw grasp sampling to provide supervisory data for deep neural network training.
- The dataset supports both synthetic validation and real-world trials, achieving high classification accuracy and success rates in rapid grasp planning benchmarks.
Dex-Net 2.0 is a large-scale synthetic dataset developed for robust robotic grasp planning. It consists of 6.7 million datapoints, each comprising a rendered point cloud, a parallel-jaw grasp configuration, and an analytic grasp robustness metric. Constructed from 1,500 3D object models in thousands of randomized poses, Dex-Net 2.0 was specifically designed to provide supervisory data for data-driven grasp quality prediction using only depth images. This dataset enables rapid and reliable grasp planning with deep neural networks, bridging the gap between analytic and empirical approaches in robotic manipulation (Mahler et al., 2017).
1. Dataset Construction Pipeline
3D Model Collection and Preparation
Dex-Net 2.0 incorporates 1,500 object meshes: 1,371 synthetic models from 3DNet and 129 laser scans from the KIT database. Each mesh is aligned to its principal axes, scaled to fit within a 5 cm gripper aperture, and assigned unit mass. Stable object poses are computed using quasi-static equilibrium analysis, with low-probability (unstable) poses excluded.
Synthetic Point Cloud Generation
For every stable pose of each object, a camera is virtually positioned by sampling a viewpoint in spherical coordinates:
- Radial distance (in meters)
- Azimuth angle
- Polar tilt (in radians)
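The viewpoint sampling above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the uniform sampling, the convention that the polar tilt is measured from the vertical axis above the table center, and the numeric ranges in the usage lines are all assumptions.

```python
import numpy as np

def sample_camera_position(rng, radius_range, azimuth_range, tilt_range):
    """Sample a camera position in spherical coordinates (hypothetical helper).

    Assumes the origin is the table center, z points up, and the polar
    tilt is measured from the z axis. Each range is a (low, high) tuple.
    """
    radius = rng.uniform(*radius_range)    # meters
    azimuth = rng.uniform(*azimuth_range)  # radians
    tilt = rng.uniform(*tilt_range)        # radians
    # Standard spherical-to-Cartesian conversion.
    return np.array([
        radius * np.sin(tilt) * np.cos(azimuth),
        radius * np.sin(tilt) * np.sin(azimuth),
        radius * np.cos(tilt),
    ])

rng = np.random.default_rng(0)
# Placeholder ranges chosen only for illustration.
position = sample_camera_position(rng, (0.5, 1.0), (0.0, 2 * np.pi), (0.0, 0.3))
```

The returned point would then serve as the camera origin, with the optical axis aimed back at the table.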
A depth image is then rendered using a pinhole camera model and perspective projection. To simulate sensor noise, each pixel is corrupted as $y = \alpha \hat{y} + \epsilon$, where $\hat{y}$ is the rendered depth, $\alpha$ is sampled from a Gamma distribution (shape $k = 1000$, scale $\theta = 0.001$) for multiplicative (depth-proportional) noise, and $\epsilon$ is sampled from a zero-mean Gaussian process (kernel bandwidth $\ell = \sqrt{2}$ pixels, $\sigma = 0.005$ m per pixel) for additive noise.
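A minimal sketch of this noise model is given below, under several assumptions: one multiplicative Gamma factor is drawn per image, the Gaussian process uses a squared-exponential kernel over pixel coordinates, and the default parameter values are illustrative. The function name `corrupt_depth` is hypothetical. Exact Gaussian-process sampling scales cubically with the pixel count, so the sketch is only practical for small images.

```python
import numpy as np

def corrupt_depth(depth, rng, shape_k=1000.0, scale_theta=0.001,
                  bandwidth_px=np.sqrt(2.0), sigma_m=0.005):
    """Apply multiplicative Gamma noise and additive GP noise to a depth image."""
    height, width = depth.shape
    # Multiplicative factor with mean shape_k * scale_theta (one per image: assumption).
    alpha = rng.gamma(shape=shape_k, scale=scale_theta)
    # Pixel coordinates as an (N, 2) array.
    rows, cols = np.mgrid[0:height, 0:width]
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    # Squared-exponential kernel over pixel coordinates (assumed kernel form).
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth_px ** 2))
    kernel += 1e-6 * np.eye(height * width)  # jitter for numerical stability
    chol = np.linalg.cholesky(kernel)
    # Correlated zero-mean noise with per-pixel standard deviation sigma_m.
    epsilon = sigma_m * (chol @ rng.standard_normal(height * width))
    return alpha * depth + epsilon.reshape(height, width)

rng = np.random.default_rng(0)
clean = np.full((16, 16), 0.7)   # flat surface 0.7 m from the camera
noisy = corrupt_depth(clean, rng)
```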
Grasp Sampling and Representation
Parallel-jaw grasps are parameterized by $u = (i, j, \phi, z)$, where $(i, j)$ is the image-plane grasp center, $\phi$ is the rotation of the jaw axis in the table plane, and $z$ is the gripper approach height (sampled at 1 cm granularity). Candidate grasps are generated by uniform rejection sampling of antipodal point pairs on the object surface that satisfy the friction-cone grasp criterion:
- For candidate contact points $c_1, c_2$ with surface normals $n_1, n_2$, the grasp axis is the line through $c_1$ and $c_2$.
- Both contact normals must lie within an angle $\arctan(\mu)$ of the grasp axis, where $\mu$ is the friction coefficient (set to an initial value and increased if too few grasps are found).
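The friction-cone test can be sketched as below. It assumes outward-pointing unit surface normals and uses the standard friction-cone half-angle $\arctan(\mu)$; the function name and the example contacts are hypothetical, not taken from the Dex-Net code.

```python
import numpy as np

def is_antipodal(c1, n1, c2, n2, mu):
    """Check whether two contacts form an antipodal pair for friction coefficient mu.

    c1, c2: contact points; n1, n2: outward unit surface normals (assumption).
    The two contact points must be distinct.
    """
    axis = np.asarray(c2, dtype=float) - np.asarray(c1, dtype=float)
    axis /= np.linalg.norm(axis)
    cos_limit = np.cos(np.arctan(mu))  # cosine of the friction-cone half-angle
    # The inward normal at c1 should point along +axis, and at c2 along -axis.
    cos1 = np.dot(-np.asarray(n1, dtype=float), axis)
    cos2 = np.dot(np.asarray(n2, dtype=float), axis)
    return bool(cos1 >= cos_limit and cos2 >= cos_limit)

# Opposite faces of a box: normals are collinear with the grasp axis.
flat_pair = is_antipodal([-1, 0, 0], [-1, 0, 0], [1, 0, 0], [1, 0, 0], mu=0.5)
# Second normal tilted by 45 degrees: outside the cone for mu = 0.5.
tilted = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
tilted_pair = is_antipodal([-1, 0, 0], [-1, 0, 0], [1, 0, 0], tilted, mu=0.5)
```

Raising `mu` widens the cone, which is why increasing the friction coefficient admits more candidate pairs.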
Analytic Grasp Quality Metrics
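For each grasp $u$ and state $x$, two metrics are computed:
- Collision-free status: $C(u, x) = 1$ if the gripper does not penetrate the object or table, and $0$ otherwise.
- Robust $\epsilon$-metric (Ferrari–Canny quality under uncertainty): for each perturbation sample $k = 1, \dots, N$, compute the Ferrari–Canny quality $q_k$ of the perturbed grasp; the robust value is the sample mean $Q(u, x) = \frac{1}{N} \sum_{k=1}^{N} q_k$.
A grasp is labeled successful (label $S = 1$) if $Q(u, x) > \delta$ and $C(u, x) = 1$, for a fixed threshold $\delta$ (in mN).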
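The labeling rule can be sketched as a Monte Carlo average followed by a threshold. The per-sample quality values and the threshold in the usage lines are made-up placeholders, and `robust_label` is a hypothetical helper name.

```python
import numpy as np

def robust_label(quality_samples, collision_free, delta):
    """Return (robust quality, binary label) for one grasp.

    quality_samples: Ferrari-Canny quality for each perturbation sample.
    collision_free: True if the gripper hits neither the object nor the table.
    delta: success threshold on the robust quality.
    """
    robust_q = float(np.mean(quality_samples))
    label = int(collision_free and robust_q > delta)
    return robust_q, label

# Placeholder numbers for illustration only.
robust_q, label = robust_label([0.004, 0.002, 0.003], True, delta=0.0025)
```

A grasp with high average quality is still labeled a failure if it collides, since both conditions must hold.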
2. Dataset Structure and Statistical Properties
- Total datapoints: 6.7 million image-grasp pairs, where each datapoint includes a rendered depth image, grasp parameters, a binary success label $S$, and a robustness score $Q$.
- Object models: 1,500 unique objects.
- Grasps per object: Up to 100 antipodal grasps for each stable pose (about 500 candidates sampled per pose, with 100 subsampled).
- Positive label (successful grasps) fraction: ≈21.2% (approximately 1.42 million).
- Datapoint format: Each contains (a) a “grasp image” rotated, translated, and cropped so that the grasp center lies at the image center and the jaw axis aligns with the center row, (b) the scalar grasp height $z$, and (c) the label $S$.
- Storage: Data are organized in HDF5 files (per batch), with separate datasets for image patches, heights, and labels.
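The grasp-image alignment in the datapoint format can be sketched with plain NumPy. This is an illustration under assumptions: nearest-neighbor resampling, an angle measured from the image row direction toward increasing row index, and a default patch size picked for the example. `grasp_image` is a hypothetical helper name.

```python
import numpy as np

def grasp_image(depth, center, angle, size=32):
    """Crop a size x size patch centered on the grasp, jaw axis along the middle row.

    center: (row, col) of the grasp in the full depth image.
    angle: jaw-axis angle in radians, measured from the +col direction
           toward the +row direction (assumed convention).
    """
    ci, cj = center
    half = (size - 1) / 2.0
    # Offsets of each patch pixel from the patch center.
    d_row, d_col = np.mgrid[0:size, 0:size] - half
    sin_a, cos_a = np.sin(angle), np.cos(angle)
    # Map patch columns onto the jaw axis and patch rows onto its perpendicular.
    src_i = ci + d_col * sin_a + d_row * cos_a
    src_j = cj + d_col * cos_a - d_row * sin_a
    # Nearest-neighbor lookup, clamped to the image bounds.
    src_i = np.clip(np.rint(src_i).astype(int), 0, depth.shape[0] - 1)
    src_j = np.clip(np.rint(src_j).astype(int), 0, depth.shape[1] - 1)
    return depth[src_i, src_j]

# A ridge running down column 10; a jaw axis along that ridge (angle pi/2)
# should place the ridge on the patch's middle row.
scene = np.zeros((21, 21))
scene[:, 10] = 1.0
patch = grasp_image(scene, (10, 10), np.pi / 2, size=5)
```

Because every patch shares this canonical frame, the network does not need the grasp center or angle as separate inputs.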
| Attribute | Value / Description | Notes |
|---|---|---|
| Total datapoints | 6.7 million | (depth image, grasp, label, Q) |
| Object models | 1,500 | 1,371 synthetic, 129 scanned |
| Grasp candidates | ≤ 100 per stable pose; ~500 sampled per pose | Antipodal, friction cone |
| Successful fraction | ≈21.2% | ~1.42M positive labels |
| Image patch size | 32 × 32 pixels | Aligned to grasp/jaw axis |
3. Training and Evaluation Protocols
Grasp Quality CNN (GQ-CNN)
Dex-Net 2.0 is designed for training the GQ-CNN, which predicts the probability of success for a grasp from a depth patch $y$ and scalar gripper height $z$.
- Architecture: Four convolutional layers (kernel sizes $7 \times 7$, $5 \times 5$, $3 \times 3$, $3 \times 3$; 64 filters each; ReLU activations; local response normalization after the first layer), followed by three fully connected layers (1024, 1024, and 2 units), with the height $z$ fused in through a separate tower, totaling approximately 18 million parameters.
- Output: Predicted success probability $Q_\theta(y, z) \in [0, 1]$.
- Loss: Cross-entropy between the label $S$ and the prediction $Q_\theta(y, z)$.
- Optimization: SGD with momentum 0.9, batch size 128, exponential learning-rate decay, and randomly initialized weights.
- Data augmentation: Rotation (multiples of $180^\circ$, which keep the jaw axis on the center row), horizontal/vertical flips, and on-the-fly noise injection (matching the rendering noise model).
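As a rough check on the "approximately 18 million parameters" figure, the count can be worked out by hand. The sketch below depends on details the text above does not state, all of which are assumptions: kernel sizes of 7, 5, 3, and 3, a 32 × 32 single-channel input, padding that preserves spatial size, a single 2 × 2 pooling step before flattening, and a 16-unit height tower concatenated ahead of the second fully connected layer.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square convolutional layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Four convolutional layers with 64 filters each (kernel sizes assumed).
conv_total = (conv_params(7, 1, 64) + conv_params(5, 64, 64)
              + conv_params(3, 64, 64) + conv_params(3, 64, 64))

# Assumed: 32x32 input halved once by pooling -> 16 x 16 x 64 features.
flattened = 16 * 16 * 64
height_units = 16  # assumed width of the height tower

fc_total = (fc_params(flattened, 1024)                # image stream
            + fc_params(1, height_units)              # height tower
            + fc_params(1024 + height_units, 1024)    # fused layer
            + fc_params(1024, 2))                     # two-way output

total = conv_total + fc_total
print(f"{total:,} parameters")
```

Under these assumptions the total comes to about 18.0 million, nearly all of it in the first fully connected layer.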
Evaluation Protocols
- Synthetic classification: Use 20% held-out synthetic data for validation; report accuracy and ROC AUC.
- Physical grasping benchmarks: An ABB YuMi robot with silicone fingertip grippers grasps singulated objects. The input is a full depth image plus an object bounding box; the pipeline samples antipodal grasps, scores them with the GQ-CNN, and filters them for reachability and collisions.
- Known objects: 8 adversarial 3D-printed models (80 trials)
- Novel objects: 10 household items (50 trials)
- Metrics: Success rate (number of successful attempts / number attempted), precision (successes / grasps with predicted success probability above 0.5), planning time (from image to grasp command).
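The success-rate and precision metrics can be computed as in the sketch below. The trial outcomes and predicted probabilities are invented for illustration, the default confidence threshold is an assumption, and `grasp_metrics` is a hypothetical helper.

```python
def grasp_metrics(successes, predicted_probs, threshold=0.5):
    """Compute (success rate, precision) over a list of grasp trials.

    successes: 1 if the grasp lifted the object, else 0.
    predicted_probs: predicted success probability for each executed grasp.
    threshold: confidence level above which a grasp counts as predicted-positive.
    """
    success_rate = sum(successes) / len(successes)
    confident = [s for s, p in zip(successes, predicted_probs) if p > threshold]
    # Precision is undefined when no grasp exceeds the threshold.
    precision = sum(confident) / len(confident) if confident else float("nan")
    return success_rate, precision

# Invented outcomes: the only failure was a low-confidence grasp.
outcomes = [1, 1, 0, 1, 1]
probs = [0.9, 0.8, 0.4, 0.7, 0.6]
success_rate, precision = grasp_metrics(outcomes, probs)
```

Precision can exceed the success rate, as in the novel-object results, when failures are concentrated among grasps the network itself scored as unlikely to succeed.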
4. Empirical Performance
Synthetic Dataset Classification
- GQ-CNN trained on the full dataset achieves 85.7% accuracy on held-out synthetic validation data.
Real-World Grasping Trials
- Known objects (8 adversarial): 93% ± 6% success rate, 94% precision, 0.8s average planning time (GQ-CNN planner, greedy policy).
- Baseline comparison: Point-cloud registration planner attains 95% ± 5% but requires 2.6s (≈3× slower); image heuristics and ML baselines (ML-RF, ML-SVM) achieve 70% and 75–80% success, respectively.
- Novel objects (10 household items): 80% ± 11% success rate, 100% precision, 0.8s planning.
- Generalization (40 “in-the-wild” objects): Using Cross-Entropy Method (CEM) optimization, success is 94%, precision 99%, planning 2.5s.
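The text does not say how the ± intervals were obtained. One plausible reading, offered here as an assumption, is a 95% normal-approximation binomial interval. The sketch below checks that reading against the trial counts from the evaluation protocol (80 known-object trials, 50 novel-object trials).

```python
import math

def binomial_half_width(rate, trials, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(rate * (1.0 - rate) / trials)

# Reported rates paired with their trial counts.
known_gqcnn = binomial_half_width(0.93, 80)         # reported as +/- 6%
known_registration = binomial_half_width(0.95, 80)  # reported as +/- 5%
novel_gqcnn = binomial_half_width(0.80, 50)         # reported as +/- 11%

for name, hw in [("known, GQ-CNN", known_gqcnn),
                 ("known, registration", known_registration),
                 ("novel, GQ-CNN", novel_gqcnn)]:
    print(f"{name}: +/- {100 * hw:.1f}%")
```

Rounded to whole percentages, these half-widths agree with the reported intervals. It also follows that the 93% and 95% known-object results overlap within their uncertainty.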
System Integration
Dex-Net 2.0, combined with a push-separation policy, enabled ABB YuMi to correctly pack 3 target objects in 4 out of 5 order-fulfillment trials, yielding a 93% grasp success rate over 27 attempts.
5. Impact and Availability
Dex-Net 2.0 demonstrates that large-scale, fully synthetic datasets can enable robust and efficient grasp planning from depth images, rivaling registration-based methods in reliability while offering a 3× reduction in planning latency. The dataset and its associated GQ-CNN pipeline support research into analytic learning for robotic manipulation and provide a standardized benchmark for robust, generalizable grasp planning. Code, dataset, and supporting materials are accessible at http://berkeleyautomation.github.io/dex-net (Mahler et al., 2017).
6. Relevance to Robotic Manipulation Research
Dex-Net 2.0 is used to investigate learning-based grasp evaluation under analytic supervision, bridging the gap between traditional analytic metrics (such as the Ferrari–Canny $\epsilon$-metric) and deep neural perception. The dataset’s scale supports direct empirical comparison of learning-based approaches to classical registration pipelines and enables systematic benchmarking of accuracy, precision, planning speed, and generalization to novel, articulated, and deformable objects.
A plausible implication is that large-scale synthetic pipelines that integrate CAD models, analytic criteria, noise modeling, and deep learning may set a new paradigm for training and evaluating perception-driven robotic manipulation algorithms.