Dex-Net 2.0 Dataset for Robotic Grasping

Updated 19 March 2026

Dex-Net 2.0 is a large-scale synthetic dataset featuring 6.7 million datapoints with rendered depth images, grasp parameters, and analytic quality metrics for robust robotic grasp planning.
It integrates 1,500 3D object models with detailed noise simulation and parallel-jaw grasp sampling to provide supervisory data for deep neural network training.
The dataset supports both synthetic validation and real-world trials: a GQ-CNN trained on it reaches about 85.7% accuracy on held-out synthetic data and a 93% grasp success rate on known physical objects with 0.8 s planning time.

Dex-Net 2.0 is a large-scale synthetic dataset developed for robust robotic grasp planning. It consists of 6.7 million datapoints, each comprising a rendered point cloud, a parallel-jaw grasp configuration, and an analytic grasp robustness metric. Constructed from 1,500 3D object models in thousands of randomized poses, Dex-Net 2.0 was specifically designed to provide supervisory data for data-driven grasp quality prediction using only depth images. This dataset enables rapid and reliable grasp planning with deep neural networks, bridging the gap between analytic and empirical approaches in robotic manipulation (Mahler et al., 2017).

1. Dataset Construction Pipeline

3D Model Collection and Preparation

Dex-Net 2.0 incorporates 1,500 object meshes: 1,371 synthetic models from 3DNet and 129 laser scans from the KIT database. Each mesh is aligned to its principal axes, scaled to fit within a 5 cm gripper aperture, and assigned unit mass. Stable object poses are computed using quasi-static equilibrium analysis, with low-probability (unstable) poses excluded.

Synthetic Point Cloud Generation

For every stable pose of each object, a camera is virtually positioned by sampling a viewpoint in spherical coordinates:

Radial distance $r \sim U[0.65, 0.75]$ m
Azimuth $\varphi \sim U[0, 2\pi]$
Polar tilt $\theta \sim U[0.057, 0.17]$ radians
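
The viewpoint sampling above can be sketched in a few lines. This is a minimal illustration, not the dataset's rendering code: the function name `sample_camera_position` and the axis convention (z pointing up from the table, polar tilt measured from the z axis) are assumptions made for the example.

```python
import math
import random

def sample_camera_position(rng):
    """Sample a camera position from the ranges listed above."""
    r = rng.uniform(0.65, 0.75)           # radial distance, meters
    phi = rng.uniform(0.0, 2 * math.pi)   # azimuth, radians
    theta = rng.uniform(0.057, 0.17)      # polar tilt, radians
    # Assumed convention: theta is measured from the +z (vertical) axis.
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return (x, y, z)

rng = random.Random(0)
camera_xyz = sample_camera_position(rng)
```

Because the tilt range is small, every sampled camera sits almost directly above the object, which matches an overhead depth sensor setup.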

A depth image is then rendered using a pinhole camera model and perspective projection. To simulate sensor noise, each pixel $\bar{y}$ is corrupted as $y = \xi \cdot \bar{y} + \epsilon$, where $\xi$ is sampled from a Gamma distribution (shape $\alpha=1000.0$, scale $\beta=0.001$, so its mean is 1) to model multiplicative, depth-proportional noise, and $\epsilon$ is sampled from a zero-mean Gaussian Process (kernel bandwidth $\ell=\sqrt{2}$ pixels, $\sigma=0.005$ m per pixel) to model additive, spatially correlated noise.
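
A small sketch of this noise model follows. Several details are assumptions made for illustration: the Gaussian Process uses a squared-exponential kernel with $\sigma$ as its marginal standard deviation, a single $\xi$ is drawn per image, and the Gamma parameters are interpreted as shape and scale. The helper names are hypothetical, and the patch is kept tiny because the dense Cholesky factorization is cubic in the number of pixels.

```python
import math
import random

def sample_gp_noise(height, width, rng, sigma=0.005, bandwidth=math.sqrt(2)):
    """Draw one zero-mean GP sample over pixel coordinates.

    Assumes a squared-exponential kernel; the source only gives the bandwidth.
    """
    pixels = [(i, j) for i in range(height) for j in range(width)]
    n = len(pixels)
    cov = [[0.0] * n for _ in range(n)]
    for a, (ia, ja) in enumerate(pixels):
        for b, (ib, jb) in enumerate(pixels):
            d2 = (ia - ib) ** 2 + (ja - jb) ** 2
            cov[a][b] = sigma ** 2 * math.exp(-d2 / (2 * bandwidth ** 2))
        cov[a][a] += 1e-6 * sigma ** 2  # small jitter for numerical stability
    # Cholesky factorization cov = chol * chol^T (lower triangular).
    chol = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1):
            s = cov[a][b] - sum(chol[a][k] * chol[b][k] for k in range(b))
            chol[a][b] = math.sqrt(s) if a == b else s / chol[b][b]
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    eps = [sum(chol[a][k] * z[k] for k in range(a + 1)) for a in range(n)]
    return [eps[r * width:(r + 1) * width] for r in range(height)]

def add_sensor_noise(depth, rng):
    """Apply y = xi * y_bar + eps to a depth image given as a list of rows."""
    xi = rng.gammavariate(1000.0, 0.001)  # shape, scale (assumed: one draw per image)
    eps = sample_gp_noise(len(depth), len(depth[0]), rng)
    return [[xi * d + e for d, e in zip(d_row, e_row)]
            for d_row, e_row in zip(depth, eps)]

rng = random.Random(0)
clean = [[0.70] * 6 for _ in range(6)]  # toy 6x6 depth patch, 0.70 m everywhere
noisy = add_sensor_noise(clean, rng)
```

With these parameters the multiplicative term has a standard deviation of about 3% of the depth, while the additive term contributes roughly 5 mm of smoothly varying error.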

Grasp Sampling and Representation

Parallel-jaw grasps are parameterized by $u=(p, \theta, h)$, where $p \in \mathbb{R}^2$ is the grasp center in the image plane, $\theta \in [0, 2\pi)$ is the rotation of the jaw axis in the table plane (not to be confused with the camera polar tilt above), and $h$ is the gripper approach height, sampled at 1 cm granularity. Candidate grasps are generated by uniform rejection sampling of antipodal point pairs on the object surface that satisfy the friction-cone grasp criterion:

For candidate contacts $x_1, x_2$ with surface normals $n_1, n_2$, the pair must satisfy $(x_2 - x_1) \cdot n_1 < 0$.
Both contact normals must lie within an angle $\theta_c = \arctan \mu$ of the grasp axis, initially using friction coefficient $\mu = 0.6$ (increased if too few grasps are found).
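
The friction-cone test can be written as a short check. This sketch assumes outward-pointing surface normals, which is the convention consistent with $(x_2 - x_1) \cdot n_1 < 0$; the function name `is_antipodal` is hypothetical.

```python
import math

def _unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def is_antipodal(x1, n1, x2, n2, mu=0.6):
    """Return True if both contacts satisfy the friction-cone criterion.

    Assumes n1 and n2 are outward surface normals at contacts x1 and x2.
    """
    diff = [b - a for a, b in zip(x1, x2)]
    if all(c == 0.0 for c in diff):
        return False  # degenerate pair: both contacts coincide
    axis = _unit(diff)  # grasp axis pointing from x1 to x2
    cone_half_angle = math.atan(mu)
    # At x1 the jaw pushes along +axis; the inward normal is -n1.
    cos1 = max(-1.0, min(1.0, -_dot(_unit(n1), axis)))
    # At x2 the jaw pushes along -axis; the inward normal is -n2.
    cos2 = max(-1.0, min(1.0, _dot(_unit(n2), axis)))
    return (math.acos(cos1) <= cone_half_angle
            and math.acos(cos2) <= cone_half_angle)

# Opposite faces of a 4 cm box: normals point straight away from each other.
flat = is_antipodal((0, 0, 0), (-1, 0, 0), (0.04, 0, 0), (1, 0, 0))
# Second surface tilted by 45 degrees: outside the ~31 degree cone for mu = 0.6.
tilted = is_antipodal((0, 0, 0), (-1, 0, 0), (0.04, 0, 0), (1, 1, 0))
```

Since $\theta_c < 90^\circ$, any pair that passes the cone test also satisfies the sign condition on $(x_2 - x_1) \cdot n_1$, so the first condition acts as a cheap early rejection. Raising `mu` widens the cone, which is how more candidates are admitted when too few are found.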

Analytic Grasp Quality Metrics

For each grasp $u$ and state $x=(O, T_o, T_c, \mu)$ (object geometry $O$, object pose $T_o$, camera pose $T_c$, and friction coefficient $\mu$), two metrics are computed:

Collision-free status: $coll\_free(u, x) = 1$ if the gripper does not penetrate the object or the table, and 0 otherwise.
Robust $\epsilon$-metric (Ferrari–Canny quality under uncertainty): For each perturbation sample $x'$, compute $\epsilon(u, x')$; the robust value is the expectation $Q(u) = \mathbb{E}_{x'\sim Uncertainty}[\epsilon(u, x')]$.

A grasp is labeled successful (label $S=1$) if $coll\_free(u, x) = 1$ and $Q(u) > \epsilon_0$, with $\epsilon_0 = 0.002$ m $\cdot$ N.
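
The labeling rule amounts to a Monte Carlo average followed by a threshold. The sketch below shows only that structure: `toy_epsilon` and `sample_friction` are hypothetical stand-ins (a real pipeline would evaluate the Ferrari–Canny metric on perturbed object poses, gripper poses, and friction), and the sample count is an assumption.

```python
import random
from statistics import fmean

EPSILON_0 = 0.002  # success threshold on the robust metric, from the text

def robust_quality(epsilon_fn, sample_perturbation, rng, n_samples=200):
    """Estimate Q(u) = E[epsilon(u, x')] by averaging over sampled perturbations."""
    return fmean(epsilon_fn(sample_perturbation(rng)) for _ in range(n_samples))

def label_grasp(coll_free, q, threshold=EPSILON_0):
    """S = 1 only if the grasp is collision-free and robustly above threshold."""
    return int(bool(coll_free) and q > threshold)

# Toy stand-ins, NOT the real metric: quality grows linearly with friction.
def sample_friction(rng):
    return max(0.0, rng.gauss(0.6, 0.05))

def toy_epsilon(mu):
    return 0.005 * mu

rng = random.Random(0)
q = robust_quality(toy_epsilon, sample_friction, rng)
label = label_grasp(coll_free=True, q=q)
```

Averaging over perturbations means a grasp that is excellent at the nominal state but fragile under small errors receives a lower $Q(u)$ than one that is moderately good everywhere.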

2. Dataset Structure and Statistical Properties

Total datapoints: 6.7 million $(\mathbf{y}, u)$ pairs, where each datapoint includes a rendered depth image, grasp parameters, a binary success label $S \in \{0,1\}$, and a robustness score $Q(u)$.
Object models: 1,500 unique objects.
Grasps per stable pose: Up to 100 antipodal grasps, subsampled from $\mathcal{O}(500)$ candidates generated per pose.
Positive label (successful grasps) fraction: $\approx 21.2\%$ (approximately 1.42 million).
Datapoint format: Each contains (a) a $32 \times 32$ “grasp image” rotated, translated, and cropped so that the grasp center lies at the image center and the jaw axis aligns with the center row, (b) the scalar grasp height $h$, and (c) the label $S$.
Storage: Data are organized in HDF5 files (per batch), with separate datasets for image patches, heights, and labels.
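
The grasp-image alignment described in the datapoint format can be sketched with nearest-neighbor resampling. This is an illustrative version only: the function name, the angle convention (jaw axis direction $(\cos\theta, \sin\theta)$ in column/row coordinates), the choice of index 16 as the patch center, and border clamping are all assumptions.

```python
import math

def extract_grasp_image(depth, center_row, center_col, theta, size=32):
    """Crop a size x size patch centered on the grasp with the jaw axis on the center row."""
    rows, cols = len(depth), len(depth[0])
    half = size // 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    patch = []
    for r in range(size):
        dy = r - half
        out_row = []
        for c in range(size):
            dx = c - half
            # Rotate the patch offset by theta, then translate to the grasp center.
            src_col = center_col + dx * cos_t - dy * sin_t
            src_row = center_row + dx * sin_t + dy * cos_t
            # Nearest-neighbor lookup, clamped to the image borders.
            i = min(max(int(math.floor(src_row + 0.5)), 0), rows - 1)
            j = min(max(int(math.floor(src_col + 0.5)), 0), cols - 1)
            out_row.append(depth[i][j])
        patch.append(out_row)
    return patch

# Toy depth image whose value equals the column index.
depth = [[float(col) for col in range(64)] for _ in range(64)]
patch = extract_grasp_image(depth, 32, 32, 0.0)
```

Normalizing every datapoint this way removes the grasp position and orientation from the learning problem, so the network only has to reason about the local geometry between the jaws plus the height $h$.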

Attribute	Value / Description	Notes
Total datapoints	6.7 million	(depth image, grasp, label, Q)
Object models	1,500	1,371 synthetic, 129 scanned
Grasp candidates	~500 sampled per stable pose; ≤ 100 kept	Antipodal, friction cone
Successful fraction	≈21.2%	~1.42M positive labels
Image patch size	$32 \times 32$	Aligned to grasp/jaw axis

3. Training and Evaluation Protocols

Grasp Quality CNN (GQ-CNN)

Dex-Net 2.0 is designed for training the GQ-CNN, which predicts the probability of success for a grasp from a $32\times32$ depth patch and the scalar gripper height $h$.

Architecture: Four convolutional layers (sizes $7\times7$, $5\times5$, $3\times3$, $3\times3$; 64 filters each; ReLU activations; local response normalization on the first layer), followed by three fully connected layers (1024, 1024, and 2 units), with the height $h$ fused in via a separate tower, totaling approximately 18 million parameters.
Output: Probability $\hat{S} = \hat{Q}(u|y_p, h) \in [0,1]$.
Loss: Cross-entropy between label $S$ and prediction $\hat{S}$.
Optimization: SGD with momentum 0.9, batch size 128, exponential learning-rate decay, weights initialized $\sim \mathcal{N}(0, 2/n_i)$.
Data augmentation: Rotation (multiples of $90^\circ$), horizontal/vertical flips, and on-the-fly noise injection (matching the rendering noise model).
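
As a rough sanity check on the parameter count, the layer sizes above can be tallied directly. Three details are assumptions not stated in the list: convolutions preserve spatial size, a single 2×2 pooling step reduces the $32\times32$ feature map to $16\times16$ before the first fully connected layer, and the height tower is a 16-unit layer concatenated into the second fully connected layer.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

POOLED_SIDE = 16      # assumed: one 2x2 pooling step on the 32x32 map
HEIGHT_FEATURES = 16  # assumed width of the height tower

layers = {
    "conv 7x7": conv_params(7, 1, 64),
    "conv 5x5": conv_params(5, 64, 64),
    "conv 3x3 (first)": conv_params(3, 64, 64),
    "conv 3x3 (second)": conv_params(3, 64, 64),
    "fc image, 1024": fc_params(POOLED_SIDE * POOLED_SIDE * 64, 1024),
    "fc height tower": fc_params(1, HEIGHT_FEATURES),
    "fc merged, 1024": fc_params(1024 + HEIGHT_FEATURES, 1024),
    "fc output, 2": fc_params(1024, 2),
}
total_params = sum(layers.values())
```

Under these assumptions the total comes to about 18.0 million, in line with the stated figure, and the first fully connected layer alone accounts for roughly 16.8 million of it.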

Evaluation Protocols

Synthetic classification: Use 20% held-out synthetic data for validation; report accuracy and ROC AUC.
Physical grasping benchmarks: An ABB YuMi robot with silicone fingertip grippers grasps singulated objects. The input is a full depth image plus an object bounding box; the pipeline samples $O(400)$ antipodal grasps, scores them with the GQ-CNN, and filters them for reachability and collisions.
- Known objects: 8 adversarial 3D-printed models (80 trials)
- Novel objects: 10 household items (50 trials)
Metrics: Success rate (successful attempts / total attempts), precision (success rate among grasps with predicted $\hat{S} \geq 0.5$), and planning time (from image capture to grasp command).
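
The two rate metrics can be computed from a log of trials as follows. The trial records here are made-up toy values used only to show the arithmetic, and the function names are hypothetical.

```python
def success_rate(trials):
    """Fraction of attempted grasps that succeeded."""
    return sum(1 for _, succeeded in trials if succeeded) / len(trials)

def precision(trials, threshold=0.5):
    """Success rate restricted to grasps predicted with S_hat >= threshold."""
    confident = [succeeded for s_hat, succeeded in trials if s_hat >= threshold]
    if not confident:
        return float("nan")  # undefined when no grasp clears the threshold
    return sum(1 for succeeded in confident if succeeded) / len(confident)

# Toy log of (predicted success probability, actual outcome) pairs.
trials = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]
rate = success_rate(trials)   # 3 of 5 attempts succeeded
prec = precision(trials)      # 2 of the 3 confident grasps succeeded
```

Precision can exceed the success rate when the planner is forced to execute low-confidence grasps: it measures how trustworthy the network is when it does predict success.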

4. Empirical Performance

Synthetic Dataset Classification

GQ-CNN trained on the full dataset achieves $\approx 85.7\%$ accuracy on held-out synthetic validation data.

Real-World Grasping Trials

Known objects (8 adversarial): 93% ± 6% success rate, 94% precision, 0.8 s average planning time (GQ-CNN planner, greedy policy).
Baseline comparison: The point-cloud registration planner attains 95% ± 5% but requires 2.6 s (≈3× slower); image heuristics and ML baselines (ML-RF, ML-SVM) achieve 70% and 75–80% success, respectively.
Novel objects (10 household items): 80% ± 11% success rate, 100% precision, 0.8 s planning time.
Generalization (40 “in-the-wild” objects): Using Cross-Entropy Method (CEM) optimization, success rate is 94%, precision 99%, and planning time 2.5 s.

System Integration

Dex-Net 2.0, combined with a push-separation policy, enabled ABB YuMi to correctly pack 3 target objects in 4 out of 5 order-fulfillment trials, yielding a 93% grasp success rate over 27 attempts.

5. Impact and Availability

Dex-Net 2.0 demonstrates that large-scale, fully synthetic datasets can enable robust and efficient grasp planning from depth images, rivaling registration-based methods in reliability while offering a 3× reduction in planning latency. The dataset and its associated GQ-CNN pipeline support research into analytic learning for robotic manipulation and provide a standardized benchmark for robust, generalizable grasp planning. Code, dataset, and supporting materials are accessible at http://berkeleyautomation.github.io/dex-net (Mahler et al., 2017).

6. Relevance to Robotic Manipulation Research

Dex-Net 2.0 is used to investigate learning-based grasp evaluation under analytic supervision, bridging the gap between traditional analytic metrics (such as Ferrari–Canny’s $\epsilon$ -metric) and deep neural perception. The dataset’s scale supports direct empirical comparison of learning-based approaches to classical registration pipelines and enables systematic benchmarking of accuracy, precision, planning speed, and generalization to novel, articulated, and deformable objects.

A plausible implication is that large-scale synthetic pipelines that integrate CAD models, analytic criteria, noise modeling, and deep learning may become a standard way to train and evaluate perception-driven robotic manipulation algorithms.


References

Mahler et al. (2017). Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics.
