This paper introduces GraspClutter6D, a large-scale, real-world dataset designed to advance robotic perception and grasping, particularly in highly cluttered environments. The authors argue that existing datasets often feature simplified scenes with limited occlusion and diversity, hindering the development of robust systems for practical applications like warehouse automation or household assistance. GraspClutter6D aims to bridge this gap.
Key Contributions and Dataset Details:
- Scale and Complexity: The dataset includes 1,000 unique scenes featuring 200 distinct objects (household, warehouse, industrial). It contains 52,000 RGB-D images captured by four different sensors. The scenes are densely packed, averaging 14.1 object instances per image with a high average occlusion rate of 62.6%, significantly higher than previous datasets like GraspNet-1B (8.9 instances, 35.2% occlusion).
- Diverse Environments: Data was collected in 75 different configurations spanning bin, shelf, and table environments, using various background materials to increase diversity.
- Multi-Sensor and Multi-View: A UR5 robot arm equipped with four RGB-D cameras (RealSense D415, D435, Azure Kinect, Zivid One+ M) captured each scene from 13 viewpoints (1 center, 12 peripheral). This provides varied fields of view, illumination, depth characteristics, and viewpoint coverage (mean 67.6%).
- Rich Annotations: The dataset provides extensive annotations:
  - 736,000 instance-level 6D object poses and segmentation masks.
  - 9.3 billion feasible 6-DoF grasp poses for a parallel-jaw gripper (avg. 178K per image).
- High-Quality Data and Annotations:
  - Object Models: 200 objects are included (108 custom-scanned with an Artec Leo scanner, 92 drawn from existing benchmarks such as YCB). High-quality, watertight, textured 3D models were generated; reflective and transparent objects were sprayed so that their geometry could be captured accurately.
  - Calibration: Rigorous intrinsic, extrinsic (camera-to-camera and camera-to-robot), and depth calibration was performed, with depth calibration reducing errors for the low-cost sensors (a hand-eye calibration sketch follows this list).
  - Pose Annotation: Poses were annotated with a custom crowd-sourcing tool that monitors quality against a target mean depth error below 5 mm. Annotation was done on high-resolution integrated point clouds (from the Zivid sensor) and propagated to all views. The achieved accuracy (e.g., a mean absolute depth difference (μ|δ|) of 3.22 mm for Zivid) is comparable to or better than that of existing datasets (a sketch of this depth-check metric also follows the list).
  - Grasp Annotation: A two-stage process similar to that of GraspNet-1B was used: object-level force-closure grasp sampling, followed by scene-level projection and collision checking against the reconstructed scene point cloud (see the collision-filtering sketch after this list).
- Public Availability: The dataset, annotation tools, data processing toolkit, and object purchase links are publicly available.
- Standardized Splits: Predefined splits are provided for cross-object generalization testing (training on 132 objects, testing on 68 unseen YCB-HOPE objects) and intra-object testing (focused on 21 common YCB-Video objects).
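For the camera-to-robot part of the extrinsic calibration mentioned above, the paper's exact procedure is not reproduced here, but a standard eye-in-hand formulation can be run with OpenCV. The sketch below assumes paired gripper poses (from robot forward kinematics) and calibration-board poses (e.g., from solvePnP); the helper name and frame conventions are illustrative assumptions.

```python
import cv2
import numpy as np

def estimate_camera_to_gripper(gripper_poses, board_poses):
    """Eye-in-hand calibration: recover the fixed camera-to-gripper transform.

    gripper_poses: list of 4x4 gripper poses in the robot base frame
                   (forward kinematics at each capture).
    board_poses:   list of 4x4 calibration-board poses in the camera frame
                   (e.g., from cv2.solvePnP on the same captures).
    """
    R_g2b = [T[:3, :3] for T in gripper_poses]
    t_g2b = [T[:3, 3] for T in gripper_poses]
    R_t2c = [T[:3, :3] for T in board_poses]
    t_t2c = [T[:3, 3] for T in board_poses]
    R, t = cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c,
                                method=cv2.CALIB_HAND_EYE_TSAI)
    T_cam2gripper = np.eye(4)
    T_cam2gripper[:3, :3] = R
    T_cam2gripper[:3, 3] = t.ravel()
    return T_cam2gripper
```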
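The pose-annotation quality figures above are depth-discrepancy statistics: depth rendered from the annotated object poses is compared against the captured depth. The exact protocol (masking, units, outlier handling) is not specified here; the snippet below is a minimal sketch of the basic metric.

```python
import numpy as np

def mean_abs_depth_error(captured_depth, rendered_depth, object_mask=None):
    """Mean absolute difference between the sensor depth map and the depth map
    rendered from the annotated 6D poses, over pixels valid in both maps
    (units follow the depth maps, e.g. millimetres)."""
    valid = (captured_depth > 0) & (rendered_depth > 0)
    if object_mask is not None:
        valid &= object_mask.astype(bool)
    delta = (captured_depth[valid].astype(np.float64)
             - rendered_depth[valid].astype(np.float64))
    return float(np.abs(delta).mean())
```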
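For the scene-level stage of the grasp-annotation pipeline, one plausible implementation (the gripper dimensions, axis conventions, and function names below are assumptions, not the authors' code) transforms each object-frame grasp by the annotated pose and rejects it when scene points fall inside a crude box model of the gripper fingers:

```python
import numpy as np

def filter_colliding_grasps(object_grasps, T_world_obj, scene_points,
                            jaw_width=0.085, finger_depth=0.06,
                            finger_thickness=0.01):
    """Keep grasps whose (simplified) finger volumes contain no scene points.

    object_grasps: (N, 4, 4) grasp poses in the object frame, e.g. sampled on
                   the object mesh with a force-closure criterion.
    T_world_obj:   annotated 4x4 object pose in the scene.
    scene_points:  (M, 3) reconstructed scene point cloud.
    Convention assumed: grasp x-axis = closing direction, z-axis = approach.
    """
    scene_h = np.c_[scene_points, np.ones(len(scene_points))]  # homogeneous, (M, 4)
    kept = []
    for T_obj_grasp in object_grasps:
        T_world_grasp = T_world_obj @ T_obj_grasp
        # Express the scene in the grasp frame.
        pts = (scene_h @ np.linalg.inv(T_world_grasp).T)[:, :3]
        inside_depth = (pts[:, 2] > 0.0) & (pts[:, 2] < finger_depth)
        thin_slice = np.abs(pts[:, 1]) < finger_thickness / 2
        in_fingers = (np.abs(pts[:, 0]) > jaw_width / 2) & \
                     (np.abs(pts[:, 0]) < jaw_width / 2 + finger_thickness)
        if not np.any(inside_depth & thin_slice & in_fingers):
            kept.append(T_world_grasp)
    return np.asarray(kept)
```

A full pipeline would also check the gripper palm and verify that the region between the fingers actually encloses the target object; those steps are omitted for brevity.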
Experiments and Benchmarking:
The paper validates the dataset's utility through several experiments:
- Training Resource for Grasp Detection:
  - Setup: Contact-GraspNet was trained separately on ACRONYM (synthetic), GraspNet-1B (real, less clutter), and GraspClutter6D (real, high clutter). Performance was evaluated in simulation (PyBullet) and in real-world robot experiments (UR5 arm, Robotiq gripper) on packed and pile scenes with varying object counts. AnyGrasp (trained on the extended GraspNet-1B++) served as a state-of-the-art baseline.
  - Results: Contact-GraspNet trained on GraspClutter6D significantly outperformed the models trained on ACRONYM and GraspNet-1B in both simulation (e.g., 77.3% vs. 71.0% grasp success rate (GSR) in the 15-object pile setting) and real-world tests (e.g., 68.5% vs. 51.1% GSR in the 15-object pile). It also outperformed AnyGrasp in cluttered scenarios, demonstrating the value of training on diverse, highly cluttered real-world data.
- Instance Segmentation Benchmark:
  - Models: Mask R-CNN, Cascade Mask R-CNN, Mask2Former (trained on the GraspClutter6D cross-object split), and Grounded-SAM (zero-shot foundation model).
  - Results: Mask2Former achieved the best performance (AP 43.5). Grounded-SAM showed high recall but poor precision, indicating difficulty delineating precise instance boundaries in clutter. This suggests that domain-specific training remains crucial for high performance in such environments (see the evaluation sketch after this list).
- 6D Object Pose Estimation Benchmark:
  - Models: FFB6D and GDR-Net (trained on the GraspClutter6D intra-object split), alongside MegaPose and FoundationPose (pre-trained foundation models), all tested on the YCB-Video objects within GraspClutter6D.
  - Results: The foundation models (FoundationPose: 70.5 ADD(-S), MegaPose: 69.6 ADD(-S)) significantly outperformed the specialized models trained only on the dataset split (see the ADD(-S) sketch after this list). Performance degraded considerably for all models as occlusion increased, highlighting occlusion as a major remaining challenge.
- 6-DoF Grasp Detection Benchmark:
  - Models: Contact-GraspNet, GraspNet-Baseline, ScaleBalancedGrasp, and EconomicGrasp (all trained on GraspNet-1B), evaluated on both the GraspNet-1B and GraspClutter6D test sets.
  - Results: All methods showed a substantial performance drop (e.g., EconomicGrasp's AP fell from 51.63 on GraspNet-1B to 19.02 on GraspClutter6D). This confirms that GraspClutter6D poses a significant challenge for current grasp detection methods, particularly due to its clutter and diverse backgrounds and environments. ScaleBalancedGrasp performed slightly better than EconomicGrasp on GraspClutter6D, potentially because its auxiliary segmentation helps filter out background grasps.
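For the instance segmentation benchmark, the reported AP is an instance-segmentation average precision; assuming a COCO-style evaluation with ground truth and predictions exported in COCO JSON format (the file names below are placeholders), pycocotools reproduces such a metric:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names: COCO-format ground truth and per-model predictions.
coco_gt = COCO("gc6d_test_instances.json")
coco_dt = coco_gt.loadRes("mask2former_predictions.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, and AR for the mask predictions
```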
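For the pose estimation benchmark, ADD averages the distance between corresponding model points transformed by the estimated and ground-truth poses, while ADD-S uses closest-point matching for symmetric objects. Benchmark scores aggregate these per-instance distances (e.g., as an accuracy under a threshold or an area-under-curve value); only the per-instance metrics are sketched below as a minimal reference.

```python
import numpy as np
from scipy.spatial import cKDTree

def add_metric(R_est, t_est, R_gt, t_gt, model_points):
    """ADD: mean distance between corresponding model points under the
    estimated and the ground-truth pose."""
    p_est = model_points @ R_est.T + t_est
    p_gt = model_points @ R_gt.T + t_gt
    return float(np.linalg.norm(p_est - p_gt, axis=1).mean())

def adds_metric(R_est, t_est, R_gt, t_gt, model_points):
    """ADD-S: for symmetric objects, mean distance from each ground-truth point
    to the nearest point of the model under the estimated pose."""
    p_est = model_points @ R_est.T + t_est
    p_gt = model_points @ R_gt.T + t_gt
    dists, _ = cKDTree(p_est).query(p_gt, k=1)
    return float(dists.mean())
```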
Practical Implementation:
- Training Data: Developers can use GraspClutter6D to train more robust perception (segmentation, pose estimation) and grasping models that generalize better to real-world clutter, potentially reducing the sim-to-real gap compared to purely synthetic datasets.
- Benchmarking: The dataset provides a challenging benchmark for evaluating and comparing new algorithms designed for cluttered scenes. The standardized splits facilitate direct comparisons.
- Tools: The provided toolkit and annotation tools can aid researchers in utilizing the dataset and potentially annotating their own data.
- System Design: The results emphasize the need for algorithms that explicitly handle heavy occlusion and diverse environmental factors (bins, shelves, varied backgrounds), which are often simplified in other datasets. The multi-sensor data also enables exploration of sensor fusion and evaluation of robustness across sensor types (see the back-projection sketch below).
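Since every scene is captured by four sensors with different intrinsics and depth characteristics, a common first step for sensor-fusion or cross-sensor robustness studies is back-projecting each depth map into a camera-frame point cloud. A minimal sketch, assuming metric depth and a standard pinhole intrinsic matrix K (the dataset's actual file layout and units are not assumed here):

```python
import numpy as np

def depth_to_pointcloud(depth, K):
    """Back-project a depth image into an organized (H, W, 3) point cloud in the
    camera frame. `depth` holds per-pixel depth along the optical axis (zeros
    mark invalid pixels); `K` is the sensor's 3x3 pinhole intrinsic matrix."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)
```

Clouds from the different cameras can then be brought into a common frame using the dataset's camera-to-camera and camera-to-robot extrinsics before fusion or comparison.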
In conclusion, GraspClutter6D offers a valuable resource for the robotics community by providing a large-scale, diverse, and challenging real-world dataset focused on cluttered environments. Its comprehensive annotations and benchmarking highlight current limitations and provide a strong foundation for developing next-generation robotic manipulation systems.