Industrial Metallic Dataset (IMD)

Updated 22 September 2025

Industrial Metallic Dataset (IMD) is a benchmark of 45 true-to-scale metallic industrial objects with CAD models for rigorous segmentation and 6D pose estimation testing.
It supports tasks such as video object segmentation, 6D pose tracking, and one-shot pose estimation using varied camera trajectories and natural indoor lighting.
The dataset exposes current algorithm limitations in reflective, low-texture environments and drives future research in robust industrial robotic perception.

The Industrial Metallic Dataset (IMD) is a benchmark for object segmentation and 6-degree-of-freedom (6D) pose estimation tasks tailored to the challenges of industrial robotics scenarios involving metallic, texture-less, and highly reflective components. The dataset is specifically designed to expose and quantify the limitations of existing algorithms—primarily developed on household or everyday items—in industrial contexts where robust perception is essential for manipulation and automation.

1. Dataset Composition and Acquisition

IMD comprises 45 true-to-scale metallic industrial objects representative of machine-tending environments. Object diameters span from 1.94 cm to 13.2 cm (mean ≈ 5.71 cm, standard deviation ≈ 2.62 cm), with each object accompanied by a CAD model, enabling both annotation and the potential for synthetic data generation.

Data acquisition utilizes an Intel RealSense D405 RGB-D camera (1280 × 720 px, 87° × 58° FOV, 7–50 cm range) under natural indoor daylight. The dataset explores a range of object arrangements reflecting real-world industrial scenarios: isolated single objects, groups of similarly shaped items, mixed random groups, and scenarios containing all objects in a cluttered assembly. Objects are placed atop a matte-gray surface resembling conveyor belts.

Camera trajectories are systematically varied:

A top-down view traverses a square path, capturing 200 frames per sequence at 0.03 s intervals.
Surround views employ a circular path at a 45° inclination, also capturing 200 frames at 0.05 s intervals.
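As a quick sanity check on the timing above, the following sketch computes the duration and nominal frame rate of each sequence. It assumes the stated interval is the constant spacing between consecutive frames, which the source does not say explicitly; the function name `sequence_timing` is illustrative only.

```python
def sequence_timing(num_frames: int, interval_s: float):
    """Return (duration in seconds, nominal frames per second)."""
    # N frames are separated by N - 1 intervals.
    duration_s = (num_frames - 1) * interval_s
    fps = 1.0 / interval_s
    return duration_s, fps


top_down = sequence_timing(200, 0.03)  # square top-down path
surround = sequence_timing(200, 0.05)  # circular path at 45 degrees

print(f"top-down: {top_down[0]:.2f} s at {top_down[1]:.1f} fps")
print(f"surround: {surround[0]:.2f} s at {surround[1]:.1f} fps")
```

Under this assumption, a top-down sequence lasts just under 6 s and a surround sequence just under 10 s.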

The finished dataset contains 55 distinct scenarios corresponding to 110 videos and 256 object sequences with full annotation.

2. Benchmark Tasks and Methodologies

IMD supports three major tasks critical for robotic perception:

Task Type	Evaluated Methods	Metric(s)
Video Object Segmentation	XMem, SAM2	Intersection over Union (IoU)
6D Pose Tracking	BundleTrack, BundleSDF	Translation Error (mm), Rotation Error (°)
One-Shot 6D Pose Estimation	BundleTrack, BundleSDF (partial memory lockout)	Translation Error (mm), Rotation Error (°)

a) Video Object Segmentation prompts models with an annotated mask in the first frame, requiring mask propagation throughout the video. IoU (Jaccard Index) quantifies segmentation quality: $IoU = \frac{|B_{p} \cap B_{gt}|}{|B_{p} \cup B_{gt}|}$ where $B_{p}$ and $B_{gt}$ are predicted and ground-truth masks, respectively.
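The IoU formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the benchmark's evaluation code: the function names are hypothetical, and treating two empty masks as a perfect match is an assumption the source does not address.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU (Jaccard index) between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)


def mean_iou(pred_masks, gt_masks) -> float:
    """Average per-frame IoU over a propagated video sequence."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))


# Toy 4x4 example: two 2x3 rectangles shifted by one column.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:3] = True
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 1:4] = True
iou = mask_iou(pred, gt)  # 4 shared pixels out of 8 in the union
```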

b) 6D Pose Tracking requires temporal estimation of object pose via RGB-D input. Ground-truth object pose in the camera frame is determined using

$T_{c}^{o} = (T_{w}^{c})^{-1} T_{w}^{o}$

where $T_{w}^{c}$ and $T_{w}^{o}$ are the camera and object poses in the world frame, represented as homogeneous transforms (rotation $R$, translation $t$). Translation error is computed as Euclidean centroid distance; rotation error is the angular difference between ground-truth and prediction.
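The ground-truth transform and the two error metrics can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes the object frame origin sits at the object centroid (so the translation component stands in for the centroid), and it takes "angular difference" to mean the geodesic angle of the relative rotation. All function names are hypothetical.

```python
import numpy as np


def make_pose(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def invert_pose(T):
    """Invert a rigid transform using R^T instead of a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    return make_pose(R.T, -R.T @ t)


def object_in_camera(T_w_c, T_w_o):
    """T_c^o = (T_w^c)^{-1} T_w^o."""
    return invert_pose(T_w_c) @ T_w_o


def translation_error(T_gt, T_pred):
    """Euclidean distance between translations (same unit as the poses)."""
    return float(np.linalg.norm(T_gt[:3, 3] - T_pred[:3, 3]))


def rotation_error_deg(T_gt, T_pred):
    """Angle of the relative rotation R_gt^T R_pred, in degrees."""
    R_rel = T_gt[:3, :3].T @ T_pred[:3, :3]
    cos_angle = (np.trace(R_rel) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def rot_z(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy example in millimetres: prediction is off by 5 mm and 10 degrees.
T_w_c = make_pose(np.eye(3), [0.0, 0.0, 300.0])
T_w_o = make_pose(rot_z(30.0), [100.0, 0.0, 0.0])
T_c_o_gt = object_in_camera(T_w_c, T_w_o)
T_c_o_pred = make_pose(rot_z(40.0), [105.0, 0.0, -300.0])
t_err = translation_error(T_c_o_gt, T_c_o_pred)
r_err = rotation_error_deg(T_c_o_gt, T_c_o_pred)
```

Clipping the cosine before `arccos` guards against values slightly outside [-1, 1] caused by floating-point round-off.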

c) One-Shot 6D Pose Estimation simulates practical scenarios with minimal prior context. Models are initialized on the first half of the sequence, then forced to estimate pose on subsequent frames with memory updates disabled. This isolates robustness to unseen views and changing appearances in single observations.
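The one-shot protocol described above can be sketched as a two-phase loop. The `PoseTracker` interface, its `step` method, and the `update_memory` flag are hypothetical stand-ins introduced for illustration; they are not the actual APIs of BundleTrack or BundleSDF. Splitting at exactly half the frame count is also an assumption.

```python
from typing import Any, List, Protocol, Sequence


class PoseTracker(Protocol):
    """Hypothetical tracker interface, not a real library API."""

    def step(self, frame: Any, update_memory: bool) -> Any: ...


def run_one_shot(tracker: PoseTracker, frames: Sequence[Any]) -> List[Any]:
    """Initialize on the first half, then estimate with memory frozen."""
    split = len(frames) // 2
    # Phase 1: normal tracking, memory is allowed to grow.
    for frame in frames[:split]:
        tracker.step(frame, update_memory=True)
    # Phase 2: memory updates disabled; only these predictions are scored.
    return [tracker.step(frame, update_memory=False) for frame in frames[split:]]


class CountingTracker:
    """Toy stand-in that records which frames entered its memory."""

    def __init__(self) -> None:
        self.memory: List[Any] = []

    def step(self, frame: Any, update_memory: bool) -> int:
        if update_memory:
            self.memory.append(frame)
        return len(self.memory)  # placeholder for a pose estimate


tracker = CountingTracker()
predictions = run_one_shot(tracker, list(range(200)))
```

With a 200-frame sequence, the toy tracker's memory holds only the first 100 frames, and the 100 scored predictions are made without any further updates.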

3. Evaluation and Comparative Analysis

IMD proves markedly more challenging than household-focused benchmarks for every evaluated method.

Segmentation: On DAVIS-2017 (household objects), XMem and SAM2 achieve mean IoU scores of 0.863 and 0.893, respectively (recall at 0.5 IoU = 1.0). On IMD, mean IoU drops to 0.746 for XMem (recall 0.922) and 0.770 for SAM2 (recall 0.980). SAM2 is the more resilient of the two to specular and textureless artifacts, as reflected in its tighter IoU distribution.
6D Pose Tracking: On YCB-Video, BundleTrack and BundleSDF yield translation errors of 2.26 mm and 5.64 mm and rotation errors of 4.48° and 8.09°, respectively. On IMD (top-down view), BundleTrack's errors rise to 6.61 mm and 8.12°, and BundleSDF's to 8.82 mm and 13.08°. Errors deteriorate further in the angled view (BundleTrack: 32.23 mm, 49.17°). BundleTrack shows tighter error distributions and lower variance.
One-Shot Pose Estimation: BundleSDF is more robust than BundleTrack in strictly memory-isolated settings; however, both methods show substantial error increases relative to their full-tracking results (e.g., +20.6% translation error and +117.7% rotation error on YCB-Video for BundleSDF). IMD amplifies the difficulty through greater view and lighting variability.

The consistent performance degradation across all tasks and models on IMD underscores the impact of strong specular reflections, ambiguous contours, and lack of texture.
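To put the reported figures on a common scale, the sketch below computes the relative change from each household benchmark to IMD (top-down view for pose tracking). The numbers are copied from the text above; the comparison crosses datasets, so it is indicative only. The `recall_at` helper assumes recall means the fraction of IoU values above the threshold, a common convention that the source does not spell out.

```python
def relative_change(reference: float, value: float) -> float:
    """Signed change of value relative to reference."""
    return (value - reference) / reference


# (household benchmark, IMD) pairs as reported in the text.
reported = {
    "XMem mean IoU": (0.863, 0.746),
    "SAM2 mean IoU": (0.893, 0.770),
    "BundleTrack translation (mm)": (2.26, 6.61),
    "BundleTrack rotation (deg)": (4.48, 8.12),
    "BundleSDF translation (mm)": (5.64, 8.82),
    "BundleSDF rotation (deg)": (8.09, 13.08),
}
changes = {name: relative_change(ref, imd) for name, (ref, imd) in reported.items()}


def recall_at(ious, threshold: float = 0.5) -> float:
    """Assumed definition: share of IoU values strictly above the threshold."""
    ious = list(ious)
    return sum(1 for v in ious if v > threshold) / len(ious)


for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

By this measure, mean IoU falls by roughly 14% for both segmenters, while BundleTrack's translation error nearly triples between YCB-Video and the IMD top-down view.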

4. Technical Challenges of Industrial Metallic Objects

IMD’s design strategically foregrounds visual phenomena prevalent in industrial robotics:

Reflectivity leads to unreliable or missing depth measurements in typical RGB-D sensors.
Low Texture impairs feature matching algorithms (e.g., LF-Net in BundleTrack, LoFTR in BundleSDF).
Occlusion and Variable Illumination provoke drastic appearance changes, especially in angled camera trajectories, severely affecting pose estimation stability.
Size and Arrangement Diversity challenge spatial generalization and multi-object localization.

The dataset forces methods to contend with the feature sparsity, shadow artifacts, specular hotspots, and pose ambiguities intrinsic to authentic industrial environments.
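One way to quantify the depth failures caused by reflectivity is to measure how many pixels inside an object mask lack a usable depth value. The sketch below is an illustration under stated assumptions, not part of the IMD tooling: it assumes depth is given in metres, that missing readings appear as zero or non-finite values, and that the usable range is the 7–50 cm stated for the camera. The function name is hypothetical.

```python
import numpy as np


def depth_dropout_ratio(depth_m, mask, near_m: float = 0.07, far_m: float = 0.50) -> float:
    """Fraction of masked pixels with missing or out-of-range depth."""
    mask = np.asarray(mask, dtype=bool)
    n_pixels = int(mask.sum())
    if n_pixels == 0:
        raise ValueError("mask is empty")
    d = np.asarray(depth_m, dtype=float)[mask]
    # Zero and NaN/inf readings fail these checks and count as dropouts.
    valid = np.isfinite(d) & (d >= near_m) & (d <= far_m)
    return 1.0 - float(valid.sum()) / n_pixels


# Toy example: a 2x2 object region with one missing depth reading.
depth = np.full((4, 4), 0.30)
depth[1, 1] = 0.0  # simulated specular dropout
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
ratio = depth_dropout_ratio(depth, mask)
```

A per-object ratio of this kind could help separate pose failures driven by missing depth from those driven by low texture.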

5. Implications for Robotic Perception and Industrial Automation

Current algorithms, while effective on conventional benchmarks, exhibit substantial performance degradation on IMD. This result illuminates a critical gap: household-derived training fails to generalize to the metallic, textureless, and cluttered domains encountered in advanced manufacturing and automation.

IMD therefore functions both as a diagnostic tool for existing segmentation and pose models and as a challenge for future techniques. Precise 6D pose estimation and segmentation of metallic objects are pivotal for:

Robotic manipulation (e.g., bin picking, assembly)
Pose-based process monitoring
Autonomous machine-tending in conveyor-driven industrial lines

Robustness to variable lighting, specular reflection, and low-contrast scenarios remains an unsolved problem that IMD sharply delineates.

6. Prospects for Future Research and Dataset Expansion

The IMD benchmark motivates several research avenues:

Algorithmic advances in segmentation and pose estimation that leverage cues invariant to highlight and texture distortions
Development of novel feature descriptors and sensor fusion techniques targeting RGB-D failure modes
Synthetic data integration and generative refinement for training robustness
Extension of IMD with broader arrangements, diverse lighting regimes, and additional industrial materials and components

The paper encourages IMD’s adoption as a baseline for industrial perception methods and as a rigorous standard for hypothesis refinement in transformer-based, diffusion, or foundation model frameworks.

7. Contextual Role Among Industrial Benchmarks

IMD complements existing datasets such as BIDCD (Botach et al., 2021), the Dataset of Industrial Metal Objects (Roovere et al., 2022), and HSS-IAD (Wang et al., 17 Apr 2025), each of which addresses specific aspects of industrial imaging and anomaly detection. IMD's emphasis on metallic objects, correspondence between real objects and their CAD models, and multi-model evaluation positions it as a resource for research on the domain transfer problem of moving from consumer-oriented training to industrial deployment.

It provides critical ground truth for segmentation, pose-tracking, and single-shot estimation under conditions emblematic of real-world manufacturing, catalyzing development and comparison of next-generation industrial vision algorithms (Ma et al., 15 Sep 2025).