ShapeR Evaluation Dataset

Updated 23 January 2026
  • The ShapeR Evaluation Dataset is a curated benchmark for conditional 3D object shape generation featuring real-world image sequences with challenges like uncontrolled lighting, motion blur, occlusion, and clutter.
  • It leverages advanced SLAM, instance detection, and a rigorous annotation protocol to produce high-fidelity ground-truth geometry; on this benchmark, the full ShapeR model reaches a Chamfer distance of 2.38×10⁻² and an F1 score of 0.722.
  • The dataset spans 178 diverse objects across multiple indoor scenes, providing a robust testbed that exposes failure modes in traditional, segmentation-dependent 3D reconstruction methods.

The ShapeR Evaluation Dataset is a curated benchmark for conditional 3D object shape generation from real-world, casually captured image sequences. Designed specifically to address the limitations of prior 3D shape benchmarks—many of which assume controlled acquisition and well-segmented objects—the ShapeR Evaluation Dataset introduces challenging scenarios such as uncontrolled lighting, motion blur, occlusion, and dense clutter, making it a critical testbed for robust 3D object reconstruction and generative modeling methodologies (Siddiqui et al., 16 Jan 2026).

1. Dataset Composition and Coverage

The benchmark comprises 178 unique object instances extracted from 7 distinct indoor scene traversals. Each scene typically yields 20 to 30 annotated objects, with semantic coverage including furniture (chairs, side tables), household electronics (toasters, remotes), kitchenware and décor (mugs, vases), as well as tools and miscellaneous items (screwdrivers, lamps). The object set is intentionally diverse in terms of real-world variability:

  • Object size spans from small (~5 cm, e.g., remotes) to large items (>0.5 m, e.g., chairs).
  • Shape complexity includes both simple geometric primitives and highly nonconvex objects (such as ornate lamps).
  • Capture conditions expose models to challenges such as severe occlusion (e.g., bookshelves, stacks), uncontrolled and varying illumination, motion blur, and background clutter.

This extensive diversity is intended to expose failure modes in methods reliant on clean segmentation or ideal acquisition settings, a known shortcoming of prior benchmarks.

2. Data Acquisition and Preprocessing Workflow

Acquisition utilizes Meta Project Aria Gen 1/Gen 2 AR glasses equipped with monochrome stereo cameras and IMU sensors. Users traverse each scene freely for 30 seconds to 2 minutes, producing high-frequency (20–30 Hz) image sequences at up to 1280×720 resolution. The processing pipeline is as follows:

  • Visual-inertial SLAM (Direct Sparse Odometry [Engel et al. 2017], via Aria MPS) reconstructs semi-dense metric point clouds and recovers camera intrinsics/extrinsics for every frame.
  • Instance detection (EFM3D [Straub et al. 2024]) localizes 3D bounding boxes for object candidates, generating subsets of the SLAM point cloud (P_i) for each object.
  • Point refinement is performed using SAM2, which discards outlier points from each instance’s detection.
  • View and pose selection: For each object, all frames in which its SLAM points are visible are identified; 8–16 representative images (I_i) are chosen to maximize viewpoint diversity. Binary masks (M_i) result from projecting each P_i into the image plane.
  • Captions (T_i) are generated by feeding a representative object image to a frozen Llama 4 vision–LLM (Meta 2025), yielding short descriptive sentences such as “red toaster on cluttered table”.

All evaluation images are resized to 280×280 pixels and retain their full camera calibration parameters for object-centric, multi-view analysis.
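
As a rough illustration of the mask-generation step above, the sketch below projects an object's SLAM points P_i into a single calibrated frame and rasterizes a binary mask. The pinhole-projection assumption, the dilation step, and all function and parameter names are illustrative, not part of the released pipeline.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def project_points_to_mask(P_i, K, T_world_to_cam, image_hw, dilate_px=3):
    """Project an object's SLAM points P_i (N, 3) into one frame and rasterize
    a binary mask (illustrative sketch assuming a pinhole camera model).

    K: (3, 3) camera intrinsics; T_world_to_cam: (4, 4) rigid transform.
    """
    H, W = image_hw

    # Move points into the camera frame.
    P_h = np.concatenate([P_i, np.ones((len(P_i), 1))], axis=1)    # (N, 4)
    P_cam = (T_world_to_cam @ P_h.T).T[:, :3]                      # (N, 3)
    P_cam = P_cam[P_cam[:, 2] > 1e-6]                              # keep points in front

    # Pinhole projection to pixel coordinates.
    uvw = (K @ P_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)

    # Rasterize the hits that land inside the image bounds.
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    mask = np.zeros((H, W), dtype=bool)
    mask[v[valid], u[valid]] = True

    # Dilate sparse point hits into a contiguous region (illustrative choice).
    if dilate_px > 0:
        mask = binary_dilation(mask, iterations=dilate_px)
    return mask
```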

3. Geometry Annotation Protocol and Quality

Ground-truth geometry annotation is conducted in four detailed steps:

  1. Physical relocation: Each object is moved to a clutter-free area and photographed at high resolution.
  2. Manual segmentation: The isolated object is segmented in this clean view.
  3. Image-to-3D pipeline under ideal conditions: A state-of-the-art image-to-3D system reconstructs a complete mesh (S_i).
  4. Rigid alignment and refinement: Using a web tool, annotators align S_i to match the object's pose and scale in the original casual sequence; the alignment is verified by projecting SLAM points into multiple frames.

This rigorous protocol yields high geometric fidelity, with 2D reprojection errors typically below 2 pixels and metric alignment errors under 1 cm relative to the original SLAM point cloud.
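
The quoted alignment quality can be checked in spirit with a simple nearest-neighbour test between an object's SLAM points and its aligned ground-truth mesh. The sketch below assumes the mesh is represented by densely sampled surface points and uses an illustrative 1 cm budget; it is not the annotation tool's own verification code.

```python
import numpy as np
from scipy.spatial import cKDTree

def metric_alignment_error(P_i, mesh_surface_pts, threshold_m=0.01):
    """Median distance (metres) from an object's SLAM points P_i (N, 3) to the
    aligned ground-truth mesh, represented by sampled surface points (M, 3).
    Returns the median error and whether it falls under the ~1 cm budget."""
    dists, _ = cKDTree(mesh_surface_pts).query(P_i, k=1)
    median_err = float(np.median(dists))
    return median_err, median_err < threshold_m
```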

4. Dataset Splits and Benchmarking Procedures

The primary usage scenario provides all 178 annotated objects as a held-out evaluation set. Multiple research splits are prescribed for assessing generalization:

  • Cross-scene generalization: Leave-one-scene-out (train on 6, test on 1).
  • Cross-category evaluation: Hold out all objects from a semantic category (e.g., “kitchenware”) at test time.
  • Random per-object splits: Randomly partition the 178 objects into 80% train, 10% validation, and 10% test, facilitating large-scale experimental comparisons.
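
A minimal sketch of how these splits might be materialized from per-object metadata; the record layout (keys such as scene_id) and the fixed seed are assumptions for illustration, not a prescribed format.

```python
import random

def leave_one_scene_out(objects, held_out_scene):
    """objects: list of dicts with assumed keys 'object_id' and 'scene_id'."""
    train = [o for o in objects if o["scene_id"] != held_out_scene]
    test = [o for o in objects if o["scene_id"] == held_out_scene]
    return train, test

def random_per_object_split(objects, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Random 80/10/10 train/validation/test partition of the object list."""
    rng = random.Random(seed)
    shuffled = list(objects)
    rng.shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```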

For benchmarking, each test scene supplies multi-view images, SLAM point clouds, and full camera calibration. 3D reconstruction or generative methods are run in an object-centric mode, and their output meshes Ŝ_i are quantitatively compared to the ground-truth meshes S_i.

5. Evaluation Metrics and Quantitative Results

Three principal shape evaluation metrics are used:

  • Chamfer ℓ₂ Distance (CD):

$$\mathrm{CD}(S, \hat{S}) = \frac{1}{|S|} \sum_{x \in S} \min_{y \in \hat{S}} \|x - y\|^2 + \frac{1}{|\hat{S}|} \sum_{y \in \hat{S}} \min_{x \in S} \|x - y\|^2$$

  • Normal Consistency (NC): Mean cosine similarity between normals of corresponding points.
  • F-score at 1% threshold: Harmonic mean of precision and recall, with a 0.01 normalized distance threshold.
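
For concreteness, a sketch of the Chamfer-ℓ₂ and F-score computations over points sampled from the ground-truth and predicted meshes; the sampling density, the use of SciPy KD-trees, and the function name are implementation choices, not specified by the benchmark.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_fscore(gt_pts, pred_pts, tau=0.01):
    """gt_pts, pred_pts: (N, 3) points sampled from the normalized ground-truth
    and predicted meshes; tau is the 1% distance threshold for the F-score."""
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts, k=1)   # completeness side
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts, k=1)   # accuracy side

    # Chamfer l2 distance: symmetric sum of mean squared nearest-neighbour distances.
    cd = float(np.mean(d_gt_to_pred ** 2) + np.mean(d_pred_to_gt ** 2))

    # F-score at threshold tau: harmonic mean of precision and recall.
    precision = float(np.mean(d_pred_to_gt < tau))
    recall = float(np.mean(d_gt_to_pred < tau))
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return cd, f1
```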

Average metric results (over 178 objects):

Method                    CD (×10⁻²) ↓   NC ↑    F1 ↑
EFM3D (scene fusion)      13.82          0.614   0.276
FoundationStereo (TSDF)   6.48           0.677   0.435
LIRM (segmentation)       8.05           0.683   0.384
DP-Recon (mask-based)     8.36           0.661   0.436
ShapeR (full model)       2.38           0.810   0.722

Ablation studies reveal substantial degradation when key modalities or augmentation strategies are omitted (e.g., removing SLAM points increases CD from 2.38 to 4.51).

6. Key Challenges and Methodological Insights

Critical dataset features present unique and non-trivial challenges:

  • Viewpoint sparsity and motion blur: Instances with very few high-quality images or significant camera jitter produce low-detail, incomplete reconstructions.
  • Physical contact and stacking: When objects are adjacent or in contact, geometry “bleeds” between neighbors, confounding object-centric recovery.
  • Missed or incorrect detections: Failure in 3D instance detection results in missing object reconstructions at inference time.
  • Background clutter and partial occlusions: These significantly degrade the performance of segmentation-dependent algorithms, leading to mask errors and missing regions.

Sparse SLAM point aggregation substantially mitigates the impact of clutter and occlusion by robustly encoding 3D geometry over the entire sequence rather than relying solely on per-frame evidence.
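
As a schematic of this aggregation idea (not ShapeR's actual implementation), the sketch below pools an object's SLAM points observed across the whole sequence and voxel-downsamples them, so that no single blurred or occluded frame dominates the geometry estimate.

```python
import numpy as np

def aggregate_object_points(per_frame_points, voxel_size=0.01):
    """per_frame_points: list of (N_k, 3) arrays of an object's SLAM points seen
    in frame k (assumes at least one frame observes the object). Returns a
    deduplicated, sequence-level point cloud."""
    all_pts = np.concatenate([p for p in per_frame_points if len(p) > 0], axis=0)

    # Voxel-grid downsampling: keep one representative point per occupied voxel,
    # so per-frame noise and duplicate observations collapse into a stable cloud.
    keys = np.floor(all_pts / voxel_size).astype(np.int64)
    _, unique_idx = np.unique(keys, axis=0, return_index=True)
    return all_pts[np.sort(unique_idx)]
```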

7. Relation to Other 3D Shape Datasets and Benchmarks

The ShapeR Evaluation Dataset distinguishes itself by targeting casual, in-the-wild acquisition, unlike synthetic benchmarks or controlled lab datasets. In contrast to the SHARP 2020 and human shape retrieval challenges (Saint et al., 2020; Pickup et al., 2020), which predominantly focus on texture-complete scans or nonrigid body variation under studio conditions with rigorous subject separation and partial-data simulation, ShapeR's protocol emphasizes realistic, highly variable object appearance and the geometric ambiguities present in uncurated, everyday settings. Its evaluation suite thus exposes robustness bottlenecks absent from prior object and human shape benchmarks and better supports the development of practical, generalizable 3D generative shape reconstruction methods (Siddiqui et al., 16 Jan 2026).
