SceneFun3D: 3D Functional Part Benchmark
- SceneFun3D is a large-scale, annotated dataset that provides detailed 3D reconstructions, segmentation masks, and functional scene graphs for indoor environments.
- It incorporates multi-modal data with up to 13 million points per scene, enabling precise localization and segmentation of tiny functional object parts based on natural language commands.
- The extended version introduces functional 3D scene graphs, advancing open-vocabulary reasoning and facilitating research in robotic manipulation and embodied AI.
SceneFun3D is a large-scale dataset designed to benchmark fine-grained functionality understanding and open-vocabulary segmentation in real-world 3D indoor environments. The dataset was developed to facilitate research into the localization and identification of functional interactive object parts (such as handles, knobs, buttons, and switches) based on natural-language task descriptions. SceneFun3D includes a comprehensive annotation schema focused on functional subparts within domestic and office scenes, high-density 3D reconstructions, and multi-modal sensory data. An extended version of SceneFun3D introduces functional 3D scene graphs, enabling the study of object-part relationships and open-vocabulary reasoning.
1. Dataset Composition and Scope
SceneFun3D comprises 230 real-world indoor scenes captured from environments including bedrooms, living rooms, kitchens, and office areas. Each scene is densely reconstructed with up to 13 million 3D points (split-dependent), sourced from multi-view high-resolution RGB images and registered depth maps. The dataset provides per-scene:
- Fused 3D point clouds (.ply or .pcd).
- Multi-view RGB frames in PNG format.
- Registered depth maps.
- Precise camera intrinsics and extrinsics (poses).
- Annotations in JSON format, linking natural-language task descriptions to 3D masks.
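The snippet below is a minimal loading sketch; the directory layout, file names (pointcloud.ply, annotations.json), and JSON field names (description, mask_indices) are illustrative assumptions rather than the official toolkit API.

```python
import json
from pathlib import Path

import open3d as o3d

# Hypothetical per-scene layout; the official release may organize files differently.
scene_dir = Path("scenefun3d/scene_0001")

# Load the fused point cloud (Open3D reads both .ply and .pcd).
cloud = o3d.io.read_point_cloud(str(scene_dir / "pointcloud.ply"))
print(f"{len(cloud.points):,} points loaded")

# Load the JSON annotations linking natural-language commands to 3D masks.
with open(scene_dir / "annotations.json") as f:
    annotations = json.load(f)

for task in annotations:               # assumed: a list of task records
    command = task["description"]      # natural-language command
    point_ids = task["mask_indices"]   # point indices of the functional part(s)
    print(f"{command!r}: {len(point_ids)} annotated points")
```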
Task descriptions average 15 per scene, totaling just over 3,000 natural-language commands. These commands require the agent to infer, identify, and segment fine-grained functional parts, which typically occupy a small fraction (<0.5%) of total scene points and are often described without explicit sub-object names. Functional categories include handles (drawer, cabinet), knobs (radiator, volume), buttons, and switches, distributed so that no single category dominates: drawer handles (~30%), light switches (~20%), door handles (~15%), knobs (~10%), and miscellaneous buttons (~25%). Split 0 contains 30 scenes with up to 8 million points each, while Split 1 contains 200 denser and more cluttered scenes with up to 13 million points each.
2. Annotation Methodology and Task Definition
SceneFun3D annotations are performed by expert human annotators using interactive segmentation tools on multi-view RGBD data:
- Each natural language command (e.g., “open the bottom drawer of the nightstand with the red lamp on top”) defines a task of localizing the relevant “functional object” (sub-part F) and the host “parent object” (O).
- Ground-truth 3D masks mark all points belonging to the functional part(s) referenced in the command. For tasks referring to multiple identical parts (e.g., “turn the knobs on the stove”), all relevant instances are masked.
- Multi-view masks are lifted to the point cloud and cross-checked for consistency. Ambiguous annotation cases are resolved by manual review.
- Each command is semantically mapped to its F–O pair in metadata.
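As a concrete picture of this mapping, a single command's metadata might be represented as below; the field names are illustrative assumptions, not the released schema.

```python
# Illustrative task record linking a command to its F–O pair (field names are assumed).
task_record = {
    "description": "open the bottom drawer of the nightstand with the red lamp on top",
    "functional_part": {                         # F: interactive sub-part to segment
        "label": "drawer handle",
        "mask_indices": [10234, 10235, 10236],   # point indices into the fused cloud
    },
    "parent_object": {                           # O: host object referenced by the command
        "label": "nightstand",
    },
}
```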
The extended SceneFun3D dataset introduces annotations for functional 3D scene graphs. Twenty selected scenes (8 validation, 12 test) from the original set receive additional annotations specifying not only the segmentation masks, but also the explicit directed functional relationships (edges) between interactive elements and objects. The graph schema is defined as $\mathcal{G} = (\mathcal{O}, \mathcal{I}, \mathcal{E})$, where $\mathcal{O}$ denotes objects, $\mathcal{I}$ interactive elements, and $\mathcal{E}$ directed edges capturing the functional relationship as a natural-language label (e.g., “opens,” “powers”).
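A minimal in-memory form of this schema, with type and field names chosen for illustration only, could look as follows.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: int
    kind: str                  # "object" or "interactive_element"
    label: str                 # open-vocabulary label, e.g. "cabinet", "handle"
    point_indices: List[int] = field(default_factory=list)


@dataclass
class Edge:
    element_id: int            # interactive element (source node)
    object_id: int             # object it acts upon (target node)
    relation: str              # free-form label, e.g. "opens", "powers"


@dataclass
class FunctionalSceneGraph:
    nodes: Dict[int, Node]     # both objects and interactive elements
    edges: List[Edge]          # directed element -> object relationships
```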
3. Data Modalities, Formats, and Access
SceneFun3D provides multiple sensor modalities per scene:
- High-density, fused 3D point clouds.
- Multi-view RGB frames (PNG) and depth maps.
- Exact camera models (poses as TXT or JSON).
- Per-point or per-index functional 3D masks.
- Functional graph JSON files in the extended version, encoding node metadata (type, label, 3D bounding box, associated points, view-masks) and relationship edges (element, object, free-form label, confidence for remote links).
No further preprocessing is needed beyond standard pose–depth–color alignment; RGBD images and point clouds are co-registered and undistorted. SceneFun3D and its extensions are available for public download via their respective project pages, with sample scripts for visualization and usage.
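For readers who want to re-derive the fused geometry from the per-frame data, the sketch below shows a standard pinhole back-projection; it assumes depth stored in millimetres and camera-to-world poses, which may differ from the conventions of the actual release.

```python
import numpy as np


def backproject_depth(depth, K, T_world_cam, depth_scale=1000.0):
    """Back-project a registered depth map to world-frame 3D points.

    depth:        (H, W) depth image (assumed to be in millimetres).
    K:            (3, 3) camera intrinsic matrix.
    T_world_cam:  (4, 4) camera-to-world pose for this frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth / depth_scale                          # metres
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (T_world_cam @ pts_cam.T).T[:, :3]
    return pts_world[z.reshape(-1) > 0]              # drop invalid (zero-depth) pixels
```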
Example loading and visualization of the graph structure relies on common libraries, such as Open3D for rendering point clouds and bounding boxes, and the standard Python JSON utilities for parsing annotation files.
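A minimal sketch along those lines follows; the annotation file name and JSON fields (nodes, bbox, type, edges, element, object, label) are assumptions inferred from the description above, not the official format.

```python
import json

import numpy as np
import open3d as o3d

# Load the fused point cloud and an (assumed) functional scene-graph JSON file.
cloud = o3d.io.read_point_cloud("scene_0001/pointcloud.ply")
with open("scene_0001/functional_graph.json") as f:
    graph = json.load(f)

geometries = [cloud]
for node in graph["nodes"]:
    # Each node is assumed to carry an axis-aligned 3D bounding box [min_xyz, max_xyz].
    bbox_min, bbox_max = node["bbox"]
    box = o3d.geometry.AxisAlignedBoundingBox(np.array(bbox_min), np.array(bbox_max))
    box.color = (1.0, 0.0, 0.0) if node["type"] == "interactive_element" else (0.0, 0.6, 0.0)
    geometries.append(box)

# Print the directed functional relationships (interactive element -> object).
for edge in graph["edges"]:
    print(edge["element"], f"--{edge['label']}-->", edge["object"])

o3d.visualization.draw_geometries(geometries)
```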
4. Benchmarking Protocols and Performance Metrics
SceneFun3D evaluation uses standard and adapted 3D instance segmentation metrics:
- Average Precision at IoU threshold $\tau$ ($\mathrm{AP}_\tau$), reporting $\mathrm{AP}_{25}$ ($\tau = 0.25$) and $\mathrm{AP}_{50}$ ($\tau = 0.5$), as well as mean AP (mAP) averaged over a range of IoU thresholds (commonly $\tau \in [0.5, 0.95]$ in steps of $0.05$).
- Average Recall at IoU threshold $\tau$ ($\mathrm{AR}_\tau$) and mean AR (mAR).
- Mean Intersection-over-Union (mIoU), where per-task $\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$ for predicted mask $P$ and ground-truth mask $G$.
- Precision@k for retrieval-based evaluations: $\mathrm{Precision@}k = \frac{\#\{\text{relevant results in top } k\}}{k}$.
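As a concrete illustration of these definitions (a sketch of the metric arithmetic, not the official evaluation code), per-mask IoU and Precision@k can be computed directly from point-index sets:

```python
def mask_iou(pred_indices, gt_indices):
    """IoU between predicted and ground-truth functional-part masks (point-index sets)."""
    pred, gt = set(pred_indices), set(gt_indices)
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved candidates that are relevant."""
    relevant_set = set(relevant)
    return sum(1 for r in retrieved[:k] if r in relevant_set) / k


# A predicted mask counts as a true positive at threshold tau if its IoU with an
# unmatched ground-truth mask is at least tau (e.g., tau = 0.25 for AP25, 0.5 for AP50).
print(mask_iou([1, 2, 3, 4], [3, 4, 5]))          # 0.4
print(precision_at_k(["a", "b", "c"], ["b"], 2))  # 0.5
```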
The extended SceneFun3D introduces Recall@K metrics for graph-based tasks, including Node Recall@K, Triplet Recall@K, Node-Association Recall, and Edge-Prediction Recall, with open-vocabulary retrieval using CLIP (for node labels) and BERT (for relation labels). Reported results include overall node R@3 of 73.0% (object: 81.8%; element: 71.0%) and triplet R@5 of 60.4%. Comparative baselines (e.g., Open3DSG, ConceptGraph) are significantly outperformed by the foundation-model-driven approaches.
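A simplified Recall@K computation over node labels, with the embedding model left abstract (the benchmark uses CLIP for node labels and BERT for relation labels), might look like the following sketch; the ranking helper assumes pre-computed, hypothetical embedding arrays.

```python
import numpy as np


def recall_at_k(gt_labels, ranked_predictions, k):
    """Fraction of ground-truth nodes whose label appears among their top-k predictions."""
    hits = sum(1 for gt, preds in zip(gt_labels, ranked_predictions) if gt in preds[:k])
    return hits / len(gt_labels)


def rank_labels_by_similarity(query_embedding, label_embeddings, labels):
    """Rank candidate open-vocabulary labels by cosine similarity to a query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    L = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    order = np.argsort(-(L @ q))
    return [labels[i] for i in order]
```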
5. Extension: Functional 3D Scene Graphs
The functional scene graph extension annotates selected scenes with explicit graphs capturing object-part relational structure. Nodes are categorized as objects (e.g., cabinet, TV) or interactive elements (e.g., handles, switches), and relationships are either “local” (rigidly attached, e.g., door–handle) or “remote” (operates at a distance, e.g., wall switch → ceiling light). Open-vocabulary labeling is used for both nodes and relationships, supporting unconstrained linguistic expressions.
Annotations are performed via a specialized interface, enabling annotators to traverse the 3D point cloud, assign labels, and draw functional edges. Human-in-the-loop procedures include referencing egocentric interaction videos to resolve ambiguities, particularly for remote-control relationships. No formal metrics for inter-annotator agreement are reported.
The released data include per-frame segmentation masks, bounding boxes, and JSON scene graphs.
6. Downstream Applications and Research Implications
SceneFun3D and its extensions enable several research directions:
- Functional Part Segmentation: Direct benchmarking of functional part localization and segmentation driven by natural language in dense, cluttered, real-world 3D scenes.
- Embodied AI and Task Execution: Supports development of agents that must reason over spatial and linguistic cues, e.g., inferring that “flip the light switch” is the action that satisfies “turn on the ceiling light.”
- Functional Scene Graph Reasoning: Facilitates functional 3D scene graph prediction, supporting queries that require identifying both objects and controlling interactive elements, critical for robotic manipulation and 3D question answering.
- Foundation Model Evaluation: Benchmarks open-vocabulary, training-free pipelines leveraging visual and language foundation models, with performance metrics showing strong recall and reasoning benefits from strategies such as GroundingDINO prompt design and sequential reasoning.
The dataset structure and evaluation protocols provide a rigorous testbed for spatial-linguistic and functional reasoning, particularly valuable for developing manipulation-capable agents in real-world, unstructured environments.
7. Dataset Statistics and Notable Properties
The dataset’s fine-grained focus is underscored by key statistics:
| SceneFun3D Property | Split 0 | Split 1 |
|---|---|---|
| Number of scenes | 30 | 200 |
| Average task descriptions per scene | 15 | 15 |
| Typical point cloud size (millions) | ≤8 | ≤13 |
| Average RGBD frames per scene | 1,800 | 1,800 |
| Average functional part size (% total) | <0.5% | <0.5% |
| Scene density and clutter | Lower | Higher |
Category frequencies are consistent across splits: drawer handles (∼30%), light switches (~20%), door handles (~15%), knobs (~10%), and buttons (~25%). Average per-scene complexity is higher in Split 1, reducing baseline performance due to increased clutter.
Histograms confirm the distribution of tasks per scene is tightly peaked at 15; category coverage avoids dominance by any single functional type. Box-and-whisker visualizations of point-cloud density illustrate the challenge posed by the large number of points and the minute fraction occupied by functional targets.
A plausible implication is that SceneFun3D’s structure and annotation density make it a uniquely difficult benchmark for current and future open-vocabulary segmentation methods, demanding advances in both geometric understanding and pragmatic language interpretation.