
3DSSG: 3D Scene Graph Dataset

Updated 22 November 2025
  • The 3DSSG dataset is a comprehensive semantic graph resource that represents real-world 3D scenes using RGB-D scans and structured scene graphs.
  • It features detailed annotations of objects and spatial relationships, enabling advanced graph-based learning and relational reasoning.
  • The dataset supports multiple research tasks such as semantic segmentation, scene graph generation, and object detection with standardized benchmarks.

The 3DSSG dataset is a large-scale semantic graph resource constructed over real-world 3D scenes, designed for the study and evaluation of 3D semantic scene understanding, object recognition, and spatial reasoning. It enables research into structured graph-based approaches for representing complex spatial relationships between objects in three-dimensional environments, supporting downstream tasks such as semantic segmentation, scene graph generation, and relational reasoning.

1. Structural Overview and Data Schema

The 3DSSG dataset defines a consistent scene-graph formalism over RGB-D indoor scans. Each scene is represented as a directed attributed graph G = (V, E), where the vertex set V corresponds to physical objects (e.g., chair, table) and the edge set E encodes semantic and geometric relationships observed in the 3D environment. Node attributes include fine-grained semantic labels, 3D geometric parameters (bounding boxes, position, orientation), and potentially visual features extracted from RGB or depth modalities. Edge attributes represent relationship predicate types (e.g., "on-top-of," "next-to," "part-of") enriched with pairwise spatial statistics.
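
As a concrete illustration, the following minimal Python sketch shows one way such an attributed graph could be represented in code. The class and field names are illustrative assumptions for exposition, not the dataset's actual file format.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """A vertex in V: one physical object instance in the scan."""
    instance_id: int
    label: str                                # fine-grained semantic class, e.g. "armchair"
    bbox_center: tuple[float, float, float]   # 3D position of the box center
    bbox_extent: tuple[float, float, float]   # box dimensions along each axis
    orientation: float = 0.0                  # yaw about the gravity axis, in radians

@dataclass
class RelationEdge:
    """A directed edge in E: a semantic or geometric predicate between two objects."""
    subject_id: int       # instance_id of the source node
    object_id: int        # instance_id of the target node
    predicate: str        # e.g. "on-top-of", "next-to", "part-of"
    distance: float = 0.0 # example pairwise spatial statistic

@dataclass
class SceneGraph:
    """Directed attributed graph G = (V, E) for a single scan."""
    scan_id: str
    nodes: dict[int, ObjectNode] = field(default_factory=dict)
    edges: list[RelationEdge] = field(default_factory=list)
```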

The 3DSSG schema supports multi-type node (object category) and edge (relation) taxonomies, permitting integration with both single-label and multi-label scene parsing pipelines in machine learning frameworks. Each scan annotation provides jointly consistent object instances and predicate instances, ensuring referential integrity between object proposals and their associated relationships.
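
A hypothetical serialized record makes this referential integrity concrete: every relationship must refer to instance identifiers that resolve within the same scan's object list. The field names below are assumptions for illustration, not the dataset's published schema.

```python
# Hypothetical annotation record for one scan. Each relationship is a
# (subject_id, predicate, object_id) triple whose ids must resolve to
# entries in "objects" -- this is the referential integrity constraint.
scan_annotation = {
    "scan_id": "scene_0042_00",
    "objects": [
        {"id": 1, "label": "table", "bbox_center": [1.2, 0.0, 0.4]},
        {"id": 2, "label": "lamp",  "bbox_center": [1.3, 0.1, 0.9]},
    ],
    "relationships": [
        (2, "on-top-of", 1),   # the lamp rests on the table
    ],
}
```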

2. Data Acquisition and Annotation Protocol

3DSSG is constructed atop large-scale 3D scene repositories acquired using commodity RGB-D sensors (e.g., Matterport3D, ScanNet). The raw data comprises registered color images, depth maps, and reconstructed surface meshes for real indoor environments. Scene graph annotations are generated through a semi-automatic pipeline: initial object and spatial relation proposals are obtained via geometric reasoning (surface segmentation, clustering), followed by extensive human-in-the-loop refinement to assign semantic classes and relationship types to object pairs.

Each annotator resolves object-identity ambiguities, consolidates visually similar or geometrically overlapping fragments, and verifies the spatial validity of candidate relations. Quality control mechanisms include cross-annotator validation and consistency checks to preserve graph integrity (each pairwise relationship links unique object instances; no orphan nodes or dangling edges). The resulting dataset comprises thousands of fully-annotated scene graphs spanning hundreds of categories and dozens of relation types.
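
These integrity constraints are mechanically checkable. The sketch below, reusing the illustrative SceneGraph class from Section 1, flags dangling edges, self-referential relations, and orphan nodes; it is an assumed validation pass, not the dataset's published tooling.

```python
def validate(graph: SceneGraph) -> list[str]:
    """Return a list of integrity violations for one annotated scan."""
    errors = []
    referenced = set()
    for e in graph.edges:
        # Dangling edge: a relationship pointing at a missing object instance.
        if e.subject_id not in graph.nodes or e.object_id not in graph.nodes:
            errors.append(f"dangling edge {e.subject_id}->{e.object_id} ({e.predicate})")
        # Each pairwise relationship must link two distinct instances.
        if e.subject_id == e.object_id:
            errors.append(f"self-loop on instance {e.subject_id}")
        referenced.update((e.subject_id, e.object_id))
    # Orphan node: an object instance touched by no relationship at all.
    for iid, node in graph.nodes.items():
        if iid not in referenced:
            errors.append(f"orphan node {iid} ({node.label})")
    return errors
```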

3. Supported Research Tasks and Evaluation Benchmarks

3DSSG is explicitly tailored for graph-based learning paradigms that exploit the richness of 3D spatial context. Primary tasks supported include:

  • 3D semantic scene graph parsing: Predicting instance-level scene-graph structures from raw 3D data, including object categorization and relationship assignment.
  • Spatial relationship reasoning: Learning to infer geometric and semantic predicates (e.g., "above," "close-to") from object positions and attributes.
  • 3D object detection and segmentation: Joint learning of 3D bounding boxes and semantic classes with scene-graph constraints.
  • Cross-modal grounding: Associating entities from natural language with spatial graph nodes for embodied AI tasks.

Standard evaluation metrics encompass node classification accuracy, edge (relationship) prediction mean average precision (mAP), and scene graph recall at K (SGRec@K), as well as holistic graph matching scores reflecting the integrity of the predicted 3D scene structure.
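
Of these, scene graph recall at K has the most compact definition: the fraction of ground-truth (subject, predicate, object) triplets recovered among the K most confident predictions. The sketch below assumes scored triplet predictions; details such as graph-constrained versus unconstrained matching vary across evaluation protocols.

```python
def scene_graph_recall_at_k(pred_triplets, gt_triplets, k=50):
    """SGRec@K: fraction of ground-truth triplets found in the top-k predictions.

    pred_triplets: list of (score, (subject_id, predicate, object_id)) pairs
    gt_triplets:   iterable of (subject_id, predicate, object_id) triples
    """
    top_k = sorted(pred_triplets, key=lambda t: t[0], reverse=True)[:k]
    hits = {triplet for _, triplet in top_k} & set(gt_triplets)
    return len(hits) / max(len(set(gt_triplets)), 1)
```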

4. Methodological Impact and Integration with Graph Neural Architectures

The 3DSSG dataset catalyzes the development of advanced graph neural network (GNN) models for 3D structured data. Architectures such as Atomistic Line Graph Neural Networks (ALIGNN) leverage graph and line-graph (bond-angle/relationship) message passing to achieve state-of-the-art predictive performance by explicitly encoding both pairwise and higher-order geometric dependencies (Choudhary et al., 2021, Gurunathan et al., 2022). Translated to the scene setting, object nodes play the role of atoms, edge features encode spatial and geometric relations, and the line-graph construction enables explicit modeling of triplet-based interactions (i.e., object--relation--object motifs).
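
The line-graph construction has a simple combinatorial core: every relation edge of the scene graph becomes a node of the line graph, and two such nodes are connected whenever the underlying relations share an object, which is precisely what exposes object--relation--object motifs to message passing. The following is a minimal sketch of that construction, not ALIGNN's actual implementation.

```python
from itertools import combinations

def line_graph(edges):
    """Build line-graph adjacency from scene-graph edges.

    edges: list of (subject_id, predicate, object_id) triples
    returns: list of index pairs (i, j) into `edges` such that the i-th and
             j-th relations share at least one object instance
    """
    adjacency = []
    for (i, a), (j, b) in combinations(enumerate(edges), 2):
        if {a[0], a[2]} & {b[0], b[2]}:   # the two relations meet at an object
            adjacency.append((i, j))
    return adjacency
```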

The explicit availability of relation-annotated spatial graphs in 3DSSG allows these GNN variants to be trained end-to-end for relational reasoning in realistic scene settings, supporting tasks such as link prediction, relation classification, and global scene graph generation. The dataset thus provides standardized benchmarks and a unified evaluation substrate for GNN models designed to process 3D spatial relationships, with direct implications for embodied AI, robotics, and visual question answering.
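
As a baseline instance of such relational reasoning, the sketch below classifies the predicate of each edge from the features of its two endpoint objects using plain PyTorch; it is an illustrative edge classifier under assumed feature dimensions, not the architecture of any particular 3DSSG model.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Predict a predicate label for each directed edge from node features."""

    def __init__(self, node_dim: int, num_predicates: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * node_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_predicates),
        )

    def forward(self, node_feats: torch.Tensor, edge_index: torch.Tensor):
        # node_feats: (num_objects, node_dim); edge_index: (2, num_edges),
        # with row 0 holding subject indices and row 1 holding object indices.
        subj = node_feats[edge_index[0]]
        obj = node_feats[edge_index[1]]
        return self.mlp(torch.cat([subj, obj], dim=-1))   # per-edge logits
```

Training such a model against the annotated predicates reduces relation classification to supervised learning over the dataset's edge labels; link prediction can be handled analogously by scoring candidate object pairs.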

5. Limitations, Challenges, and Future Directions

While 3DSSG represents a comprehensive step towards structured modeling of 3D scene understanding, it faces several challenges. Automated annotation of relations remains difficult, especially for ambiguous spatial predicates or occluded interactions. Scene graph completeness is bounded by sensor occlusions and the practicalities of large-scale manual annotation. The current taxonomies of object categories and relations may under-represent long-tail classes or complex composite predicates observed in natural settings.

A plausible implication is that future releases may incorporate uncertainty quantification for annotations, richer attribute vocabularies, or temporal relational information spanning dynamic scenes. The growing intersection with graph neural architectures highlights the need for even more expressive, scalable, and multi-modal 3D relational datasets to drive progress in scene graph induction, comprehensive object understanding, and high-level cognitive reasoning in open-world environments (Choudhary et al., 2021, Gurunathan et al., 2022).
