Space3D-Bench: Multimodal 3D QA Benchmark
- Space3D-Bench is a spatial 3D question answering benchmark featuring 1,000 manually curated queries across 13 Replica scenes and six spatial reasoning categories.
- It integrates four core data modalities—point clouds, RGB-D images, navigation meshes, and object detections—to enable comprehensive evaluation of 3D reasoning models.
- The benchmark includes an automated assessment pipeline in which GPT-4V grades answers against textual ground truth or reference material, plus a retrieval-augmented baseline system (RAG3D-Chat).
Space3D-Bench is a spatial 3D question answering benchmark designed to advance the evaluation and development of language and vision foundation models for reasoning about real-world 3D environments. It features 1,000 manually curated questions and answers distributed across diverse indoor Replica scenes and incorporates four heterogeneous data modalities: point clouds, posed RGB-D images, navigation meshes, and object-level 3D detections. The benchmark provides a balanced taxonomy of six spatial reasoning categories, a robust automated assessment pipeline integrating vision-language models (VLMs), and a modular open-source baseline system for context-rich 3D reasoning (Szymanska et al., 2024).
1. Motivation and Contributions
Space3D-Bench addresses key limitations of prior 3D QA datasets, which often focus narrowly (e.g., object existence, basic spatial relations) or offer limited modalities (RGB-D only or point clouds only, typically without navigable geometry). Existing benchmarks also display unbalanced question-type distributions, underrepresenting navigation-distance, predictive, and pattern-identification queries. There was previously no unified, multimodal protocol for automatically grading free-form answers using both textual and visual evidence.
The principal contributions are:
- A dataset of 1,000 spatial questions and answers across 13 Replica indoor scenes, balanced over six spatial reasoning categories derived from geographic information systems (GIS) literature.
- Integration of four core data modalities: point clouds, posed RGB-D images, navigation meshes, and object detections.
- An automatic assessment protocol employing GPT-4V for dual-mode grading: factual ground-truth checks and cross-modal answer cross-checking.
- RAG3D-Chat, a retrieval-augmented baseline agent chaining specialized modules (Image, Text, SQL, Navigation) via the Semantic Kernel framework.
- Public release of all resources to facilitate further research into robust spatial 3D question answering.
2. Dataset Construction and Taxonomy
Replica Scenes and Data Modalities
Space3D-Bench encompasses 13 Replica scenes: six multi-room environments (including apartments and a hotel) and seven single-room settings (offices and small apartments). For each scene, all four data modalities are provided:
- Point clouds and meshed reconstructions (PLY/OBJ) with per-vertex semantic labels and cleaned navigation mesh overlays.
- RGB-D images and semantic segmentations captured in posed sequences with full extrinsic/intrinsic camera metadata.
- Navigation meshes consisting of traversable polygons and graph connectivity for precise geodesic path calculations.
- Object detections stored as JSON tables, relabeled for consistency, cleaned of extraneous entities, and each assigned to its containing room using axis-aligned bounding boxes (AABB).
Additional room-level annotations include each room's bounding box, center, and dimensions; a sketch of the AABB-based room assignment follows.
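To illustrate how the object detections and room annotations might be consumed together, here is a minimal Python sketch of assigning each detection to its containing room via AABB containment of the object's center. The JSON field names (`center`, `aabb_min`, `aabb_max`, `id`) are assumptions for illustration, not the benchmark's actual schema.

```python
import json

def point_in_aabb(point, aabb_min, aabb_max):
    """Return True if a 3D point lies inside an axis-aligned bounding box."""
    return all(lo <= c <= hi for c, lo, hi in zip(point, aabb_min, aabb_max))

def assign_rooms(detections, rooms):
    """Attach a room id to each object detection by testing its center
    against each room's AABB; detections outside every room get None."""
    for det in detections:
        det["room_id"] = None
        for room in rooms:
            if point_in_aabb(det["center"], room["aabb_min"], room["aabb_max"]):
                det["room_id"] = room["id"]
                break
    return detections

# Hypothetical usage with per-scene JSON tables like those described above:
# detections = json.load(open("office_0/objects.json"))
# rooms = json.load(open("office_0/rooms.json"))
# detections = assign_rooms(detections, rooms)
```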
Question Generation Pipeline
Questions were manually authored to avoid ambiguity, with per-scene allocation as follows: 100 questions for each multi-room scene, 60 each for the apartment rooms and two of the offices, and 50 each for another pair of offices, yielding exactly 1,000 questions. Answers are provided in two formats: (a) factual text and (b) illustrative images for the creative reasoning categories.
Indoor Spatial Question Taxonomy
Derived from Schmidts & Giner's GIS-inspired schema, the taxonomy comprises six categories:
- Location: Presence and placement queries (“Which rooms have no plants?”).
- Measurement: Quantitative queries (“How many lamps in all bedrooms?”).
- Relation: Spatial relationships (“Which objects are within 2 m of the sofa?”).
- Navigation: Distance/path estimation (“How far to walk from kitchen to dining room?”).
- Pattern: Identifying uniformity/similarity (“Which rooms have the same number of chairs?”).
- Prediction: Inferring properties (“How many people could comfortably sit in the dining room?”).
Distribution is nearly uniform at ≈167 questions per category, ensuring balanced coverage.
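As an illustration only, a benchmark entry could be represented along the following lines; the field names and the example scene identifier are hypothetical and do not reflect the released file format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Category(Enum):
    LOCATION = "location"
    MEASUREMENT = "measurement"
    RELATION = "relation"
    NAVIGATION = "navigation"
    PATTERN = "pattern"
    PREDICTION = "prediction"

@dataclass
class QAEntry:
    scene: str                  # e.g. a Replica scene such as "office_0"
    category: Category          # one of the six spatial reasoning categories
    question: str
    answer_text: str            # factual ground-truth answer
    answer_images: List[str] = field(default_factory=list)  # reference images for creative categories
```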
3. Data Modalities and Annotations
The modalities facilitate distinct types of queries and module operations:
| Modality | Format/Metadata | Primary Use Cases |
|---|---|---|
| Point Clouds/Meshes | PLY/OBJ, triangle faces, semantic labels | Geometric queries, bounding box checks |
| RGB-D/Segmentation | PNG, 16-bit depth, masks, poses | Visual grounding, cross-check images |
| Object Detections | JSON, classes, L×W×H, room_id | Counts, relations, SQL queries |
| Navigation Mesh | Traversable polygons, graph | Geodesic distance, navigation planning |
Each modality is preprocessed to support robust retrieval, consistent annotation, and error minimization (e.g., class relabeling, removal of irrelevant artifact detections).
4. Automatic Assessment System
Assessment Pipeline
For each (question, system answer) pair, the system determines an assessment mode:
- Ground Truth Check (GT): Applies to factual queries (counts, distances), comparing system answers to explicit ground-truth excerpts.
- Answer Cross-check (AC): For creative/open-ended queries, utilizes reference images and example correct answers for evaluation.
The pipeline constructs specialized prompts for GPT-4V, which outputs an ACCEPT/REJECT decision and text justification. Output decisions and justifications are logged for analysis.
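The sketch below outlines this two-mode grading flow. It is a minimal, assumption-laden illustration: `call_vlm` is a hypothetical wrapper around a GPT-4V request and the prompt wording is invented; only the GT/AC split and the ACCEPT/REJECT convention come from the pipeline described above.

```python
def call_vlm(prompt, images=None):
    """Placeholder for a GPT-4V request returning the model's text reply."""
    raise NotImplementedError  # swap in an actual API call here

def build_prompt(question, system_answer, ground_truth=None, references=None):
    """Compose a grading prompt. GT mode supplies the factual ground truth;
    AC mode supplies reference answers/images instead."""
    if ground_truth is not None:          # Ground Truth Check (factual queries)
        context = f"Ground truth: {ground_truth}"
    else:                                 # Answer Cross-check (open-ended queries)
        context = f"Reference answer(s): {references}"
    return (
        "Decide whether the system answer to the question is correct.\n"
        f"Question: {question}\nSystem answer: {system_answer}\n{context}\n"
        "Reply with ACCEPT or REJECT on the first line, then a one-sentence justification."
    )

def grade(question, system_answer, ground_truth=None, references=None, images=None):
    """Return (accepted, justification) for one (question, answer) pair."""
    reply = call_vlm(build_prompt(question, system_answer, ground_truth, references), images)
    decision, _, justification = reply.partition("\n")
    return decision.strip().upper().startswith("ACCEPT"), justification.strip()
```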
Evaluation Metrics and Human Verification
Automated grading is measured by:
- Accuracy: Proportion of accepted answers.
- User-Agreement Rate (UGR): Fraction where auto decision matches the human majority.
- Weighted Agreement Score (WAS): $\mathrm{WAS} = \frac{\sum_i w_i\,\mathbb{1}[a_i = m_i]}{\sum_i w_i}$, where $w_i$ is the majority vote size for question $i$, $a_i$ the automatic decision, and $m_i$ the human-majority decision.
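A short Python sketch of how the three metrics can be computed from per-question records, assuming each record carries the automatic decision and the individual human votes (field names are illustrative):

```python
def evaluate(records):
    """records: list of dicts with keys
       'auto'  - bool, automatic ACCEPT decision for the question
       'votes' - list of bool, per-participant ACCEPT votes."""
    n = len(records)
    accuracy = sum(r["auto"] for r in records) / n   # proportion of accepted answers

    agree = weighted_agree = total_weight = 0
    for r in records:
        yes = sum(r["votes"])
        no = len(r["votes"]) - yes
        majority = yes >= no                 # human-majority decision (ties -> accept)
        weight = max(yes, no)                # majority vote size w_i
        agree += (r["auto"] == majority)
        weighted_agree += weight * (r["auto"] == majority)
        total_weight += weight

    ugr = agree / n                          # User-Agreement Rate
    was = weighted_agree / total_weight      # Weighted Agreement Score
    return accuracy, ugr, was
```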
A user study with 60 participants (40 randomized questions + 10 abstracted) achieved 97.5% absolute agreement (39/40 items), underlining system reliability. One discrepancy involved count-based queries where omission of zero-count cases was penalized by the system but accepted by humans.
5. RAG3D-Chat Baseline System
Modular Architecture
RAG3D-Chat is a Semantic Kernel agent employing four retrieval-augmented modules:
- Image Module: CLIP-based vector store for image retrieval; answers visual questions via GPT-4V.
- Text Module: ADA-002 embeddings for room descriptions; retrieves and synthesizes textual context.
- SQL Module: SQLite backend over the objects and rooms tables; uses ADA-002 embedding-based mapping together with LLM-driven query generation and execution for relational and count queries (a minimal schema sketch follows this list).
- Navigation Module: An LLM parses the query's start and goal points; straight-line or geodesic distances (via Dijkstra) are computed on the navigation mesh, as sketched below.
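To make the SQL module concrete, here is a minimal sketch of the kind of schema and count query it might issue; the table and column names are assumptions rather than the released schema.

```python
import sqlite3

# Hypothetical schema mirroring the objects/rooms tables described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rooms   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE objects (id INTEGER PRIMARY KEY, class TEXT,
                          room_id INTEGER REFERENCES rooms(id));
""")

# The kind of count query an LLM might generate for
# "How many lamps are there in each bedroom?".
# The LEFT JOIN keeps rooms with zero lamps in the result, which is exactly
# the zero-count edge case noted in the failure analysis below.
query = """
    SELECT r.name, COUNT(o.id) AS lamp_count
    FROM rooms r
    LEFT JOIN objects o ON o.room_id = r.id AND o.class = 'lamp'
    WHERE r.name LIKE 'bedroom%'
    GROUP BY r.id;
"""
for name, lamp_count in conn.execute(query):
    print(name, lamp_count)
```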
The planner decomposes each question, executes module chains (e.g., SQL→Text→Navigation), and integrates outputs.
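And a minimal sketch of the Navigation Module's geodesic computation, assuming the navigation mesh has already been converted into a weighted graph over polygon centers (the graph-construction step and the networkx dependency are assumptions):

```python
import math
import networkx as nx

def build_navmesh_graph(centers, adjacency):
    """centers: {polygon_id: (x, y, z) center point};
    adjacency: iterable of (polygon_id, polygon_id) pairs for traversable
    neighbours. Edge weights are Euclidean distances between centers."""
    g = nx.Graph()
    for a, b in adjacency:
        g.add_edge(a, b, weight=math.dist(centers[a], centers[b]))
    return g

def geodesic_distance(g, centers, start, goal):
    """Approximate walking distance between two 3D points by snapping each
    to its nearest navmesh polygon and running Dijkstra on the graph."""
    nearest = lambda p: min(centers, key=lambda i: math.dist(centers[i], p))
    return nx.dijkstra_path_length(g, nearest(start), nearest(goal), weight="weight")
```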
Quantitative Performance
On the full benchmark (1,000 questions), RAG3D-Chat attains 66.8% accuracy. Category-wise performance is as follows (approximate):
- Location: ~70%
- Measurement: ~68%
- Relation: ~65%
- Navigation: ~75% (highest)
- Pattern: ~62%
- Prediction: ~51% (lowest; 76 failures)
Major failure modes include undercounting in prediction questions (estimating capacity from floor area rather than counting furniture) and SQL zero-count edge cases being misread as statements about object existence.
6. Limitations and Future Directions
Identified limitations include:
- Restricted to Replica scenes; lacks the diversity of multi-floor or outdoor environments.
- Navigation questions currently return only distances, omitting trajectory or turn-by-turn instructions.
- Some natural-language ambiguity persists (e.g., “at least X chairs” vs. “exactly X chairs”).
- The baseline system exhibits a gap between semantic and quantitative reasoning, most notably the absence of textual object attributes in its quantitative (SQL) module.
Prospective improvements entail:
- Extending navigation to trajectory generation and instruction synthesis.
- Enriching the SQL schema with textual object attributes (enabling queries such as “Which blue chairs are closest to the table?”).
- Enlarging user studies to refine acceptance protocols.
- Exploring end-to-end multimodal transformers and improved grounding for predictive reasoning.
- Adapting Space3D-Bench to alternative scene datasets (e.g., ScanNet, 3RScan) to probe benchmark generalization.
Space3D-Bench establishes a rich, modality-diverse, and balanced platform for spatial 3D question answering, facilitating progressive research through its open resources, automated VLM-based assessment framework, and modular baseline agent (Szymanska et al., 2024).