
Indoor Scene Perception Bench Overview

Updated 18 December 2025
  • Indoor Scene Perception Bench is a suite of benchmarks that evaluates models’ abilities in spatial, geometric, and semantic analysis within indoor environments.
  • It covers tasks from static instance-level layout estimation to dynamic scene reasoning under challenges like occlusion and variable lighting.
  • The benchmark uses MCQ accuracy and mIoU metrics to quantify model performance, highlighting a significant gap between current MLLMs and human spatial reasoning.

Indoor Scene Perception Bench

Indoor Scene Perception Bench refers to a class of benchmarks, datasets, and evaluation protocols specifically designed to quantitatively assess the capability of machine learning models—and, increasingly, multimodal LLMs (MLLMs)—for parsing, reasoning about, and understanding the spatial, geometric, and semantic attributes of indoor environments. This includes static instance-level layout estimation, dynamic object changes, camera-centric and agent-centric spatial relations, and robust scene perception in the presence of occlusion and variable lighting, across modalities (RGB, LiDAR, mmWave radar), and along video or embodied agent trajectories. The Indoor Scene Perception Bench is both a specific evaluation suite within MMSI-Video-Bench (Lin et al., 11 Dec 2025) and an organizing term encompassing a spectrum of high-fidelity, interactive, and static benchmarks as surveyed below.

1. Core Objectives and Scope

Indoor Scene Perception Benches are constructed to probe the spatial reasoning and fine-grained perceptual abilities of models on challenging indoor video sequences, RGB-D datasets, radar frames, LiDAR point clouds, and agent trajectories with temporally complex queries. Their scope spans:

  • Static spatial inference: Instance-centric attributes such as object size, position, and inter-object relations independent of viewpoint.
  • Camera- and agent-centric spatial reasoning: Understanding left/right/front/back, relative distances, and scene composition from camera or agent perspective.
  • Dynamic scene analysis: Tracking appearance/disappearance, motion, and interaction of objects and agents across time.
  • Robustness: Evaluation under occlusion, clutter, sparsity of view, non-uniform lighting, and data modality variation (RGB, RGB-D, LiDAR, radar).

The Indoor Scene Perception Bench within MMSI-Video-Bench explicitly targets MLLMs, but the broader notion encompasses benchmarks for all model families, including classical and deep learning approaches in computer vision and robotics (Lin et al., 11 Dec 2025).

2. Dataset Construction and Design Protocols

The MMSI-Video-Bench Indoor Scene Perception Bench comprises 523 carefully annotated video clips selected from a pool of 1,106, each paired with spatial-reasoning questions authored by researchers trained in 3D vision (Lin et al., 11 Dec 2025). Videos are sourced from established indoor-scene datasets such as RoomTour3D, ScanNet, ScanNet++, 3RScan, ARKitScenes, and RealEstate10k, supplemented with proprietary in-house recordings. Clips are categorized by scene type (e.g., kitchen, living room, corridor, tabletop).

  • Sample composition: Each video is coupled to one or more human-authored, rationale-backed multiple-choice (MCQ) queries. The question types include:
    • Static-instance-centric: E.g., "Which object is closest to the white cabinet?"
    • Static-camera-centric: E.g., "From your viewpoint, which door is on your left?"
    • Dynamic-scene: E.g., "Which object enters the room between t=10s and t=25s?"
  • Annotation protocols: All questions, distractors, and answers are generated and peer-reviewed for rigor; each query is associated with explanatory rationales to unambiguously specify the required grounding.
  • Temporal structure: Videos are sampled at varying frame rates (typically 1–2 FPS for static; up to 8 FPS for dynamic events) to preserve key spatial events, with two main frame sampling regimes: Uniform-50 (50 frames per clip) and Sufficient-Coverage (all annotated frames) (Lin et al., 11 Dec 2025).
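To make the two sampling regimes concrete, the following is a minimal sketch of how such frame selection could be implemented. The function names and interfaces are illustrative assumptions, not taken from the benchmark's tooling; only the 50-frame budget and the "keep all annotated frames" behavior follow the description above.

```python
# Minimal sketch of the two frame-sampling regimes described above.
# Function and variable names are illustrative, not from the benchmark's codebase.

def uniform_50(frame_indices: list[int], budget: int = 50) -> list[int]:
    """Pick `budget` frames spread evenly across the clip (Uniform-50 regime)."""
    n = len(frame_indices)
    if n <= budget:
        return list(frame_indices)
    step = (n - 1) / (budget - 1)
    return [frame_indices[round(i * step)] for i in range(budget)]

def sufficient_coverage(frame_indices: list[int], annotated: set[int]) -> list[int]:
    """Keep every annotated frame (Sufficient-Coverage regime)."""
    return [f for f in frame_indices if f in annotated]

# Example: a 300-frame clip sampled under both regimes.
clip = list(range(300))
print(len(uniform_50(clip)))                       # 50
print(sufficient_coverage(clip, {10, 90, 240}))    # [10, 90, 240]
```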

3. Task Taxonomy and Evaluation Metrics

The Indoor Scene Perception Bench defines tasks along three axes:

  1. Static-Instance-Centric Understanding: Requires identification of absolute or relative spatial properties among objects (e.g., which chair is nearer to the table).
  2. Static-Camera/Agent-Centric: Demands correct spatial reasoning from the point of view of the camera or agent, requiring models to resolve viewpoint-relative relations.
  3. Dynamic-Scene Reasoning: Focuses on temporal events—object entry/exit, state changes, and interactions over time.

Sample distribution: 200 static-instance-centric, 165 static-camera-centric, and 158 dynamic-scene queries, for a total of 523 (Lin et al., 11 Dec 2025).
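As an illustration of how a query along these three axes might be represented, the following is a hypothetical sketch. The schema and field names, as well as the example's answer options and rationale, are assumptions; only the question text is taken from the examples above.

```python
# Hypothetical representation of one benchmark query, following the three task
# axes above; field names and example values are assumptions, not the actual schema.
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    STATIC_INSTANCE_CENTRIC = "static-instance-centric"
    STATIC_CAMERA_CENTRIC = "static-camera-centric"
    DYNAMIC_SCENE = "dynamic-scene"

@dataclass
class MCQQuery:
    video_id: str
    task_type: TaskType
    question: str
    options: list[str]
    answer_index: int          # index into `options`
    rationale: str             # human-authored explanation grounding the answer

example = MCQQuery(
    video_id="scannet_scene0000_00",          # illustrative clip identifier
    task_type=TaskType.DYNAMIC_SCENE,
    question="Which object enters the room between t=10s and t=25s?",
    options=["office chair", "vacuum robot", "cardboard box", "none of these"],
    answer_index=1,
    rationale="A vacuum robot crosses the doorway at roughly t=14s.",  # made-up rationale
)
```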

Primary evaluation: Exact-match accuracy—proportion of queries where the model's answer matches the ground truth. For detection/segmentation benchmarks, mean Intersection-over-Union (mIoU) is sometimes used:

\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c+\mathrm{FP}_c+\mathrm{FN}_c}

where C is the number of semantic classes and TP_c, FP_c, FN_c denote the per-class true positives, false positives, and false negatives.

Performance is reported for each query subtype. Human performance (≈95%) dramatically exceeds all model baselines; the best proprietary MLLM achieves 41.7% and the best open-source model 30.8% (Lin et al., 11 Dec 2025).
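Both metrics can be sketched in a few lines of code. The following is an illustrative implementation, not the benchmark's official evaluation code; note that it averages mIoU only over classes that appear in the prediction or ground truth, to avoid division by zero.

```python
# Illustrative computation of exact-match MCQ accuracy and class-averaged mIoU.
from collections import defaultdict

def exact_match_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of queries whose predicted option matches the answer key."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

def mean_iou(pred_labels: list[int], true_labels: list[int], num_classes: int) -> float:
    """mIoU = (1/C) * sum_c TP_c / (TP_c + FP_c + FN_c), over per-pixel class labels.

    Classes absent from both prediction and ground truth are skipped.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for p, t in zip(pred_labels, true_labels):
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    ious = [tp[c] / (tp[c] + fp[c] + fn[c])
            for c in range(num_classes) if (tp[c] + fp[c] + fn[c]) > 0]
    return sum(ious) / len(ious) if ious else 0.0

print(exact_match_accuracy(["B", "C", "A"], ["B", "A", "A"]))  # 0.666...
```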

4. Comparison with Related Benchmarks

The Indoor Scene Perception Bench is situated within a broader family of indoor perception benchmarks:

| Benchmark | Modality | Task Coverage | Evaluation Focus |
|---|---|---|---|
| MMSI-Video-Bench Indoor | RGB video | Layout, spatial/dynamic reasoning (MLLMs) | MCQ accuracy |
| RISE | mmWave radar | Layout, object detection | Chamfer, IoU |
| Occ-ScanNet | RGB-D, mesh | 3D occupancy, semantic segmentation | mIoU |
| OSMa-Bench | RGB-D, SLAM | Semantic mapping under variable lighting | mIoU, scene graph |
| Co-VisiON | RGB images | Co-visibility graph induction (sparse views) | Graph IoU |
| SUN RGB-D, NYU Depth V2 | RGB-D | Scene classification, segmentation | Acc., mIoU |
| OST-Bench | RGB video | Embodied QA (online, spatio-temporal) | Accuracy, MRA |

The MMSI-Video-Bench Indoor subset is unique in its focus on fine-grained, rationale-backed MCQs over real-world video, targeting both static and dynamic spatial understanding and emphasizing explanation-grounded annotation and error analysis (Lin et al., 11 Dec 2025).

5. Key Findings and Model Performance

The Indoor Scene Perception Bench exposes fundamental limitations in current MLLMs and spatial intelligence systems:

  • Human–AI gap: All tested models—proprietary and open-source—score well below human performance (human ≈95%, best proprietary 41.7%, best open-source 30.8%) (Lin et al., 11 Dec 2025).
  • Failure modes: Key bottlenecks include:
    • Geometric reasoning (e.g., mis-judging object proximity in occluded or cluttered environments).
    • Fine-grained localization/grounding, with errors in detecting or distinguishing small/similarly colored objects.
    • Prompt misalignment, such as misinterpreting viewpoint-relative concepts.
  • Ablation results: Tested strategies such as chain-of-thought prompting, alternative frame samplers, or explicit 3D spatial cues (VGGT) yield at most marginal (<1%) improvements.
  • Generalization: Models trained for spatial QA or video reasoning fail to transfer robustly to this benchmark, indicating that current datasets and training routines are insufficient for fine-grained indoor scene understanding.

6. Limitations, Open Research Challenges, and Future Directions

Limitations exposed by the Indoor Scene Perception Bench include:

  • Data diversity: Despite 523 video samples, coverage of extreme clutter, rare object categories, or complex 3D navigation scenarios is limited relative to the full real-world distribution.
  • Label granularity: Current annotations rely primarily on MCQ and high-level reasoning; pixelwise masks and object tracks are not always explicitly required or evaluated.
  • Semantic/instance ambiguity: Models struggle with overlapping and similar-sized objects, and with transferring reasoning from world-centric to camera-centric frames (see the sketch after this list).
  • Static vs. embodied: Offline video analysis does not exercise memory, goal-directed actions, or agent-specific navigation in the way online embodied benchmarks (e.g., OST-Bench (Lin et al., 10 Jul 2025)) do.
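To illustrate the world-to-camera-centric transfer referred to above, the following sketch resolves a simple "to the camera's left" relation from a camera pose. The axis convention (+x right, +y down, +z forward, as in OpenCV) and all names are assumptions for illustration; they are not tied to the benchmark's annotations.

```python
# Illustrative resolution of a camera-centric relation from a world-to-camera pose.
import numpy as np

def world_to_camera(p_world: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray) -> np.ndarray:
    """Map a 3D point from world coordinates into the camera frame: p_cam = R_wc @ p_world + t_wc."""
    return R_wc @ p_world + t_wc

def is_left_of_camera(p_world: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray) -> bool:
    """A point is 'to the camera's left' when its camera-frame x-coordinate is negative."""
    p_cam = world_to_camera(p_world, R_wc, t_wc)
    return p_cam[0] < 0

# Identity pose: the camera sits at the world origin, looking along +z.
R = np.eye(3)
t = np.zeros(3)
door = np.array([-1.5, 0.0, 3.0])      # 1.5 m to the left, 3 m ahead
print(is_left_of_camera(door, R, t))   # True
```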

Ongoing and future research directions include:

  • Model architecture improvements: Developing spatial modules with explicit 3D representations, robust viewpoint transformation, and camera/agent-centric grounding.
  • Enhanced annotation protocols: Combining MCQs with dense, high-quality instance masks and multi-layer rationales to enable finer error diagnosis.
  • Benchmark extensibility: Incorporating interactive, goal-directed scenarios; dynamic object manipulation; and integration with SLAM or 3D mapping tasks.

7. Impact and Implications for Spatial AI

The Indoor Scene Perception Bench provides a gold standard for evaluating the spatial and temporal reasoning capabilities of perception models and multimodal LLMs in the challenging context of indoor environments. By formalizing the gap between human and artificial spatial understanding in both static and dynamic settings, the benchmark motivates explicit work on geometric reasoning, data curation, annotation quality, and error interpretability in MLLMs and other spatial AI systems (Lin et al., 11 Dec 2025). A plausible implication is that solving the hardest subtypes of the benchmark will require explicit grounding, 3D scene representation, and attention to instance-level dynamics beyond what current models achieve, setting a trajectory for the next generation of embodied spatial understanding research.
