Video2Layout: Metric Spatial Mapping

Updated 16 May 2026

Video2Layout is a method that converts ego-centric video sequences into metric-grounded cognitive maps using continuous geometric primitives.
It leverages both supervised and reinforcement fine-tuning to accurately predict object boundaries, distances, and spatial relations.
The approach enhances spatial reasoning for robotics, AR/VR, and indoor tracking through precise BEV scene mapping and multimodal fusion.

Video2Layout refers to a class of methods and specific frameworks that transform video data—typically ego-centric multi-frame visual sequences—into explicit, metric-grounded spatial layouts suitable for fine-grained spatial reasoning and map-based downstream tasks. In contrast to grid-map-based cognitive mapping, Video2Layout leverages continuous geometric primitives (e.g., object boundaries, room corners, 2D/3D bounding boxes) for superior quantitative spatial computation and granular object localization. This approach enables a model to infer precise distances, sizes, and spatial relations among scene entities, forming the substrate for rigorous spatial reasoning in multimodal LLMs (MLLMs), robotics, and AR/VR applications (Huang et al., 20 Nov 2025, Guo et al., 11 Mar 2025).

1. Conceptual Foundations

Video2Layout is motivated by the shortcomings of discretized raster grid-maps in cognitive scene representation. While grid-maps spatially partition scenes into fixed-size cells, they inherently lose fine spatial granularity, limiting their utility in tasks requiring metric accuracy (e.g., answering “Which object is closer to the door?” or “Estimate the table’s footprint”). Replacing grid-occupancy with continuous, coordinate-based boundary prediction allows the model to perform direct numerical comparisons, resolve small-object ambiguities, and render spatial descriptions less ambiguous (Huang et al., 20 Nov 2025).

This paradigm draws upon developments in spatial intelligence for MLLMs, robotics mapping, and layout estimation, extending beyond static images to multi-view temporal integration. Recent pipelines combine semantic visual inference with sequential geometric aggregation to produce temporally coherent, metric-grounded cognitive maps (Huang et al., 20 Nov 2025, Guo et al., 11 Mar 2025).

2. System Architecture and Pipeline

A typical Video2Layout pipeline comprises the following stages:

Supervised Fine-Tuning (SFT): The model learns to map video sequences to precise object boundaries using annotated simulation or synthetic datasets (e.g., AI2THOR), with each training sample comprising one or more RGB frames, per-pixel object masks, and known camera parameters. The output is a BEV (bird’s-eye view) scene map: a structured set of object names and continuous bounding boxes $B_i = (x_i^{\min}, y_i^{\min}, x_i^{\max}, y_i^{\max})$ in a Cartesian coordinate frame.
Reinforcement Fine-Tuning (RFT): To bridge the sim-to-real gap, the model is further trained on weakly-supervised or real-world QA data (e.g., ScanNet) via a PPO-style (GRPO) objective, leveraging rewards for map format compliance, multi-choice QA correctness, and numerical answer accuracy. No ground-truth coordinates are provided in RFT (Huang et al., 20 Nov 2025).

At inference, the vision encoder processes each frame, aggregates information temporally, and emits a set of object boundary boxes. Downstream reasoning (such as chain-of-thought over continuous coordinates) is then possible.

A schematic pipeline for indoor room layout estimation under dynamic conditions, as exemplified by the Ev-Layout dataset, integrates RGB video, high-rate event streams, and IMU measurements (Guo et al., 11 Mar 2025). Events supplement RGB with blur-free edge detection, aiding geometric recovery under rapid motion and challenging lighting. Transformer-based fusion modules receive both modalities, injecting event-temporal statistics to bias spatial attention and enhance structural prediction.

3. Continuous Layout Representations

Continuous boundary or corner representations are central to Video2Layout. For object-level mapping as in (Huang et al., 20 Nov 2025):

Each object is described by a bounding box:

$B_i = (x_i^{\min}, y_i^{\min}, x_i^{\max}, y_i^{\max}),$

with the object center $c_i = \left(\frac{x_i^{\min} + x_i^{\max}}{2}, \frac{y_i^{\min} + y_i^{\max}}{2}\right)$, size (footprint area) $s_i = (x_i^{\max} - x_i^{\min}) \cdot (y_i^{\max} - y_i^{\min})$, and inter-object distances $d_{ij}$ computed as Euclidean norms in the BEV plane.
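As a minimal sketch of how these quantities support direct numerical comparison, the following Python computes centers, footprint areas, and two distance variants from continuous boxes. The `Box` class, the helper names, and the example coordinates are hypothetical illustrations, not part of the published method; whether $d_{ij}$ is measured between centers or between nearest edges is an assumption here, so both are shown.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned BEV box (hypothetical container; coordinates in metres)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    @property
    def area(self):
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def center_distance(a, b):
    """Euclidean distance between box centers in the BEV plane."""
    (ax, ay), (bx, by) = a.center, b.center
    return math.hypot(ax - bx, ay - by)


def min_distance(a, b):
    """Smallest gap between two axis-aligned boxes (0 if they overlap)."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return math.hypot(dx, dy)


# Made-up scene: which object is closer to the door?
door = Box("door", 0.0, 0.0, 1.0, 0.2)
table = Box("table", 2.0, 1.0, 4.0, 2.0)
chair = Box("chair", 1.0, 3.0, 1.5, 3.5)
closer = min((table, chair), key=lambda obj: center_distance(obj, door))
```

With these made-up coordinates, `closer` is the table, and `table.area` gives its footprint directly, which is the kind of query a fixed-cell grid map can only answer up to cell resolution.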

For room layout estimation (Ev-Layout (Guo et al., 11 Mar 2025)), the Manhattan-world prior yields a compact layout parameterization by the 2D or 3D coordinates of floor-wall-ceiling corners:

2D image corners: $p_i = (u_i, v_i),\ i=1\ldots4$.
3D rays: $C_i = s_i K^{-1}[u_i, v_i, 1]^\top$, with $K$ the intrinsic matrix and $s_i$ a per-corner depth scale (not the object size $s_i$ defined above).
Planes: $\Pi_j: n_j^\top X + d_j = 0$, recovered from classified lines via RANSAC.

This enables loss functions based directly on L1 or L2 distances between predicted and ground truth polygons or bounding boxes.
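The ray expression $C_i = s_i K^{-1}[u_i, v_i, 1]^\top$ can be illustrated with a short sketch. It assumes a skew-free pinhole camera, so that $K^{-1}$ has a closed form in terms of focal lengths and principal point; the function name and the intrinsic values in the usage note are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Return depth * K^{-1} [u, v, 1]^T in the camera frame.

    Assumes a skew-free pinhole intrinsic matrix
    K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    x = (u - cx) / fx  # normalized image x-coordinate
    y = (v - cy) / fy  # normalized image y-coordinate
    return (depth * x, depth * y, depth)
```

For example, with assumed intrinsics `fx = fy = 500`, `cx = 320`, `cy = 240`, the principal point back-projects onto the optical axis: `backproject(320, 240, 2.0, 500, 500, 320, 240)` gives `(0.0, 0.0, 2.0)`.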

4. Training Objectives and Data

Losses for boundary prediction in SFT:

Coordinate regression, typically

$\mathcal{L}_{coord} = \sum_{i=1}^{N}\|\hat{B}_i - B_i\|_1$

or its L2 variant.
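A minimal sketch of the L1 coordinate loss follows. It assumes predicted and ground-truth boxes are already matched by index, which the source does not specify; the function name is hypothetical.

```python
def l1_box_loss(pred_boxes, true_boxes):
    """Sum of absolute coordinate differences over index-matched boxes.

    Each box is a tuple (x_min, y_min, x_max, y_max).
    """
    if len(pred_boxes) != len(true_boxes):
        raise ValueError("predicted and ground-truth boxes must be matched one-to-one")
    return sum(
        abs(p - t)
        for pred, true in zip(pred_boxes, true_boxes)
        for p, t in zip(pred, true)
    )
```

In practice this would be computed on tensors inside a training framework; the plain-Python form only makes the summation explicit.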

Reinforcement fine-tuning employs a PPO-style clipped advantage objective, with rewards combining format, MCQ, and numerical answer correctness. Specifically, format compliance ($R_{\text{format}}$), multi-choice QA accuracy ($R_{\text{mcq}}$), and numerical agreement across various thresholds ($R_{\text{num}}$) are weighted and combined:

$R = \lambda_{\text{format}} R_{\text{format}} + \lambda_{\text{mcq}} R_{\text{mcq}} + \lambda_{\text{num}} R_{\text{num}}$
Advantages: $A_k = \frac{R_k - \operatorname{mean}(\{R_1, \ldots, R_G\})}{\operatorname{std}(\{R_1, \ldots, R_G\})}$, normalized over the $G$ responses sampled for the same prompt, as is standard for GRPO.
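The weighted reward combination and the GRPO-style advantage described above can be sketched as follows. The weight values are placeholders chosen for illustration (the source does not give them), and the group-normalized advantage follows the standard GRPO formulation rather than anything confirmed for this specific model.

```python
from statistics import mean, pstdev


def total_reward(r_format, r_mcq, r_num, weights=(0.2, 0.4, 0.4)):
    """Weighted sum of the three reward terms (weights are hypothetical)."""
    w_format, w_mcq, w_num = weights
    return w_format * r_format + w_mcq * r_mcq + w_num * r_num


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Standard GRPO form: (reward - group mean) / group std, with eps
    guarding against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal, which is one reason a graded numerical reward can be more informative than a binary one.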

The primary supervised training corpus is generated in simulation (AI2THOR), with BEV-aligned ground truth projections. The QVS-Bench diagnostic split is crafted to quantify the effect of input-frame count on spatial reasoning accuracy and map fidelity (Huang et al., 20 Nov 2025).

Event-based layout estimation supplements video with event streams and IMU data. Temporal event patterns are aggregated into per-patch Poisson histograms; Kullback-Leibler divergence between patch distributions guides transformer attention (event-temporal distribution feature, ETDF). Losses for line and junction prediction, line-type classification, and, optionally, temporal tracking consistency are adopted (Guo et al., 11 Mar 2025).
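To make the patch-level comparison concrete, here is a rough sketch that bins event timestamps for a patch into a normalized temporal histogram and compares two patches by Kullback-Leibler divergence. The bin count, the additive smoothing, and the function names are assumptions made for illustration; how ETDF actually parameterizes the histograms and feeds the divergence into attention is not reproduced here.

```python
import math


def temporal_histogram(timestamps, t_start, t_end, bins=8, eps=1e-6):
    """Normalized histogram of event timestamps within [t_start, t_end).

    Additive smoothing (eps) keeps every bin strictly positive so that
    the KL divergence below is always finite.
    """
    counts = [0] * bins
    width = (t_end - t_start) / bins
    for t in timestamps:
        if t_start <= t < t_end:
            index = min(int((t - t_start) / width), bins - 1)
            counts[index] += 1
    total = sum(counts) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A patch whose events cluster early in the window and one whose events cluster late yield a large divergence, while two patches with similar temporal activity yield a value near zero.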

5. Benchmarks and Quantitative Results

QVS-Bench provides 4,000 real-world QA samples categorized into:

Object Size (numerical)
Relative Distance (MCQ)
Object Count (numerical)
Minimum Distance (numerical)
Vertical/Horizontal Direction (MCQ)

Input lengths are systematically varied (1/4/8/12/16 frames). Evaluation combines MCQ accuracy and relative accuracy for numericals, alongside structural metrics (box size, center distance, angle accuracy).
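The two answer-level metrics can be sketched as below. The threshold set used for relative accuracy (0.50 to 0.95 in steps of 0.05) is an assumption borrowed from a convention used in other spatial-reasoning benchmarks; the source does not state which thresholds QVS-Bench uses, and the function names are hypothetical.

```python
def mcq_accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that exactly match the answer key."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def mean_relative_accuracy(pred, truth, thresholds=None):
    """Share of confidence thresholds at which the relative error is small enough.

    A prediction counts as correct at threshold th when
    |pred - truth| / |truth| < 1 - th. Requires truth != 0.
    """
    if thresholds is None:
        thresholds = [0.50 + 0.05 * k for k in range(10)]  # assumed threshold set
    relative_error = abs(pred - truth) / abs(truth)
    return sum(relative_error < 1 - th for th in thresholds) / len(thresholds)
```

Averaging over thresholds gives partial credit to near-miss numerical answers instead of scoring them as simply right or wrong.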

Main results from (Huang et al., 20 Nov 2025):

V2LO-7B achieves +4.92 percentage points over the best grid-map baseline on QVS-Bench: 67.77% overall vs. 31.11% for the base MLLM.
For spatial reasoning aggregate benchmarks (EmbSpatial-Bench, ViewSpatial-Bench, OmniSpatial-Bench, SPAR-Bench), V2LO-7B attains 47.46% (vs. 44.17% baseline, +3.29 pp).
Per-task QVS-Bench results: Vertical Direction: 82.20% (+53.1 pp); Object Size: 63.56%; Minimum Distance: 30.70%.

Ev-Layout (Guo et al., 11 Mar 2025) reports, for event+video-based layout estimation in challenging conditions:

Mean structural average precision (sAP) up to 48.6 vs. 24.2 for classical line+classifier baselines.
High temporal fidelity and robustness under rapid motion (up to 12 rad/s angular rates).

6. Application Domains and Extensions

Video2Layout enables multiple downstream applications:

Domain	Video2Layout Role	Representative Tasks
Spatial QA in MLLMs	Provides explicit, metric, BEV cognitive maps	“Is the stove left of the sink?”
Robotics	Supplies accurate fine-grained scene geometry for planning/navigation	Obstacle avoidance, semantic mapping
AR/VR Layout	Delivers rapid 3D layout recovery under dynamic view/motion/lighting	Room reconstruction, VR scene anchoring
Indoor Tracking	Enhances room layout tracking with multi-modal sensor fusion	Real-time mapping, temporal consistency

A plausible implication is that, as cognitive agents are increasingly tasked with embodied interaction or spatial question answering, explicit layout representations of the kind typified by Video2Layout may displace coarse raster grid-maps in domains that demand fine-grained quantitative spatial support.

7. Current Limitations and Future Directions

Current Video2Layout pipelines (Huang et al., 20 Nov 2025, Guo et al., 11 Mar 2025) focus on 2D BEV layouts and are predominantly validated on indoor environments. Identified limitations:

Spatial map accuracy degrades as frame count increases beyond 12—excessive temporal aggregation can overwhelm the model, motivating research on attention-based filtering and frame selection strategies.
The absence of direct depth or point-cloud integration constrains extension to full 3D layouts or outdoor scenes.
Domain adaptation remains necessary for generalization beyond simulation-trained distributions; future directions likely include self-supervised map representations and multimodal fusion with lidar or SLAM priors.
Real-time, robust temporal smoothing and frame-to-frame object correspondence in highly dynamic scenes remain open research problems, with Kalman filters or learned recurrent mechanisms as plausible augmentation paths.
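As an illustration of the Kalman-filter direction mentioned above, the sketch below smooths one box coordinate over time with a scalar random-walk Kalman filter. It is not part of either cited system; the noise variances are arbitrary placeholders, and it assumes frame-to-frame object correspondence has already been solved.

```python
def kalman_smooth(measurements, process_var=1e-3, meas_var=1e-2):
    """Filter a sequence of noisy scalar measurements (e.g. x_min of one object).

    Uses a random-walk state model: the coordinate is assumed constant
    between frames up to process noise. Variances are hypothetical.
    """
    estimate = measurements[0]
    variance = meas_var
    smoothed = [estimate]
    for z in measurements[1:]:
        variance += process_var                  # predict: uncertainty grows
        gain = variance / (variance + meas_var)  # Kalman gain
        estimate += gain * (z - estimate)        # update toward the measurement
        variance *= 1 - gain
        smoothed.append(estimate)
    return smoothed
```

Applying the filter independently to each of the four box coordinates is the simplest option; a joint state with velocity terms would be a natural refinement for moving objects.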

References

"Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning" (Huang et al., 20 Nov 2025)
"Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking" (Guo et al., 11 Mar 2025)
