PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

Published 31 Mar 2026 in cs.CV, cs.AI, and cs.RO | (2603.29281v1)

Abstract: A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language-models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation - physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions - Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP), and to our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding where the accuracy improves by 36.4%. Our results suggest that ontology-structured, domain specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism

Abstract PDF Upgrade to Chat

Authors (9)

Summary

The paper presents a novel PRISM dataset that overcomes conventional video-language limitations by providing multi-view, ontology-structured data for embodied AI in retail.
It employs a hybrid annotation pipeline that integrates LLM-generated chain-of-thought, physics-grounded reasoning, and synchronized multi-modal video to ensure high-fidelity supervision.
Empirical results demonstrate significant improvements, including a 23.8% accuracy increase and up to five-fold error reduction in embodied reasoning tasks.

PRISM: A Multi-View, Multi-Capability Retail Video Dataset for Embodied Vision-LLMs

The PRISM dataset establishes a new methodological and empirical standard for training and evaluating embodied vision-LLMs (VLMs) in structured retail environments. It provides the first large-scale, multi-view, and ontology-structured instruction-tuning corpus designed to comprehensively cover all key perceptual and reasoning capabilities required for real-world deployment of physical AI systems.

Motivation and Problem Statement

Conventional VLMs and video-language datasets exhibit critical limitations for embodied AI in physical settings, particularly in domains like retail. These gaps include inadequate coverage of spatial, temporal, and embodied action knowledge; a lack of synchronized egocentric and exocentric video; restricted capability probes; and non-systematic curriculum composition. PRISM directly addresses these deficits via (i) a tightly defined knowledge ontology spanning space, time, and embodied action; (ii) balanced coverage of synchronized egocentric, exocentric, and 360° panoramic video; and (iii) finely differentiated annotation protocols enabling chain-of-thought (CoT), multi-choice, and open-ended supervision. The unified framework renders PRISM especially suited for training robust, generalizable embodied VLMs in complex, dynamic environments such as retail stores.

Figure 1: PRISM captures multi-view retail video from four synchronized modalities—egocentric, exocentric, 360° panoramic, and depth—enabling ontology-aligned supervision for robust, generalizable VLM adaptation.

Dataset Construction and Ontology

PRISM comprises 270,000 supervised fine-tuning (SFT) samples systematically distributed across 20+ distinct capability probes. The samples are drawn from video recorded in five diverse, structured retail environments using both wearable (egocentric) and omnidirectional (ALIA 360°) exocentric cameras. The ontology is explicitly factored into four capability domains—Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP)—which together instantiate a triadic knowledge structure: spatial understanding, temporal/causal inference, and embodied action reasoning.

The annotation pipeline integrates five strategies: (1) structured metadata extraction via domain-specific task logs, (2) LLM-based (Gemini 2.5 Flash) QA and CoT generation, (3) physics-grounded reasoning (Gemini Robotics ER 1.5) over video, (4) depth analysis (DepthCrafter) for spatial statistics, and (5) self-supervised transformations for physical pretext tasks (e.g., arrow-of-time, object permanence). This hybrid protocol yields high-diversity, high-fidelity supervision across all capabilities with minimal human cost.

(Figure 2)

Figure 2: PRISM pipeline overview showing data acquisition, multi-modal annotation, ontology-structured capability probes, and VLM fine-tuning integration across ego/exo/360° views.

Task Suite and Instruction Formats

Each capability probe implements a unique, operational challenge derived from practical deployment scenarios:

Embodied Reasoning (ER): Sequential subtask prediction, task completion verification, goal-conditioned action reasoning, cross-view action matching, fine-grained hand interaction recognition, atomic action recognition/reasoning (exocentric), multi-actor scene analysis, and social navigation reasoning.
Common Sense (CS): Scene/environment VQA (ego/exo), depth-aware spatial CoT, affordance and causality reasoning, and exocentric spatial inference.
Spatial Perception (SP): Relative depth ordering and panoramic 360° spatial layout reasoning.
Intuitive Physics (IP): Arrow-of-time discrimination (ego/exo), CoT over physical motion cues, and object permanence.

Each probe supports either open-ended understanding, CoT with $\langle$ think $\rangle$ tags, or MCQ overlays, with a balanced mix to eliminate format-induced learning bias. The compositional diversity is unmatched in prior datasets and critically supports ablation and curriculum studies.

Experimental Protocol and Results

Fine-tuning experiments leverage Cosmos-Reason2-2B, a Qwen3-VL-based VLM optimized for video-language physical reasoning. PRISM training is performed with LoRA adaptation (49.3M parameters, BF16) to ensure resource-efficiency and analytic clarity. The evaluation suite measures MCQ accuracy (task-specific multiple choice) and GPT-4o auto-judged CoT quality, with rigorous hold-out validation for each capability.

Key findings:

Aggregate Performance: PRISM fine-tuning increases average accuracy across all tasks from 62.8% (zero-shot baseline) to 86.6% (+23.8%), with an average error reduction of 66.6%.
Embodied Reasoning: ER domains observe up to five-fold reduction in error rates. Notably, goal-conditioned action reasoning and atomic action CoT see >+55% absolute accuracy gains over baseline.
Cross-View Generalization: Incorporating exocentric supervision significantly improves cross-view reasoning without degrading egocentric capabilities, validating the hypothesis that multi-view curriculum is synergistic.
Chain-of-Thought Supervision: LLM-generated CoT labels yield stronger performance improvements than template-based alternatives, confirming that supervision format directly impacts learning signal quality and reasoning robustness.
Data Scaling: 60% of the dataset achieves approximately 95% of total attainable gain, highlighting data efficiency for practical deployment.

(Figure 3)

Figure 3: Representative examples of PRISM’s diverse capability probes, including sequential next-action prediction, scene VQA, egocentric–exocentric matching, depth-aware spatial reasoning, and physics-grounded chain-of-thought outputs.

(Figure 4)

Figure 4: Scaling analysis—accuracy as a function of training progress, by domain and overall, showing rapid gains in early stages with diminishing returns after ~60% data utilization.

Theoretical and Practical Implications

PRISM’s methodology substantially extends the state of the art in both embodied VLM training and dataset design. Notable implications include:

Unified, Domain-Aligned Instruction-Tuning: By enforcing deployment-specific, ontology-driven supervision, PRISM enables the development and evaluation of VLMs that are explicitly matched to the complex, multi-agent, multi-view challenges of real retail environments.
Ego-Exo Curriculum as a Fundamental Axis: The results strongly argue that multi-view learning (especially panoramic exocentric) is essential for robust, deployment-relevant perception and long-horizon task planning—contradicting prior assumptions over reliance on purely egocentric corpora.
Annotation Pipeline Efficiency: The demonstrated scalability via hybrid, LLM-augmented annotation sets a framework for future data curation across other structured domains (e.g., healthcare, logistics) without excessive manual cost.
Format-Driven Reasoning Generalization: The systematic advantage of chain-of-thought over template supervision further supports the stance that annotation design—not solely model scale—mediates successful reasoning adaptation.

Limitations and Future Directions

The primary limitation is the focus on a single VLM backbone (Cosmos-Reason2-2B), which constrains generalization of scaling analysis to larger architectures. Automated evaluation (MCQ, GPT-4o) may underestimate emergent natural language capabilities on harder tasks. Future extensions include: scaling analysis on 7B–13B class VLMs, incorporating additional sensors (metric depth, IMUs), downstream deployment in closed-loop robotic control (e.g., GR00T pipeline), and collection of further data from new domains and retail settings to probe generalization limits.

Conclusion

PRISM delivers the first truly comprehensive, multi-view, capability-rich, and instruction-format balanced SFT dataset for embodied vision-LLMs in retail. The empirical evaluation demonstrates large, systematic improvements over state-of-the-art generalist video-LLMs, with the greatest impact in action reasoning, cross-view understanding, and spatial/temporal physical knowledge. Both PRISM and its findings constitute essential infrastructure for the next generation of robust, adaptable, and generalizable embodied AI systems.

Markdown Report Issue