MP-PVIR: Multi-view Phase-Aware Incident Reasoning
- The paper presents a four-stage automated pipeline that segments pedestrian-vehicle incidents into five distinct cognitive phases using phase-aware vision-language models.
- It leverages synchronized multi-view video streams and detailed temporal grounding to achieve robust phase-specific captioning and high QA accuracy.
- The generated diagnostic reports offer actionable causal insights and prevention strategies that support vehicle-infrastructure cooperative safety systems.
Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning (MP-PVIR) is a fully automated framework designed to provide structured, actionable analysis of pedestrian-vehicle incidents from synchronized multi-camera video streams. MP-PVIR advances traditional video-based traffic safety systems by decomposing incidents into cognitive-behavioral phases, performing phase-wise multi-view reasoning, and delivering human-readable diagnostic reports that include causal analysis and targeted prevention recommendations. The framework leverages vision-language models (VLMs) fine-tuned for temporal and multi-view understanding, together with a large language model (LLM) for hierarchical synthesis, and demonstrates strong performance on the Woven Traffic Safety (WTS) dataset (Zhen et al., 18 Nov 2025).
1. System Overview and Stages
MP-PVIR consists of a four-stage pipeline:
- Event-Triggered Multi-View Acquisition & Synchronization: Overhead and in-vehicle video streams are monitored by a lightweight trigger (e.g., a proximity and relative-velocity threshold). Upon detection of an “Event of Interest,” buffered clips are retrieved and temporally synchronized via hardware timestamps or, if absent, motion-energy cross-correlation. The result is a set of temporally aligned video streams, one per camera view.
- Pedestrian Behavior Phase Segmentation: The temporal grounding VLM (TG-VLM) segments the synchronized event into five distinct behavioral phases: pre-recognition, recognition, judgment, action, and avoidance. For each phase, TG-VLM predicts start and end interval boundaries.
- Phase-Specific Multi-View Reasoning: Phase-specific multidimensional evidence is extracted via PhaVR-VLM, which generates dense descriptions and answers targeted video-QA probes for each camera view and phase.
- Hierarchical Synthesis & Diagnostic Reasoning: All phase boundaries, captions, and QA outputs are structured into a JSON payload and passed to an LLM (Claude Opus 4). The LLM composes comprehensive, causal incident reports and recommends prevention measures.
Each stage prepares and transforms data for the subsequent step—culminating in a prevention-oriented diagnostic capable of supporting vehicle-infrastructure cooperative systems.
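The fallback synchronization in Stage 1 can be sketched in plain Python: per-frame motion energy is computed for each view, and the lag maximizing the cross-correlation of the two energy signals aligns the streams. Function names and the frame representation are illustrative, not from the paper.

```python
def motion_energy(frames):
    """Per-frame motion energy: summed absolute difference between
    consecutive grayscale frames (each frame given as a flat pixel list)."""
    return [
        sum(abs(a - b) for a, b in zip(curr, prev))
        for prev, curr in zip(frames, frames[1:])
    ]

def best_lag(sig_a, sig_b, max_lag):
    """Return the frame lag of sig_b relative to sig_a that maximizes
    their cross-correlation, searched over [-max_lag, +max_lag]."""
    def xcorr(lag):
        pairs = [
            (sig_a[i], sig_b[i + lag])
            for i in range(len(sig_a))
            if 0 <= i + lag < len(sig_b)
        ]
        return sum(x * y for x, y in pairs)
    return max(range(-max_lag, max_lag + 1), key=xcorr)
```

Shifting one view's frames by the recovered lag places both streams on the common temporal axis required by the later stages.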
2. Model Architectures and Core Modules
TG-VLM: Temporal Grounding Vision-LLM
TG-VLM is based on the Qwen-2.5-VL-7B backbone with LoRA (Low-Rank Adaptation) modules in each Transformer layer. Input consists of multi-view video frames sampled at 2 FPS, processed by a ViT (Vision Transformer) encoder, and temporally serialized into a single token stream. The Qwen decoder, conditioned on a prompt defining the behavioral phases, causally decodes 10 boundary tokens b_1, …, b_10 (a start and an end for each of the five phases). Training minimizes the causal-LM cross-entropy over the boundary tokens, L = −Σ_j log p(b_j | b_<j, X), and performance is evaluated by mean Intersection-over-Union across phases, mIoU = (1/5) Σ_i IoU(Î_i, I_i), where Î_i and I_i are the predicted and ground-truth intervals for phase i.
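The mIoU evaluation can be written as a short routine; representing each phase as a (start, end) interval in seconds is an assumption about the annotation format.

```python
def interval_iou(pred, gt):
    """IoU between two closed time intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_phases, gt_phases):
    """Mean IoU across the five behavioral phases, given dicts mapping
    phase name -> (start, end)."""
    phases = ["pre-recognition", "recognition", "judgment", "action", "avoidance"]
    return sum(interval_iou(pred_phases[p], gt_phases[p]) for p in phases) / len(phases)
```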
PhaVR-VLM: Phase-aware Video Reasoning
PhaVR-VLM utilizes the same backbone and LoRA tuning. It receives clipped, phase-specific sub-videos for each view, encoded with Multimodal Rotary Position Embedding (M-RoPE) for absolute-time awareness. Attention mechanisms in the Transformer blocks natively learn inter-view correspondences; no explicit fusion layers are necessary.
- Captioning: Decoder generates per-view, per-phase captions, supervised by cross-entropy loss.
- Visual Question Answering (VQA): Model decodes multiple-choice QA on phase-specific features, with accuracy measured by selection of the correct choice.
A composite caption score aggregates BLEU-4, ROUGE-L, METEOR, and CIDEr scaled by 1/10, averaged and rescaled to a 0–100 range.
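Given the "scaled CIDEr" wording and the 0–100 range of the reported scores, a plausible form of the aggregation (an assumption, not stated explicitly here) is:

```python
def composite_caption_score(bleu4, rouge_l, meteor, cider):
    """Assumed aggregation: mean of BLEU-4, ROUGE-L, METEOR (each on
    0-1) and CIDEr/10 (CIDEr on 0-10), rescaled to a 0-100 range."""
    return 100.0 * (bleu4 + rouge_l + meteor + cider / 10.0) / 4.0
```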
3. Multi-View Temporal and Spatial Integration
MP-PVIR’s multi-view capabilities are grounded in two principles:
- Temporal Serialization: Token streams from all synchronized views are concatenated such that cross-view relationships can be modeled directly by Transformer self-attention. This enables simultaneous access to all viewpoints at each timestep.
- Common Temporal Reference: The framework aligns camera feeds either by hardware timestamps or by feature-based alignment (motion-energy cross-correlation). This ensures that all object and scene interactions are analyzed on a shared temporal axis.
The self-attention process, parameterized by softmax attention weights, Attention(Q, K, V) = softmax(QK^T / √d_k) V, enables direct correspondence among views with no explicit geometric warping.
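The cross-view attention can be sketched for a single head in plain Python; because tokens from all views sit in one concatenated sequence, every query attends over every view's keys. Dimensions and values below are toy illustrations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over one concatenated token stream.
    Q, K, V are lists of d-dimensional vectors (lists of floats)."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out
```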
4. Phase-Aware Reasoning and Causal Chain Construction
MP-PVIR operationalizes behavioral theory by segmenting pedestrian-vehicle events into pre-recognition, recognition, judgment, action, and avoidance. This phase decomposition is computed by TG-VLM, generating temporal anchors that organize subsequent evidence gathering.
Within each phase, PhaVR-VLM analyzes view-specific clips to produce phase-tailored captions and VQA results. These outputs feed into the LLM, which constructs a causal chain mapping each phase’s observed factor (e.g., distraction, absence of crosswalk, vehicular maneuver) to its impact and proposes a corresponding prevention step.
The hierarchical reasoning recipe is formally structured:
- Assemble all phase boundaries, captions, and VQA answers.
- Engineer a prompt to the LLM directing it to: (1) summarize the event, (2) compare per-phase multi-view evidence, (3) elucidate causal chains, (4) propose per-phase prevention strategies.
- Validate the incident report output against the IncidentReportSchema before delivery.
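The steps above can be sketched as payload assembly plus a pre-delivery check. The field names and the simple required-keys check are stand-ins: the actual IncidentReportSchema is not specified in the text.

```python
import json

# Hypothetical report sections; the real IncidentReportSchema is not
# spelled out here, so validation is a simple required-keys check.
REQUIRED_REPORT_KEYS = {"summary", "phase_evidence", "causal_chain", "prevention"}

def build_payload(boundaries, captions, qa_answers):
    """Bundle phase boundaries, per-view captions, and VQA answers into
    the JSON payload handed to the report-writing LLM."""
    return json.dumps({
        "phase_boundaries": boundaries,   # phase -> [start, end]
        "captions": captions,             # phase -> view -> caption
        "qa_answers": qa_answers,         # phase -> list of QA records
    })

def validate_report(report):
    """Reject a report missing any required top-level section."""
    missing = REQUIRED_REPORT_KEYS - report.keys()
    if missing:
        raise ValueError(f"incident report missing sections: {sorted(missing)}")
    return report
```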
5. Evaluation on the Woven Traffic Safety (WTS) Dataset
The WTS dataset comprises more than 1,200 staged multi-view pedestrian-vehicle incidents, each annotated with phase boundaries, captions, and phase-specific QA labels. MP-PVIR’s evaluation reports:
- TG-VLM (Phase Segmentation): mean IoU = 0.4881 overall
  - Pre-recognition: 0.7887
  - Recognition: 0.5091
  - Judgment: 0.3662
  - Action: 0.4208
  - Avoidance: 0.3559
- PhaVR-VLM (Multi-view Reasoning):
  - Captioning (composite score): 33.063 (baseline: 30.03)
  - VQA accuracy:
    - Vehicle-view: 64.70% (baseline: 49.34%)
    - Overhead-view: 50.48% (baseline: 37.94%)
    - Environment: 54.44% (baselines fail)
  - Valid-choice rate on QA: 100% for the proposed approach
Without multi-view fine-tuning (i.e., with unfine-tuned backbones), the tasks fail outright; performance is contingent on parameter-efficient LoRA fine-tuning and multi-view alignment.
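The LoRA dependence can be illustrated with the update it applies: rather than fine-tuning a full weight matrix W, the model learns a low-rank pair (A, B) and uses the merged weight W' = W + (α/r)·B·A, where r is the LoRA rank. A pure-Python sketch with toy dimensions (not the paper's actual training code):

```python
def matmul(X, Y):
    """Plain-Python matrix multiply for small toy matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_merge(W, A, B, alpha):
    """Merged weight W' = W + (alpha / r) * B @ A, where r is the LoRA
    rank (A is r x d_in, B is d_out x r)."""
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

Because r is much smaller than the weight dimensions, only the A and B factors are trained, which is what makes adapting a 7B backbone to both segmentation and reasoning tasks tractable.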
| Stage | Metric | Result |
|---|---|---|
| Segmentation | mIoU (overall) | 0.4881 |
| Reasoning | Captioning score | 33.063 |
| Reasoning | QA accuracy (vehicle-view) | 64.70% |
6. Prevention Strategy Generation
The incident report generated by the LLM annotates contributing factors for each phase and recommends countermeasures. For example:
- “Smartphone distraction in Recognition” → “Deploy dynamic warning signs when pedestrians are detected”
- “Reversing vehicle in Action” → “Add reverse-camera/sonar on vehicles”
- “Lack of sidewalks in Judgment” → “Install pedestrian refuge islands or sidewalks”
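The examples above amount to a mapping keyed by (phase, factor). A minimal sketch, using only the entries listed here, with an illustrative fallback for unmapped pairs:

```python
# Entries taken from the example report above; keys are (phase, factor).
COUNTERMEASURES = {
    ("recognition", "smartphone distraction"):
        "Deploy dynamic warning signs when pedestrians are detected",
    ("action", "reversing vehicle"):
        "Add reverse-camera/sonar on vehicles",
    ("judgment", "lack of sidewalks"):
        "Install pedestrian refuge islands or sidewalks",
}

def recommend(phase, factor):
    """Return the countermeasure for a (phase, factor) pair, or a
    fallback requesting manual review when no mapping exists."""
    return COUNTERMEASURES.get(
        (phase.lower(), factor.lower()),
        "No predefined countermeasure; flag for manual review",
    )
```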
By mapping phase-specific observations to actionable prevention measures, MP-PVIR emphasizes not only incident description but intervention—bridging observed behavioral dynamics to concrete traffic safety improvements.
7. Significance and Practical Implications
MP-PVIR constitutes a systematic advance in the field of automated traffic incident analysis by:
- Integrating behavioral phase theory with advanced multi-view, phase-aware VLMs.
- Achieving state-of-the-art performance in segmentation, dense captioning, and video QA using synchronized, multi-view video data.
- Providing causal diagnostics and targeted recommendations that directly support safety interventions.
A plausible implication is that widespread adoption of MP-PVIR-like frameworks could augment vehicle-infrastructure cooperation systems by providing not only post hoc analyses but also potentially real-time prevention advisories in next-generation urban mobility environments (Zhen et al., 18 Nov 2025).