Structured Causal Video Reasoning via Multi-Objective Alignment

Published 6 Apr 2026 in cs.CL | (2604.04415v1)

Abstract: Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient processes and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto-Frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.


Authors (10)

Summary

The paper proposes a structure-first paradigm by introducing Structured Event Facts to ground video reasoning in verifiable causal evidence.
It leverages a novel P-FAB algorithm within a multi-objective reinforcement learning framework to dynamically balance factual completeness, format adherence, and brevity.
Empirical results on the Factum-4B model show significant gains in temporal grounding and overall video understanding compared to unstructured reasoning methods.

Structured Causal Video Reasoning via Multi-Objective Alignment: An Expert Analysis

Motivation and Limitations of Unstructured Video Reasoning

Recent Video Large Vision-Language Models (Video-LVLMs) adopt the chain-of-thought (CoT) paradigm, yet most approaches rely on unstructured reasoning pathways. This leads to verbose outputs, diluted visual cues, and fragile causal inference. The central insight of "Structured Causal Video Reasoning via Multi-Objective Alignment" (2604.04415) is that unstructured CoT, although effective in text-only LLMs, fails in the video domain because of high spatiotemporal redundancy and under-exploitation of fine-grained event structure.

Empirically, models augmented with conventional CoT reasoning underperform their instruction-tuned baselines, with regressions observed consistently. The root cause is “reasoning drift,” in which the model loses focus on the query and ignores causality, in stark contrast with the structured, event-centric cognitive process of human observers.

Structured Event Facts and the Structure-First Paradigm

The authors propose a strict "Structure-First" paradigm built on Structured Event Facts: an explicit, high-density schema that records key events, their spatiotemporal attributes (time, person, action, scene, object, camera), and a causal event caption. This compact representation is generated before any downstream reasoning and serves as both anchor and constraint for subsequent causal deduction. The approach prioritizes event saliency, temporal grounding, and verifiable intermediate evidence, bringing the model's process closer to the way human observers parse events.
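To make the schema concrete, here is a minimal sketch of how Structured Event Facts could be represented in code. The field names mirror the attributes listed above, but the class names, the use of seconds for time, the `validate` checks, and the sample events are all illustrative assumptions rather than the paper's actual format.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class EventFact:
    """One salient event (hypothetical layout, not the paper's schema)."""
    start_s: float   # event start, in seconds (assumed unit)
    end_s: float     # event end, in seconds (assumed unit)
    person: str
    action: str
    scene: str
    obj: str         # the "object" attribute
    camera: str
    caption: str     # causal event caption


@dataclass
class StructuredEventFacts:
    """Ordered events produced before any reasoning step."""
    events: list = field(default_factory=list)

    def validate(self):
        """Return a list of problems; an empty list means well-formed."""
        problems = []
        previous_start = float("-inf")
        for i, ev in enumerate(self.events):
            if ev.end_s < ev.start_s:
                problems.append(f"event {i}: end before start")
            if ev.start_s < previous_start:
                problems.append(f"event {i}: not in temporal order")
            previous_start = ev.start_s
            if not ev.caption.strip():
                problems.append(f"event {i}: empty caption")
        return problems

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


# Made-up example events, for illustration only.
facts = StructuredEventFacts(events=[
    EventFact(0.0, 3.5, "child", "kicks a ball", "backyard", "ball",
              "static wide shot", "The child kicks the ball toward the fence."),
    EventFact(3.5, 6.0, "dog", "chases the ball", "backyard", "ball",
              "pans right", "Because the ball rolls away, the dog runs after it."),
])
print(facts.validate())
```

Keeping each attribute as a separate field is what makes the intermediate evidence checkable: a timestamp or an actor can be verified on its own, without parsing a free-form description.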

CausalFact-60K Dataset and Progressive Training

To train models to extract, structure, and utilize event facts, the authors introduce CausalFact-60K—a dataset emphasizing temporally dense, causally rich event annotation. The training pipeline is decomposed into four progressive stages:

Facts Alignment: Instruction tuning to produce detailed Structured Event Facts.
Format Warm-Start: Strict format enforcement to internalize the schema and avoid hallucinated structures.
Thinking Warm-Start: Instruction tuning for structured causal reasoning, leveraging the facts scaffold.
Reinforcement Learning (RL) Post-Training: Optimization under multi-objective constraints using both scalar and complex rewards.

This structure imposes a strong causal prior and delivers a format in which reasoning remains concise, salient, and directly grounded in visual evidence.
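As an illustration of what strict format enforcement might check, the sketch below tests that an output contains a facts section, then a reasoning section, then an answer, each exactly once and non-empty. The tag names `<facts>`, `<think>`, and `<answer>` are hypothetical; the source does not specify the actual output markup.

```python
import re

# Hypothetical tag names, assumed for this sketch only.
SECTION_ORDER = ("facts", "think", "answer")


def check_structure_first(output):
    """True if every section appears exactly once, in order, and is non-empty."""
    position = 0
    for tag in SECTION_ORDER:
        # Reject duplicated sections.
        if len(re.findall(f"<{tag}>", output)) != 1:
            return False
        # Search only after the end of the previous section to enforce order.
        pattern = re.compile(f"<{tag}>(.*?)</{tag}>", re.DOTALL)
        match = pattern.search(output, position)
        if match is None or not match.group(1).strip():
            return False
        position = match.end()
    return True


good = "<facts>e1; e2</facts><think>e1 causes e2</think><answer>B</answer>"
swapped = "<think>e1 causes e2</think><facts>e1; e2</facts><answer>B</answer>"
print(check_structure_first(good), check_structure_first(swapped))
```

A binary check like this could serve as a format reward signal, though whether the paper scores format this way is not stated here.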

Multi-Objective Optimization and P-FAB

The integration of structured facts with downstream reasoning introduces non-trivial multi-objective trade-offs, especially under the restricted text token budget of the RL stage. Standard RL algorithms such as PPO or GRPO proved insufficient: they collapse the separate reward signals into a single scalar and so cannot dynamically resolve conflicts between objectives (e.g., between factual completeness and brevity).
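A toy example shows how collapsing rewards can hide such a conflict. The rewards below are invented, and `group_normalize` is a simplified GRPO-style advantage (centered and scaled within one sampled group), not the exact formulation used in the paper.

```python
from statistics import mean, pstdev


def group_normalize(values):
    """Center and scale rewards within one group of sampled responses."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up rewards for four sampled responses.
rewards = {
    "facts":  [1.0, 0.0, 1.0, 0.0],   # structural completeness
    "length": [0.0, 1.0, 1.0, 0.0],   # brevity
}

# Scalarized: add the objectives first, then normalize once.
summed = [f + b for f, b in zip(rewards["facts"], rewards["length"])]
scalar_adv = group_normalize(summed)

# Per-objective: normalize each reward separately and keep a vector.
vector_adv = {name: group_normalize(vals) for name, vals in rewards.items()}

print(scalar_adv)
print(vector_adv)
```

Responses 0 and 1 receive the same scalar advantage even though one is complete but long and the other is short but incomplete. The per-objective advantages keep that distinction, which is the information a multi-objective method needs in order to balance the two.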

The authors formulate the RL stage as a true Multi-Objective Reinforcement Learning (MORL) problem. The core contribution is the Pareto-Frontier guided Advantage Balancing (P-FAB) algorithm. P-FAB:

Treats reward signals (facts, format, length, performance) as independent, not scalarized,
Dynamically adjusts optimization direction to approximate the Pareto frontier in the multi-objective space,
Employs the Multiple Gradient Descent Algorithm (MGDA) principle for resolving conflicts among objectives in a scale-invariant manner.
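The MGDA principle mentioned in the list above can be sketched as a min-norm problem: find simplex weights whose combination of per-objective vectors has the smallest norm. The code below is a generic Frank-Wolfe solver for that problem in plain Python. It is not the P-FAB algorithm itself; how P-FAB applies the principle to advantages is not detailed here, and the function names and toy vectors are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_weights(vectors, iters=50):
    """Frank-Wolfe search for simplex weights minimizing ||sum_k w_k * v_k||^2."""
    k = len(vectors)
    gram = [[dot(vi, vj) for vj in vectors] for vi in vectors]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # scores[i] = v_i . d, where d is the current combined direction.
        scores = [sum(weights[j] * gram[i][j] for j in range(k)) for i in range(k)]
        t = min(range(k), key=scores.__getitem__)
        dd = sum(weights[i] * scores[i] for i in range(k))   # d . d
        dv = scores[t]                                        # d . v_t
        vv = gram[t][t]                                       # v_t . v_t
        denom = dd - 2.0 * dv + vv
        if denom <= 1e-12:
            break
        # Exact line search on the segment between d and v_t, clipped to [0, 1].
        gamma = max(0.0, min(1.0, (dd - dv) / denom))
        weights = [(1.0 - gamma) * w for w in weights]
        weights[t] += gamma
    return weights


def combine(vectors, weights):
    return [sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(vectors[0]))]


# Toy per-objective vectors with very different magnitudes.
g_facts, g_length = [4.0, 0.0], [0.0, 1.0]
w = min_norm_weights([g_facts, g_length])
d = combine([g_facts, g_length], w)
print(w, d)
```

For these two vectors the weights come out at roughly 0.06 and 0.94, so the large-magnitude objective does not dominate, and the combined direction has a positive inner product with both inputs. That common-improvement property is what makes the min-norm solution useful for steering toward the Pareto frontier.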

This results in optimization dynamics that promote rare but critical objectives (e.g., causal fidelity, strict format adherence) and avoid mode collapse along easier reward axes.

Empirical Results and Claims

The resulting Factum-4B achieves significant gains across both temporal grounding and general reasoning benchmarks, often surpassing much larger open-source baselines and in some tasks even proprietary models.

Temporal Grounding: On ActivityNet-Captions, Factum-4B achieves recall of 48.4% and 28.1% at the two reported IoU thresholds, consistently outperforming open-source 7B-scale models and narrowing the gap to closed-source models (e.g., Gemini-2.5).
General Video Understanding: On VideoMME, MLVU, and ETBench, Factum-4B sets new performance records for open-source models, reaching 64.7% on VideoMME and 73.6% on NExT-GQA, and surpassing GPT-4o in several event-localization and causal reasoning sub-tasks.
Ablation: Removing either the structured facts or the explicit reasoning process causes severe degradation (drops of up to 7 percentage points), demonstrating that both are necessary. P-FAB shows a clear advantage over GRPO, especially as group size increases, and provides further gains (2-3 percentage points) under strong RL signal mixing.
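Temporal grounding results such as those above are conventionally scored as recall at a temporal IoU threshold: a prediction counts as a hit when its overlap with the ground-truth segment meets the threshold. The sketch below shows that standard computation; the segments and thresholds are made-up values, and the function names are illustrative.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at(preds, gts, threshold):
    """Fraction of queries whose predicted segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Made-up predictions and ground truth for two queries.
preds = [(0.0, 10.0), (2.0, 8.0)]
gts = [(0.0, 10.0), (0.0, 10.0)]
print(recall_at(preds, gts, 0.5), recall_at(preds, gts, 0.7))
```

The second prediction overlaps its target with an IoU of 0.6, so it counts at the looser threshold but not at the stricter one, which is why recall typically falls as the threshold rises.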

The paper’s claims are well-supported by careful ablations and consistent benchmark improvements. The authors specifically highlight that unstructured “thinking” degrades video reasoning, in direct contrast with observations in text-only LLMs.

Implications and Future Directions

This work demonstrates that bridging the cognitive gap between human and machine video understanding requires explicit, structured event modeling and a constraint-driven, causally verifiable reasoning pipeline. The paradigm shift toward “Facts first, Reasoning second” not only yields higher accuracy but also superior interpretability and verifiability of outputs.

P-FAB’s MORL formulation provides a principled approach to solving multi-objective conflicts in RL for multimodal tasks, and is likely generalizable to other structured reasoning settings (e.g., procedural task understanding, robotic planning).

Scaling of the CausalFact dataset and further architectural innovations may extend Factum-4B’s robustness to even longer videos and more diverse real-world scenarios. The integration of finer-grained memory and retrieval mechanisms, or lifelong learning strategies, could amplify the impact of structured priors in maintaining causal consistency over extended temporal horizons.

Conclusion

"Structured Causal Video Reasoning via Multi-Objective Alignment" establishes that effective video reasoning by LMMs mandates not only more powerful architectures but fundamentally new training paradigms, centered on explicit, structured event extraction and rigorous, causally grounded reasoning. The CausalFact-60K dataset, the Factum-4B model, and the P-FAB optimization algorithm collectively advance the field by setting new standards in evidence-grounded, interpretable video understanding. These findings provide both a blueprint for scalable multimodal cognition and a clear demonstration of the limitations of unstructured chain-of-thought reasoning for dense temporal tasks.