- The paper presents a unified framework for video understanding by integrating low-level geometry with high-level semantic analysis, advancing scalable foundation models.
- It reviews methods in depth estimation, camera pose estimation, optical flow, and point tracking, highlighting feed-forward and diffusion-based techniques.
- The paper discusses unified video models that combine video QA, segmentation, and generative tasks, paving the way for robust, interactive video agents.
Video Understanding: From Geometry and Semantics to Unified Models
Introduction
"Video Understanding: From Geometry and Semantics to Unified Models" (2603.17840) presents a comprehensive and structured survey of modern techniques for video understanding, systematically classifying progress along three orthogonal axes: low-level video geometry, high-level semantic understanding, and unified video understanding models. The review elucidates the transition from isolated, task-specific pipelines towards holistic, unified models capable of flexible adaptation to diverse downstream objectives. A central thesis is that geometric and semantic understanding, historically pursued independently, are increasingly intertwined, and that integrated representations are foundational for constructing robust, scalable video foundation models.
Low-Level Video Geometry Understanding
The survey's first major axis covers the prediction and exploitation of physically-grounded geometric representations from RGB video sequences, with an emphasis on depth estimation, camera pose estimation, optical flow, and long-term point tracking. Recent advances have moved away from per-task, optimization-heavy approaches toward unified, data-driven feed-forward architectures.
Video Depth Estimation is categorized into inference-time alignment (post-hoc optimization, which can be brittle under scene motion), feed-forward prediction (direct network-based temporal coupling), and diffusion-based generative methods (leveraging video diffusion priors). Strong quantitative improvements are demonstrated, with models like DepthFormer [li2023depthformer] and fast feed-forward architectures (e.g., DUSt3R [wang2024dust3r]) delivering high geometric fidelity and temporal stability, while diffusion models further enhance realism, albeit at higher compute cost.
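The inference-time alignment family typically fits a per-frame (or per-sequence) scale and shift that maps relative, affine-invariant depth predictions onto a metric reference via least squares. A minimal sketch of that alignment step, with illustrative names and toy data (not code from any surveyed system):

```python
import numpy as np

def align_scale_shift(pred, ref, mask=None):
    """Least-squares scale s and shift t such that s * pred + t ~= ref.

    A common inference-time alignment step for relative monocular depth
    against sparse metric reference depth; `mask` selects valid pixels.
    """
    p, r = pred.ravel(), ref.ravel()
    if mask is not None:
        m = mask.ravel()
        p, r = p[m], r[m]
    A = np.stack([p, np.ones_like(p)], axis=1)  # [N, 2] design matrix
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s, t

# Toy check: recover a known affine relation between prediction and reference.
rng = np.random.default_rng(0)
pred = rng.uniform(0.1, 1.0, size=(4, 4))
ref = 2.5 * pred + 0.3
s, t = align_scale_shift(pred, ref)
```

Because the fit is closed-form per frame, it is cheap but cannot correct temporally inconsistent or motion-corrupted predictions, which is what feed-forward video models address.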
Camera Pose Estimation remains anchored by two paradigms: classical correspondence-and-solver pipelines and end-to-end pose regression. The former provides geometry-grounded accuracy through robust feature matching and minimal solvers, forming the basis of most high-precision SLAM/visual localization systems. Pose regression, especially with recent scaling in model and dataset size, narrows the gap, providing efficient, flexible alternatives.
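The correspondence-and-solver pipeline reduces relative pose to estimating the essential matrix from matched points, then decomposing it. A self-contained sketch of the classical (unnormalized) eight-point solver on synthetic noiseless correspondences; variable names and the toy scene are illustrative:

```python
import numpy as np

def essential_eight_point(x1, x2):
    """Eight-point estimate of the essential matrix from normalized
    image coordinates x1, x2 (each [N, 2], N >= 8)."""
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    # Each correspondence contributes one row of the epipolar constraint
    # x2^T E x1 = 0, linear in the nine entries of E.
    A = np.stack([u2*u1, u2*v1, u2, v2*u1, v2*v1, v2,
                  u1, v1, np.ones_like(u1)], axis=1)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)               # null vector of A
    # Project onto the essential manifold: two equal singular values, one zero.
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Synthetic two-view scene: points in front of both cameras.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(20, 3)) + np.array([0.0, 0.0, 5.0])
th = 0.1
R = np.array([[np.cos(th), 0, np.sin(th)],
              [0, 1, 0],
              [-np.sin(th), 0, np.cos(th)]])
t = np.array([0.5, 0.1, 0.0])
x1 = X[:, :2] / X[:, 2:]                   # camera 1 at the origin
Xc2 = X @ R.T + t
x2 = Xc2[:, :2] / Xc2[:, 2:]
E = essential_eight_point(x1, x2)
h1 = np.concatenate([x1, np.ones((20, 1))], axis=1)
h2 = np.concatenate([x2, np.ones((20, 1))], axis=1)
residual = np.abs(np.einsum('ni,ij,nj->n', h2, E, h1)).max()
```

Production systems wrap such minimal solvers in RANSAC over noisy feature matches; pose-regression networks instead predict R and t directly from pixels.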
Optical Flow and Point Tracking research has evolved from short-range, dense flow estimation to long-horizon, identity-preserving point tracking (TAP paradigm), addressing challenges of occlusion and viewpoint change. Modern global trackers integrate foundation model representations, memory mechanisms, and multi-point tracking (e.g., TAPIR [doersch2023tapir], CoTracker3 [karaev2024cotracker3]), achieving robust, temporally coherent performance.
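At the core of any tracker is a matching cost between a query point's appearance and candidate locations in later frames. A deliberately simple stand-in for that inner loop, exhaustive patch matching by sum of squared differences on a synthetic shifted image (modern trackers replace this with learned features, correlation volumes, and memory, but the matching principle is the same):

```python
import numpy as np

def track_point(frame0, frame1, pt, patch=3, search=5):
    """Short-range point tracking by exhaustive patch matching (SSD).

    `pt` is (row, col); returns the best-matching (row, col) in frame1
    within a +/- `search` pixel window. A toy stand-in for the matching
    cost inside modern point trackers.
    """
    y, x = pt
    tmpl = frame0[y-patch:y+patch+1, x-patch:x+patch+1]
    best, best_pt = np.inf, pt
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            cand = frame1[yy-patch:yy+patch+1, xx-patch:xx+patch+1]
            if cand.shape != tmpl.shape:   # skip windows clipped by the border
                continue
            cost = np.sum((cand - tmpl) ** 2)
            if cost < best:
                best, best_pt = cost, (yy, xx)
    return best_pt

# Synthetic check: a textured image shifted down 2 rows and left 1 column.
rng = np.random.default_rng(2)
img = rng.uniform(size=(40, 40))
shifted = np.roll(np.roll(img, 2, axis=0), -1, axis=1)
tracked = track_point(img, shifted, (20, 20))
```

The TAP paradigm extends exactly this matching problem to long horizons, where appearance drift and occlusion make raw patch costs insufficient and memory/re-identification mechanisms become necessary.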
Joint Feed-Forward Geometry Models represent a significant milestone, enabling simultaneous prediction of multiple geometric primitives (camera pose, depth, correspondences) within unified networks (e.g., VGGT [wang2025vggt], π3 [wang2025pi], CUT3R). This delivers mutual consistency, efficiency, and scalability, and serves as a bridge to higher-level semantic and generative modeling. Notable is the extension to dynamic and streaming-capable architectures, facilitating real-time updates in non-rigid, long-horizon video streams.
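The joint feed-forward pattern can be pictured as one shared per-frame feature tensor feeding several lightweight geometry heads, so that depth, pose, and correspondences are predicted from the same representation. The sketch below is purely illustrative (toy linear heads and shapes of our own choosing, not any surveyed model's design):

```python
import numpy as np

class JointGeometryHeads:
    """Toy illustration of the joint feed-forward pattern: a shared
    feature tensor drives depth, pose, and matching-descriptor heads.
    Heads and dimensions are illustrative, not a specific architecture."""

    def __init__(self, dim, rng):
        self.w_depth = rng.normal(size=(dim, 1))    # per-pixel depth
        self.w_pose = rng.normal(size=(dim, 7))     # quaternion + translation
        self.w_match = rng.normal(size=(dim, dim))  # matching embeddings

    def __call__(self, feats):  # feats: [T, H, W, dim] from a shared backbone
        depth = feats @ self.w_depth                    # [T, H, W, 1]
        pose = feats.mean(axis=(1, 2)) @ self.w_pose    # [T, 7], one per frame
        match = feats @ self.w_match                    # [T, H, W, dim]
        return depth, pose, match

rng = np.random.default_rng(3)
heads = JointGeometryHeads(16, rng)
feats = rng.normal(size=(2, 4, 4, 16))
depth, pose, match = heads(feats)
```

Because all heads read the same features, their predictions are trained to be mutually consistent, which is the advantage the survey attributes to models like VGGT over stitched-together per-task pipelines.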
High-Level Video Semantic Understanding
At the semantic level, the survey covers segmentation, tracking, and temporal grounding, focusing on spatiotemporal reasoning and cross-modal integration.
Video Segmentation is organized into class-aware, open-vocabulary, and class-agnostic paradigms. Initial approaches relied on frame-to-frame propagation and optical flow, later superseded by transformer-based and diffusion-powered methods capable of leveraging large-scale data, language grounding, and generalized prompts (e.g., SAM2 [ravi2024sam2], SAM3 [carion2025sam], open-vocabulary segmentation [wang2023towards]). This evolution supports consistent object identities, panoptic segmentation, and context-aware semantics over time.
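The early propagation-based paradigm can be reduced to one operation: carry the previous frame's mask forward along a dense flow field. A minimal nearest-neighbour version on a toy translation flow (function and flow convention are our own illustration):

```python
import numpy as np

def propagate_mask(mask, flow):
    """Propagate a binary mask to the next frame along a dense flow field.

    flow[y, x] = (dy, dx) maps pixel (y, x) in frame t to frame t+1.
    Nearest-neighbour splatting; a toy version of early propagation-based
    video object segmentation.
    """
    H, W = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    ny = np.clip(np.round(ys + flow[ys, xs, 0]).astype(int), 0, H - 1)
    nx = np.clip(np.round(xs + flow[ys, xs, 1]).astype(int), 0, W - 1)
    out[ny, nx] = 1
    return out

# Toy check: a 2x2 mask translated one row down, two columns right.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:4, 2:4] = 1
flow = np.zeros((8, 8, 2))
flow[..., 0] = 1.0   # every pixel moves one row down
flow[..., 1] = 2.0   # ...and two columns right
moved = propagate_mask(mask, flow)
```

The weakness is apparent even in the sketch: errors accumulate frame by frame and occluded pixels are never recovered, which motivated the shift to transformer-based models with explicit memory such as SAM2.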
Video Object Tracking has moved from RGB-only, appearance-based matching to unified, multimodal and modality-agnostic trackers, capable of utilizing complementary sensory information (e.g., depth, thermal, event cameras). State-of-the-art approaches employ shared-transformer backbones, robust multimodal fusion (e.g., FlexTrack [tan2025you], ViPT [zhu2023visual]), and adaptive state-space models (SSMs) for long-term context. Evaluation under missing-modality regimes reveals strong but not yet fully robust cross-domain generalization, underscoring ongoing challenges in sensor invariance and dynamic adaptation.
Video Temporal Grounding (VTG) demands fine-grained temporal localization in response to complex language queries. The field has moved from proposal-based, fully supervised regressors to multimodal LLM (MLLM)-based models incorporating pretraining, instruction tuning, and RL-based objectives (e.g., VTimeLLM, TimeChat, Time-R1 [wang2025time]). These models support open-vocabulary, zero-shot, and reasoning-centered grounding. However, limitations in temporal resolution, semantic generalization, and inference efficiency remain, particularly for training-free, zero-shot inference pipelines.
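The proposal-based baseline that MLLM approaches displaced is easy to state: score every candidate temporal window by the similarity between its pooled visual feature and the query embedding, and return the best span. A minimal sketch on synthetic features (window size, pooling, and the toy data are our own assumptions):

```python
import numpy as np

def ground_query(clip_feats, query_feat, win=4):
    """Score every sliding window of `win` frames by cosine similarity
    between its mean-pooled feature and the query embedding; return the
    best (start, end) span. A minimal proposal-based grounding sketch."""
    T = clip_feats.shape[0]
    best_score, best_span = -np.inf, (0, win)
    for s in range(T - win + 1):
        pooled = clip_feats[s:s + win].mean(axis=0)
        score = pooled @ query_feat / (
            np.linalg.norm(pooled) * np.linalg.norm(query_feat))
        if score > best_score:
            best_score, best_span = score, (s, s + win)
    return best_span

# Toy check: frames 5..8 share the query's direction, the rest are orthogonal.
T, D = 12, 8
feats = np.zeros((T, D))
feats[:, 0] = 1.0          # background direction
feats[5:9] = 0.0
feats[5:9, 1] = 1.0        # event direction
query = np.zeros(D)
query[1] = 1.0
span = ground_query(feats, query, win=4)
```

Enumerating fixed-size windows is exactly where the paradigm's limits show: temporal resolution is bounded by the window grid, and cost grows with clip length, both of which motivate the regression- and reasoning-based alternatives discussed above.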
Unified Video Understanding Models
Addressing the growing need for holistic video agents, recent research is consolidating geometry and semantics (and, increasingly, generation) within single unified architectures.
VideoQA benchmarks and architectures have evolved to test not only descriptive recognition but diagnostic, causal, and long-form reasoning—culminating in benchmarks like EgoSchema and CinePile that require multi-hop and "needle-in-a-haystack" retrieval. Large multimodal models (LMMs) and neuro-symbolic hybrids (e.g., Video-LLaVA [lin2024video], Cambrian-S [yang2025cambrian], MoReVQA) scale both context and interpretability, but confront diminishing returns from naive context expansion. Explicit world modeling and predictive attention emerge as requisite capabilities for robust, scalable VideoQA.
Unified Video Understanding and Generation Models (UMMs) now span assembled (e.g., HuggingGPT [shen2023hugginggpt]), native AR transformer-based, and hybrid (e.g., Show-o2 [xie2025showo2], BAGEL [deng2025emerging]) architectures. These systems support end-to-end QA, captioning, and video synthesis/editing within a single interface. AR designs offer training unification but are constrained by context limitations and quality-versus-efficiency trade-offs; hybrid models (AR + diffusion/flow) achieve enhanced realism and consistency at the expense of architectural complexity and of balancing instruction faithfulness with fidelity. Modular tool systems trade end-to-end coherence for practical extensibility. Despite rapid progress, performance degradation over long horizons, instruction alignment, and robustness to prompt variability remain key bottlenecks.
Implications and Outlook
The consolidation of geometric and semantic video understanding—increasingly unified with video generation—marks a shift toward true world models that can actively perceive, reason, and predict. Architectures highlighted in this survey are foundational for general-purpose, interactive video agents, and underpin advances in robotics, spatial reasoning, and embodied AI.
However, several open challenges must be addressed:
- Integration of Geometry and Semantics: Achieving reliable, mutually reinforcing representations remains unsettled, especially under dynamic, ambiguous conditions.
- Memory and Scalability: Long-horizon reasoning and stateful modeling are impeded by context and compute limitations. Principled advances in memory architectures, adaptive state representations, and streaming computation are critical.
- Uncertainty, Decision-Making, and World Modeling: Future models must accommodate stochastic environments, facilitating not only deterministic prediction but also hypotheticals, planning, and agent-centric interaction.
- Unified Evaluation and Robustness: Real-world deployment exposes multimodal systems to lossy sensors, partial observability, and adversarial conditions, motivating advances in robustness, cross-modal compensation, and uncertainty estimation.
Conclusion
This survey provides an in-depth taxonomy and synthesis of current developments in video understanding, emphasizing the convergence from task-specific to unified, multimodal models integrating geometry, semantics, and synthesis. The field is moving decisively towards large-scale, memory-centric, predictive world models supporting long-horizon reasoning, active perception, and robust decision making. Continued progress along these lines will be central to building reliable, general-purpose video agents and scalable video foundation models for real-world applications.