Post-Training Methods for Video-LMMs
- Post-training for Video-LMMs is a set of advanced techniques that refine basic visual-language models into sophisticated video reasoning engines.
- Supervised fine-tuning with chain-of-thought supervision improves interpretability by generating explicit reasoning steps anchored in video evidence.
- Test-time scaling strategies, including beam search and iterative reasoning, boost performance on long, complex videos by optimizing inference processes.
Post-training methodologies for Video-Large Multimodal Models (Video-LMMs) refer to the suite of techniques that transform a pretrained visual-language system with basic perception capabilities into a sophisticated, video-aware reasoning engine. These methods operate after initial large-scale pretraining and are crucial for bridging the gap between perceptual understanding and advanced video reasoning, including temporal localization, spatiotemporal grounding, and efficient handling of long, multimodal content. The core pillars of post-training in this context are supervised fine-tuning (SFT), reinforcement learning (RL) with verifiable objectives, and test-time scaling (TTS) through advanced inference strategies. Together, they constitute a structured taxonomy that enables systematic advancement of Video-LMM capabilities (Tang et al., 6 Oct 2025).
1. Supervised Fine-Tuning (SFT) with Chain-of-Thought
The initial post-training stage typically uses supervised fine-tuning, where the Video-LMM is explicitly trained to produce structured, intermediate reasoning steps before emitting a final answer. This is achieved through chain-of-thought (CoT) supervision, in which each target output consists of interconnected “think” and “answer” tokens grounded in visual evidence (e.g., frame numbers, scene graphs). The SFT objective for these sequences can be expressed as:
$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\left[\sum_{t=1}^{|y|} \log p_{\theta}\bigl(y_t \mid y_{<t},\, x\bigr)\right]$$

where $\theta$ denotes the trainable parameters, $x$ the input video and prompt, $y$ the CoT token trajectory, and $\mathcal{D}$ the SFT dataset (Tang et al., 6 Oct 2025).
CoT supervision is sourced either from manual annotation or automatically synthesized using video metadata. This step improves interpretability by enforcing explicit, multi-step reasoning, and it provides a robust initialization for subsequent RL-driven optimization phases.
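A minimal PyTorch sketch of this CoT SFT objective, assuming a causal Video-LMM with a HuggingFace-style `.logits` output; the video/prompt tokens are masked so only the think/answer trajectory contributes to the loss. The `model` handle and `prompt_len` argument are illustrative assumptions, not names from the survey.

```python
import torch
import torch.nn.functional as F

def cot_sft_loss(model, input_ids, prompt_len):
    """Negative log-likelihood over the CoT ("think") and answer tokens only.

    input_ids: (B, T) concatenation of [video/prompt tokens | think + answer tokens]
    prompt_len: number of leading tokens (video features + question) excluded from the loss.
    """
    logits = model(input_ids).logits          # (B, T, V), assumed HF-style output
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask the prompt region: only CoT + answer tokens contribute to the loss.
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```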
2. Reinforcement Learning from Verifiable Objectives
After SFT, the pipeline transitions to reinforcement learning, which optimizes the model toward outcomes that can be objectively validated, such as spatiotemporal grounding or answer correctness. The survey highlights Group Relative Policy Optimization (GRPO) as a canonical RL algorithm, which computes rewards for each output rollout based on verifiable criteria (e.g., temporal overlap with ground truth):
$$R(o) = \sum_{i} \lambda_i\, r_i(o)$$

where each $r_i(o)$ could be, for example, temporal intersection-over-union (IoU), presence of well-formed <think>/<answer> tokens, or semantic match. The coefficients $\lambda_i$ are non-negative and sum to one.
A typical temporal localization reward is:
$$r_{\mathrm{temp}} = \mathrm{IoU}(\hat{s}, s^{*}) + \sum_{k} \beta_k\, \mathbb{1}\bigl[\mathrm{IoU}(\hat{s}, s^{*}) \ge \tau_k\bigr] - \gamma\,\psi(|\hat{s}|)$$

where $\hat{s}$ is the predicted interval, $s^{*}$ is the ground truth, $\beta_k$ are bonus coefficients, $\tau_k$ are IoU thresholds, and the term $\gamma\,\psi(|\hat{s}|)$ penalizes degenerate segment lengths.
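A hedged Python sketch of this reward shape, combining the IoU base term, threshold bonuses, and length penalty described above; all coefficient and threshold values are illustrative rather than taken from the survey.

```python
def temporal_iou(pred, gt):
    """tIoU between predicted and ground-truth intervals, each a (start, end) pair in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def temporal_reward(pred, gt, thresholds=(0.3, 0.5, 0.7), bonus=0.1,
                    min_len=0.5, len_penalty=0.2):
    """IoU base reward + bonuses for clearing IoU thresholds - penalty for degenerate segments."""
    iou = temporal_iou(pred, gt)
    reward = iou
    reward += sum(bonus for tau in thresholds if iou >= tau)   # threshold bonuses
    if (pred[1] - pred[0]) < min_len:                          # degenerate-length penalty
        reward -= len_penalty
    return reward
```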
GRPO groups rollouts per prompt and computes normalized advantages within each group, optimizing a clipped surrogate objective with KL regularization; related algorithms (e.g., PPO, DPO) pursue the same policy-optimization goal with different advantage or preference formulations. A critical principle is the use of purely verifiable objectives, leveraging features such as timestamps or region annotations, thus minimizing reliance on subjective or preference-based judgments (Tang et al., 6 Oct 2025).
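The sketch below illustrates the group-relative advantage and the clipped, KL-regularized surrogate at the sequence level. It assumes summed per-rollout log-probabilities are already available; the hyperparameters are illustrative, and practical GRPO implementations often use token-level ratios and alternative KL estimators.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """GRPO-style surrogate for one prompt's group of G rollouts.

    logp_new / logp_old / logp_ref: (G,) summed log-probs of each rollout under the
        current, behavior, and frozen reference (SFT) policies.
    rewards: (G,) verifiable rewards (e.g., tIoU-based) for the rollouts.
    """
    # Group-normalized advantage: compare each rollout against its siblings.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = torch.min(ratio * adv, clipped * adv)
    # Simple KL penalty toward the reference policy (other estimators are common).
    kl = logp_new - logp_ref
    return -(policy_term - kl_coef * kl).mean()
```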
3. Test-Time Scaling (TTS) and Enhanced Inference
Test-time scaling encompasses a set of inference-stage strategies that improve output quality by allocating additional computation or adopting higher-level reasoning schemes. These involve:
- Beam Search: Traverses multiple candidate answer sequences to increase answer fluency and correctness.
- Video Chain-of-Thought Prompting: Iteratively decomposes long-form tasks into manageable subproblems, invoking multi-stage reasoning and aggregating sub-answers.
- Self-Consistency Decoding: Samples multiple reasoning paths and aggregates the results (e.g., majority vote or weighted aggregation) to select the most probable output.
- Confidence-Based Iterative Reasoning: Monitors model uncertainty, triggering output refinement or additional evidence collection until confidence thresholds are met.
- Monte Carlo Tree Search (MCTS): Systematically explores the space of possible next steps in open-ended tasks such as video captioning.
TTS strategies are especially critical when handling extended video durations and complex, multimodal query types, as they facilitate targeted computational allocation and multimodal evidence integration during inference (Tang et al., 6 Oct 2025).
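As a concrete example of one TTS strategy, the following self-consistency decoding sketch samples several reasoning paths and majority-votes the final answers; `generate` and `extract_answer` are hypothetical stand-ins for a Video-LMM's sampling and answer-parsing routines.

```python
from collections import Counter

def self_consistent_answer(generate, extract_answer, video, question,
                           n_samples=8, temperature=0.7):
    """Sample n reasoning paths and return the majority-vote answer.

    generate(video, question, temperature) -> full CoT + answer string (hypothetical sampler)
    extract_answer(text) -> the final answer span, e.g. the <answer>...</answer> content
    """
    answers = []
    for _ in range(n_samples):
        rollout = generate(video, question, temperature=temperature)
        answers.append(extract_answer(rollout))
    votes = Counter(answers)
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples   # answer plus its empirical agreement rate
```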
4. Taxonomy, Design Principles, and Interconnections
The taxonomy of video post-training comprises three principal pillars:
| Pillar | Functions and Focus | Video-Specific Innovations |
|---|---|---|
| Supervised Fine-Tuning | Modality integration, CoT fine-tuning, domain alignment | Explicit chain-of-thought for spatiotemporal grounding |
| Reinforcement Learning | Policy optimization, reward design | Verifiable (not preference) RL, temporal/spatial rewards |
| Test-Time Scaling | Inference computation allocation, iterative refinement | Multi-evidence fusion, chain-of-thought at inference |

Key design insights include:
- Intermediate reasoning steps should be grounded in video evidence (e.g., frame indices, key region identification).
- SFT “warms up” the model in a structured regime; RL stages exploit task-aligned, complex objectives.
- Extra computation at test time (TTS) is vital for scaling to long videos and multimodal inputs.
- Comparative studies should report cost/performance trade-offs (frames processed, inference speed).
5. Evaluation Protocols, Benchmarks, and Open Challenges
Post-training methodologies are evaluated using an array of domain-specific benchmarks:
- General Video QA: Datasets testing instruction following, reasoning, and multimodal integration (e.g., MMVU, VideoReasonBench).
- Grounding-Centric: Charades-STA, ActivityNet Grounding for event/object localization.
- Long/Streaming: LongVideo-Reason-eval, HLV-1K, and streaming frameworks, assessing efficiency, viewing budget, and accuracy under resource constraints.
Metrics include overall accuracy, temporal IoU (tIoU), region-based IoU, Recall@K, and explicit cost-performance indicators (e.g., number of frames, time-to-answer).
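For reference, a minimal sketch of Recall@K at a tIoU threshold, one of the most common grounding metrics; individual benchmarks may define the exact protocol differently.

```python
def temporal_iou(pred, gt):
    """tIoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truths, k=1, tiou_thresh=0.5):
    """Fraction of queries whose top-k predicted segments contain at least one
    segment with tIoU >= threshold against the ground-truth interval.

    predictions: list (one per query) of ranked (start, end) segments
    ground_truths: list of (start, end) ground-truth intervals
    """
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= tiou_thresh for p in preds[:k]):
            hits += 1
    return hits / max(len(ground_truths), 1)
```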
Open research challenges identified in the survey include the instability/variance in RL reward signals, scalability to truly long videos, reward misspecification, and cost-effectiveness of TTS protocols. The resource at https://github.com/yunlong10/Awesome-Video-LMM-Post-Training curates ongoing updates and implementation details (Tang et al., 6 Oct 2025).
6. Synthesis and Future Outlook
Post-training for Video-LMMs is a structured, multi-stage process that progressively transitions the model from perception to high-level video reasoning. It synthesizes supervised, explicit reasoning via chain-of-thought SFT; flexible, verifiably-aligned optimization via RL; and robust, cost-aware answer generation through TTS. Each stage is adapted to unique video-centric challenges such as temporal localization, long-range dependency modeling, and multimodal evidence fusion.
While substantial progress has been achieved, the field continues to face open questions in reward design, especially for verifiable, efficient, and generalizable objectives, as well as scalability in both model and inference computation. Advanced evaluation on targeted benchmarks and further innovation in TTS and RL reward architectures are primary directions for the next phase of research.