BaGLM: Bayesian Grounding with Multimodal Models
- The paper presents a training-free, online Bayesian filtering framework that refines zero-shot LMM predictions for accurate video step grounding.
- It employs a dependency matrix to model structural and temporal transitions, enhancing the alignment of sequential video segments with natural language instructions.
- Empirical results demonstrate significant improvements, with gains up to 14.2% in step grounding accuracy over traditional supervised methods.
Bayesian Grounding with Large Multimodal Models (BaGLM) refers to the integration of probabilistic filtering principles with large-scale multimodal model inference for the purpose of temporally and semantically aligning sequential input (e.g., streaming video frames) to structured concepts or procedural steps, typically specified in natural language. The approach emphasizes online, training-free operation by leveraging zero-shot capabilities of large multimodal models (LMMs) and injects temporal awareness and structural prior knowledge via Bayesian mechanisms such as dependency matrices and predictive filtering. BaGLM is distinguished from prior methods by its ability to operate without offline task-specific training and its explicit modeling of step transitions and progress using Bayesian filtering, resulting in state-of-the-art performance in streaming video step grounding tasks (Zanella et al., 19 Oct 2025).
1. Conceptual Foundations
The BaGLM paradigm emerges from the challenge of Video Step Grounding (VSG), where a system must identify which steps—described in natural language—are performed in a continuously streaming video. Previous methods require labeled datasets and offline processing; BaGLM adopts a training-free approach utilizing powerful LMMs that generalize via zero-shot inference. The essential idea is to replace independent per-segment LMM predictions with a temporally coherent estimate—updated at each timestep using Bayesian filtering—which integrates predictions from the past, models structural priors over possible step transitions, and incorporates evidence from the current video segment.
Bayesian grounding operates in two stages: (i) prediction, which propagates beliefs via a transition model built from LLM-extracted dependencies among steps, and (ii) update, which integrates frame-wise multimodal likelihoods. The overall process recursively refines the probability that a given step is present through time, informed both by semantic dependencies and current observations.
2. Mathematical Formalization
BaGLM formalizes step grounding as a sequence of filtered posterior distributions over actions. For time $t$, given the action set $\mathcal{A} = \{a_1, \dots, a_N\}$ and observed segment $o_t$, the belief state $b_t$ is recursively computed as:

$$b_t(a_i) = \frac{1}{Z_t}\, p(o_t \mid a_i) \sum_{j=1}^{N} T_t(a_j, a_i)\, b_{t-1}(a_j),$$

where:
- $p(o_t \mid a_i)$ is the likelihood from the LMM over possible actions for the current segment,
- $T_t(a_j, a_i)$ is the step transition probability,
- $b_{t-1}(a_j)$ is the belief at the previous timestep,
- $Z_t$ is the normalization factor.
The transition probabilities are modulated by a dependency matrix $D \in \{0,1\}^{N \times N}$, extracted offline via LLM querying, which encodes prerequisite relations among steps ($D_{ji} = 1$ if step $a_j$ must precede step $a_i$). Readiness and validity scores further modulate transition weights, capturing whether all prerequisites for a step have been met and whether its successors remain incomplete. With $\pi_t(a_j) \in [0,1]$ denoting the estimated progress of step $a_j$ at time $t$:

$$r_t(a_i) = \prod_{j:\, D_{ji} = 1} \pi_t(a_j), \qquad v_t(a_i) = 1 - \max_{j:\, D_{ij} = 1} \pi_t(a_j),$$

with both scores defaulting to 1 for steps without prerequisites or successors, respectively.
The adjusted dynamic transition matrix is defined as:

$$T_t(a_j, a_i) \propto T(a_j, a_i)\, r_t(a_i)\, v_t(a_i),$$

with each row renormalized to sum to one. This enables full Bayesian temporal filtering, using both model output and task structure.
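The recursion maps directly onto a few lines of code. Below is a minimal sketch, assuming NumPy arrays; the function name `bayes_filter_step` is chosen here for illustration:

```python
import numpy as np

def bayes_filter_step(belief_prev, T_t, lmm_likelihood):
    """One predict-update cycle of the step-grounding recursion above.

    belief_prev:    (N,) belief b_{t-1} over steps
    T_t:            (N, N) dynamic transition matrix, T_t[j, i] = T_t(a_j, a_i)
    lmm_likelihood: (N,) observation model p(o_t | a_i) from the zero-shot LMM
    """
    prior = belief_prev @ T_t            # predict: sum_j T_t(a_j, a_i) b_{t-1}(a_j)
    posterior = lmm_likelihood * prior   # update: weight by current LMM evidence
    return posterior / posterior.sum()   # normalize (the 1/Z_t factor)
```

Starting from a uniform belief `np.full(N, 1.0 / N)`, repeated calls to this function implement the full filter over a stream of segments.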
3. Zero-Shot LMM Inference
BaGLM leverages the zero-shot capacity of LMMs for per-segment classification. For each current segment, a multiple-choice prompt is constructed containing all candidate step descriptions (plus a “none” option), and the LMM predicts a step likelihood vector without any task- or domain-specific tuning. These likelihoods serve as the observation model $p(o_t \mid a_i)$ in the Bayesian update.
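As an illustration, the sketch below assembles such a prompt and reads off a likelihood vector. The prompt template, the `lmm.option_logits` interface, and the softmax readout are assumptions made for this sketch; the paper specifies only the multiple-choice format with a “none” option:

```python
import numpy as np

def build_step_prompt(step_descriptions):
    """Assemble a multiple-choice prompt over candidate steps plus a 'none' option."""
    options = step_descriptions + ["None of the above steps is being performed."]
    choices = "\n".join(f"({chr(ord('A') + k)}) {d}" for k, d in enumerate(options))
    return ("Which step is being performed in this video segment? "
            "Answer with a single option letter.\n" + choices)

def step_likelihoods(lmm, frames, step_descriptions):
    """Return a normalized likelihood vector over all options (including 'none').

    `lmm.option_logits` is a hypothetical interface: any LMM that exposes
    logits over the candidate answer tokens can serve as the observation model.
    """
    logits = lmm.option_logits(frames, build_step_prompt(step_descriptions))
    probs = np.exp(logits - logits.max())   # softmax over option letters
    return probs / probs.sum()
```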
Empirical results demonstrate that zero-shot LMM predictions already outperform many training-based methods when applied independently. However, temporal inconsistencies may arise when segments are ambiguous or lack explicit cues; the Bayesian filtering corrects these errors by considering plausible transitions conditioned on history.
4. Incorporation of Step Transition Structure
A key contribution of BaGLM is its integration of step dependencies as a structural prior in the filtering process. The dependency matrix $D$ is constructed by querying an LLM for step-to-step relationships (e.g., “Is step $a_j$ a prerequisite for $a_i$?”), with a threshold applied to obtain binary assignments. This encodes which steps must precede others. Step “progress” estimates, obtained via LLM output or auxiliary prompts, are tracked for each step to dynamically adjust the transition probabilities.
Step readiness quantifies if prerequisites are complete, while step validity indicates if successors remain unfinished. Multiplying these factors yields a time-varying transition model that respects task structure, improving grounding accuracy especially in ambiguous or repetitive instructional scenarios.
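A minimal sketch of this machinery follows, assuming the readiness/validity definitions from Section 2 (product of prerequisite progress, complement of maximum successor progress); function names are illustrative:

```python
import numpy as np

def readiness_validity(D, progress):
    """Per-step readiness and validity from the dependency matrix.

    D:        (N, N) binary matrix, D[j, i] = 1 if step a_j is a prerequisite of a_i
    progress: (N,) estimated completion pi_t in [0, 1] for each step
    """
    N = len(progress)
    readiness = np.ones(N)
    validity = np.ones(N)
    for i in range(N):
        prereqs = np.nonzero(D[:, i])[0]          # steps that must precede a_i
        if prereqs.size:
            readiness[i] = progress[prereqs].prod()
        succs = np.nonzero(D[i, :])[0]            # steps that a_i unlocks
        if succs.size:
            validity[i] = 1.0 - progress[succs].max()
    return readiness, validity

def dynamic_transition(T, readiness, validity):
    """Modulate a base transition matrix column-wise and renormalize each row."""
    T_t = T * readiness[None, :] * validity[None, :]
    return T_t / np.maximum(T_t.sum(axis=1, keepdims=True), 1e-12)
```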
5. Online Bayesian Filtering for Streaming Video
BaGLM operates online as new video segments arrive (a compact loop sketch follows this list):
- Predict Step: The filtered prior over steps is propagated via the adjusted transition matrix according to observed progress.
- Update Step: For each segment, the LMM likelihood vector is computed, and the prior is updated using Bayes’ rule.
- Belief Refinement: The posterior distribution informs both the step label and confidence at each timestep, continuously reflecting both new evidence and accumulated history.
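Combining the pieces, a hedged sketch of the online loop is shown below. It reuses the helper functions from the earlier sketches; `estimate_progress` is a placeholder for the paper's LLM-based progress tracking:

```python
import numpy as np

def estimate_progress(lmm, frames, steps):
    # Hypothetical hook: the paper tracks per-step progress via auxiliary
    # LMM prompts; a constant vector stands in for that machinery here.
    return np.full(len(steps), 0.5)

def ground_stream(segments, lmm, steps, D, T):
    """Yield (most-likely step index, confidence) for each incoming segment."""
    N = len(steps)
    belief = np.full(N, 1.0 / N)                        # uniform initial belief
    for frames in segments:                             # online: one segment at a time
        r, v = readiness_validity(D, estimate_progress(lmm, frames, steps))
        T_t = dynamic_transition(T, r, v)               # structure-aware predict step
        lik = step_likelihoods(lmm, frames, steps)[:N]  # drop the 'none' entry
        belief = bayes_filter_step(belief, T_t, lik)    # predict + update
        yield int(belief.argmax()), float(belief.max())
```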
This methodology supports real-time processing in applications such as cooking guidance, maintenance support, or AR-based instruction, with no need for system retraining or dataset annotation.
6. Empirical Results and Performance
Experimental evaluations on HT-Step, CrossTask, and Ego4D Goal-Step datasets show that:
- Zero-shot LMM predictions (e.g., InternVL2.5-8B) match or surpass state-of-the-art training-based models in offline settings.
- BaGLM’s Bayesian filtering yields substantial further improvements: on HT-Step, a 4.3 percentage point gain over NaSVA; on CrossTask/Ego4D, 13.1–14.2% improvements. These gains are achieved with pure online inference and without additional training.
This demonstrates that online Bayesian grounding with structured priors and temporal filtering successfully leverages the strengths of modern LMMs while addressing limitations of prior approaches.
7. Implications and Future Directions
BaGLM shows that combining model-based predictions from large multimodal systems with Bayesian principles for temporal and structural reasoning can outperform both naive zero-shot approaches and traditional fully supervised methods. Future work may focus on:
- Enhancing the granularity of structural priors (e.g., hierarchical task graphs).
- Extending Bayesian filtering to more complex inputs such as multimodal sensor streams.
- Improving progress estimation via self-consistent LMM mechanisms or uncertainty quantification for action boundaries.
BaGLM’s training-free, adaptive architecture is particularly suited for scalability and deployment in annotation-scarce or rapidly evolving domains, providing a robust solution for streaming multimodal grounding in procedural video analysis (Zanella et al., 19 Oct 2025).