LiveMamba: Streaming Multi-modal LLM
- LiveMamba is a streaming multi-modal LLM that provides real-time, interactive guidance by processing continuous video streams during procedural tasks.
- It integrates asynchronous vision and language modules to autonomously issue feedback and detect execution errors without explicit user prompts.
- Evaluated on the Qualcomm Interactive Cooking benchmark, it significantly improves instruction completion accuracy and mistake detection over turn-based approaches.
LiveMamba is a streaming multi-modal LLM architecture developed to deliver real-time, interactive step-by-step guidance for procedural tasks, first evaluated in the context of egocentric cooking videos. Unlike traditional turn-based video-LLMs, LiveMamba operates asynchronously on continuous video streams, autonomously issuing instructions, detecting execution, and rapidly alerting users to mistakes, all without the need for explicit prompts. Its development addresses fundamental limitations in turn-based approaches for live situated task coaching, establishing new methods and evaluation protocols for granular, responsive AI guidance (Bhattacharyya et al., 27 Nov 2025).
1. Motivation and System Overview
The primary challenge motivating LiveMamba is the inadequacy of turn-based multi-modal LLMs in live instructional scenarios. Conventional approaches require explicit user queries and process static video chunks, producing responses in a query–response loop. This paradigm precludes both timely feedback and the detection of subtle or immediate execution errors during tasks.
LiveMamba’s design supports:
- Autonomous issuance of the next instruction without waiting for user input.
- Live monitoring of video streams to detect completion of instructions.
- Immediate alerting on mistakes, even when subtle (e.g., incorrect measurement).
- Fine-grained “when-to-speak” control to decide at which exact frame to emit instructions, success feedback, or mistake alerts.
- Efficient and low-latency end-to-end inference over long egocentric video streams.
- Adaptive plan handling using an external re-planner in scenarios where the user diverges from the plan, such as out-of-order or missing steps.
These requirements necessitate a streaming, asynchronous inference pipeline, as well as training data reflecting both correct and erroneous executions.
2. Dataset and Benchmarking: Qualcomm Interactive Cooking
LiveMamba is evaluated on the Qualcomm Interactive Cooking benchmark, which builds upon the CaptainCook4D dataset of 384 egocentric cooking videos paired with graph-structured, temporally segmented recipes. The key enhancements in the benchmark are:
- Timed instructions: Each instruction is timestamped to coincide with the start of the corresponding action group.
- Success feedback: Acknowledgments are timestamped at action end, confirming completion.
- Mistake alerts: Mistakes receive precise visual timestamps of their occurrence (excluding certain advanced divergences requiring reasoning).
The dataset is stratified as follows:
| Subset | Videos | Instructions | Success Feedback | Mistake Feedback | Duration (hours) |
|---|---|---|---|---|---|
| Main Set (train/val/test) | 213 | 2913 | 2394 | 686 | 52.4 |
| Advanced Planning Set | 209 | 2888 | – | – | 51.5 |
The Advanced Planning Set introduces order errors and missing-step divergences, requiring recipe graphs to be dynamically replanned using Kahn’s algorithm.
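Kahn's algorithm here amounts to a topological sort over the not-yet-completed steps of the recipe graph. The following is a minimal sketch of how such re-planning could be implemented; the graph representation and function names are illustrative assumptions, not the benchmark's actual code.

```python
from collections import deque

def replan_remaining_steps(steps, depends_on, completed):
    """Topologically order the not-yet-completed recipe steps (Kahn's algorithm).

    steps       -- list of step ids in the recipe graph
    depends_on  -- dict: step id -> set of prerequisite step ids
    completed   -- set of step ids already observed as done (possibly out of order)
    """
    remaining = [s for s in steps if s not in completed]
    # In-degree counts only prerequisites that are themselves still pending.
    indegree = {s: sum(1 for p in depends_on.get(s, set()) if p in remaining)
                for s in remaining}
    queue = deque(s for s in remaining if indegree[s] == 0)

    order = []
    while queue:
        step = queue.popleft()
        order.append(step)
        for s in remaining:
            if step in depends_on.get(s, set()):
                indegree[s] -= 1
                if indegree[s] == 0:
                    queue.append(s)
    return order  # next instruction to issue is order[0], if any
```

Out-of-order completions simply move their steps into `completed`, while missing-step divergences surface as pending steps whose prerequisites never complete, which the re-planner can then reinsert or skip.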
Evaluation metrics include:
- Instruction Completion Accuracy (IC-Acc): Correctly detected completions within a ±30 s window (a minimal matching sketch follows this list).
- Mistake Detection (Precision, Recall, F1): True/false positives and negatives for mistake alerts within a temporal window.
- Mistake Feedback Fluency: BERTScore and ROUGE-L on generated feedback for true positive detections.
- Turn-based vs. Streaming Evaluation: Assessing isolated steps (turn-based) versus continuous live feedback (streaming).
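To make the temporal matching concrete, the sketch below scores predicted completion or alert timestamps against ground-truth event times within a tolerance window. The greedy one-to-one matching is an illustrative assumption; the benchmark's exact matching protocol is not reproduced here.

```python
def match_events(pred_times, gt_times, window=30.0):
    """Greedy one-to-one matching of predicted to ground-truth event times.

    Returns (true_positives, false_positives, false_negatives); IC-Acc and
    mistake-detection precision/recall/F1 can both be derived from these counts.
    """
    unmatched_gt = sorted(gt_times)
    tp = 0
    for t in sorted(pred_times):
        hit = next((g for g in unmatched_gt if abs(t - g) <= window), None)
        if hit is not None:
            unmatched_gt.remove(hit)
            tp += 1
    fp = len(pred_times) - tp
    fn = len(unmatched_gt)
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```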
3. LiveMamba Architecture and Streaming Pipeline
LiveMamba integrates state-of-the-art vision and language modules under a streaming control loop:
- Vision Encoder: InternViT-300M processes each 448 px frame at 2 fps, producing approximately 1025 patch tokens (see the token-budget sketch after this list). A Q-Former adapter with 4 cross-attention layers reduces this to 32 visual tokens.
- Language Backbone: Mamba-130M, a linear-time recurrent LLM, maintains hidden state and supports long-range context updates.
- Controller Protocol: Special control tokens, one to ingest new frames and one to emit textual output, govern when the model should process input or deliver feedback.
- External Re-planner: Qwen3-32B is used only in Advanced Planning Set cases to handle user divergence from the canonical task plan.
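The per-frame token budget works out as follows under the assumption of a standard 14×14 patch size for InternViT at 448 px input; this arithmetic is a plausible reconstruction rather than a figure taken from the paper.

```python
# Per-frame visual token budget (14x14 patch size is an assumption here).
image_size, patch_size = 448, 14
patches = (image_size // patch_size) ** 2        # 32 * 32 = 1024 patch tokens
patch_tokens = patches + 1                       # + class token -> ~1025 tokens per frame
qformer_tokens = 32                              # Q-Former queries per frame

fps = 2
tokens_per_minute = qformer_tokens * fps * 60    # 3840 visual tokens/min fed to the Mamba backbone
print(patch_tokens, qformer_tokens, tokens_per_minute)   # 1025 32 3840
```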
At each time step $t$, the streaming pipeline ingests the 32 visual tokens for the current frame and updates the recurrent hidden state, carrying $h_{t-1}$ forward into $h_t$. When the model emits the output control token at step $t$, an instruction or feedback message is decoded, and $h_t$ is retained across steps as cumulative memory.
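The control flow implied by this protocol can be sketched as a simple streaming loop. Everything below, including the placeholder `INGEST`/`EMIT` token names and the encoder and language-model interfaces, is assumed for illustration rather than taken from the released implementation.

```python
INGEST, EMIT = "<ingest>", "<emit>"   # placeholder names for the two control tokens

def streaming_loop(frames, vision_encoder, qformer, lm, state=None):
    """Process a 2 fps frame stream and yield (frame_index, text) feedback events."""
    for t, frame in enumerate(frames):
        patch_tokens = vision_encoder(frame)            # ~1025 patch tokens per 448 px frame
        visual_tokens = qformer(patch_tokens)           # compressed to 32 visual tokens
        state, control = lm.step(visual_tokens, state)  # linear-time recurrent state update
        if control == EMIT:
            text, state = lm.decode_text(state)         # instruction, success feedback, or mistake alert
            yield t, text
        # if control == INGEST the model stays silent and waits for the next frame
```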
Training objectives combine cross-entropy over next-token prediction, implicit alignment of emissions to event timestamps, and losses for mistake detection, with augmentations (IC-Augment, CF-Augment) ensuring coverage of both correct and mistake events.
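A hedged sketch of how such a combined objective could be assembled is shown below. The separate mistake-detection head and the weighting scheme are assumptions; the paper's exact formulation is not reproduced here, and the timestamp alignment is treated as implicit in how the target sequences are constructed.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, mistake_logits, mistake_labels, lambda_mistake=1.0):
    """Next-token cross-entropy plus an assumed auxiliary mistake-detection term.

    Timestamp alignment is implicit in the data: target sequences place control and
    output tokens at the annotated event frames, so the next-token loss also
    supervises *when* to speak, not just what to say.
    """
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len)
    lm_loss = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
    # mistake_logits: (batch, num_steps, 2); mistake_labels: (batch, num_steps)
    mistake_loss = F.cross_entropy(mistake_logits.transpose(1, 2), mistake_labels)
    return lm_loss + lambda_mistake * mistake_loss
```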
4. Real-Time Feedback and Adaptive Planning
LiveMamba is optimized for instantaneous feedback and mistake alerting. During fine-tuning, error events are temporally jittered (±30 s) to build robustness to annotation drift. At inference, there is no post-processing threshold: the model either emits an alert or remains silent, which naturally regulates false positives.
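The temporal jitter can be sketched as a simple augmentation over annotated event times; the uniform sampling and field names below are illustrative assumptions.

```python
import random

def jitter_event_times(events, max_shift=30.0, video_duration=None):
    """Shift each annotated event timestamp by up to ±max_shift seconds.

    events is a list of dicts with a 'time' key (seconds); shifts are sampled
    uniformly and clamped to the video bounds so supervision stays valid.
    """
    jittered = []
    for ev in events:
        t = ev["time"] + random.uniform(-max_shift, max_shift)
        if video_duration is not None:
            t = min(max(t, 0.0), video_duration)
        jittered.append({**ev, "time": t})
    return jittered
```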
When a divergence is detected (e.g., the user fails to follow an instruction), LiveMamba invokes the external Qwen3-32B re-planner (a sketch of this decision flow follows the list) to:
- Retrieve and identify the actual performed step.
- Assess whether to repeat or skip the previous instruction.
- If skipping, adjust the plan graph accordingly before issuing the next step.
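Putting the pieces together, the decision flow could look roughly like the following. The `replanner` interface and the `recipe` fields are hypothetical, and `replan_remaining_steps` refers to the Kahn-style sketch earlier in this article.

```python
def handle_divergence(current_instruction, recent_frames, recipe, completed, replanner):
    """Decide whether to repeat the current instruction or skip ahead after a divergence."""
    # 1. Ask the external re-planner (e.g., an LLM) which step was actually performed.
    observed_step = replanner.identify_performed_step(recent_frames, recipe.steps)

    if observed_step is None or observed_step == current_instruction:
        # Nothing recognizable happened, or the user is still on the same step: repeat it.
        return current_instruction

    # 2. The user performed a different step: record it and re-plan the remaining graph.
    completed.add(observed_step)
    new_order = replan_remaining_steps(recipe.steps, recipe.depends_on, completed)
    return new_order[0] if new_order else None  # next instruction to issue, or done
```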
Latency assessments on an NVIDIA H100 show that LiveMamba runs at 8.1 fps on 2 fps video, with a first-token latency of 1.1 s (re-planning adds 6.1 s per divergence).
5. Experimental Evaluation and Results
Quantitative and qualitative results establish LiveMamba as a new baseline for live interactive task guidance:
- Zero-shot Baselines: State-of-the-art video-LLMs (e.g., Gemini-2.5-Flash) achieve only 23.1% IC-Acc and near-zero mistake detection F1, underscoring the limitations of turn-based approaches.
- Fine-tuned LiveMamba: On the Main Set, fine-tuned LiveMamba yields 31.5% IC-Acc and F1=0.13, more than tripling the baseline accuracy. Ablation studies show that removing IC-Augment cuts IC-Acc by over half (to 14.3%), and omitting CF-Augment reduces F1 to 0.05. On the Advanced Planning Set, adding re-planning increases IC-Acc from 10.9% to 12.6% and F1 to 0.19.
- Turn-based Evaluation: With error propagation eliminated, IC-Acc rises to 51.0% and F1 to 0.19.
- Qualitative Performance: LiveMamba demonstrates timely recognition of correct and erroneous actions; for example, triggering heating instructions upon successful slicing, or immediately flagging an incorrect cup size in an ingredient transfer.
6. Limitations and Future Directions
LiveMamba’s current scope and performance are subject to several limitations:
- Evaluation is restricted to egocentric cooking; transfer to other procedural domains is not yet tested.
- Subtle or compound errors, particularly those involving planning divergences, remain challenging.
- The need for large external re-planners introduces latency.
Planned advancements include:
- Applying LiveMamba in procedural domains beyond cooking.
- Integrating re-planning within the main model for lower latency.
- Incorporating audio or speech inputs for richer user interaction.
- Expanding the mistake taxonomy to cover order errors and missing steps with in-model plan adjustment.
- Enhancing computational efficiency through better vision–language fusion or knowledge distillation.
LiveMamba thus represents a substantive advance in live, adaptive multi-modal AI coaching, combining streaming architectures, data augmentation, and adaptive planning to move multi-modal LLMs beyond conventional turn-based interaction toward real-time situated task guidance (Bhattacharyya et al., 27 Nov 2025).