Video-SALMONN S: Streaming Audio-Visual LLM

Updated 14 October 2025
  • Video-SALMONN S is a streaming audio-visual large language model that efficiently processes extended high-resolution video streams using adaptive test-time training memory.
  • It integrates a Hessian-free conjugate-gradient optimizer to update fast weights in real-time, preserving long-range dependencies from multi-hour videos.
  • The model achieves state-of-the-art performance on long-video benchmarks, showing 10–20% improvements over traditional offline and streaming methods.

video-SALMONN S is a streaming audio-visual LLM designed to process extended video streams under a fixed memory budget, with particular emphasis on high-frame-rate, high-resolution, multi-hour video understanding. Its architecture uniquely combines a test-time-training (TTT) memory module and a prompt-dependent memory reader, which enable the model to preserve long-range dependencies and efficiently retrieve context-relevant information from memory. This model achieves state-of-the-art performance in long-video benchmarks, demonstrating substantial improvements over conventional offline and streaming approaches.

1. Architectural Foundations

The architecture of video-SALMONN S centers on two key components:

  • Test-Time-Training (TTT) Memory Module: Rather than relying on token merging or discarding, TTT continually updates token representations by adapting fast weights at inference time. For each incoming video frame, encoded visual and audio tokens are processed through an MLP parameterized by a memory state W_t (the "fast weights"). The adaptation is conducted using a Hessian-free conjugate-gradient optimizer (TTT_HF), which efficiently updates W_t based on a token reconstruction loss:

\mathcal{L}(X_t; W_{t-1}) = \lVert f(\theta_K X_t; W_{t-1}) - \theta_V X_t \rVert_2

Here, f(·; W) is an MLP with residual connections and layer normalization, and θ_K and θ_V are learned projections. The parameter update ΔW_t^HF solves

B \Delta W_t^{HF} = -\eta_t \nabla_W \mathcal{L}(X_t; W_{t-1}),

where B carries second-order (curvature) information, so each fast-weight step is an approximate second-order update rather than plain gradient descent (a minimal sketch of this update appears after this list).

  • Prompt-Dependent Memory Reader: At decoding, rather than uniformly attending to the entire memory store, the model retrieves context-relevant tokens by computing attention scores between the prompt and stored key-value pairs. The mechanism selects the top K′ tokens per layer, enabling efficient question answering over a fixed-size yet large memory (up to 128k tokens), which is critical for multi-hour video analytics.
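
As a concrete illustration of the first component, the following is a minimal PyTorch sketch of a TTT_HF-style fast-weight update. It is not the paper's implementation: the toy residual MLP, the use of the exact loss Hessian (applied via Hessian-vector products) as the curvature matrix B, the damping constant, the learning rate, and all tensor shapes are illustrative assumptions.

```python
import torch


def reconstruction_loss(W, X, theta_K, theta_V):
    """L(X; W) = || f(theta_K X; W) - theta_V X ||_2 with f a toy residual MLP + LayerNorm."""
    h = X @ theta_K                                    # key-projected frame tokens
    f = h + torch.tanh(h @ W)                          # residual MLP parameterized by fast weights W
    f = torch.nn.functional.layer_norm(f, f.shape[-1:])
    return torch.linalg.vector_norm(f - X @ theta_V)   # reconstruction error against the value projection


def cg_solve(bvp, rhs, iters=10, damping=1e-3):
    """Solve (B + damping*I) x = rhs by conjugate gradient, using only B-vector products."""
    x = torch.zeros_like(rhs)
    r = rhs.clone()
    p = r.clone()
    rs = r.dot(r)
    for _ in range(iters):
        Bp = bvp(p) + damping * p
        alpha = rs / (p.dot(Bp) + 1e-12)
        x = x + alpha * p
        r = r - alpha * Bp
        rs_new = r.dot(r)
        p = r + (rs_new / (rs + 1e-12)) * p
        rs = rs_new
    return x


def ttt_hf_step(W, X, theta_K, theta_V, lr=0.1):
    """One fast-weight update: solve B * dW = -lr * grad L, with B applied only via Hessian-vector products."""
    W = W.detach().requires_grad_(True)
    loss = reconstruction_loss(W, X, theta_K, theta_V)
    (grad,) = torch.autograd.grad(loss, W, create_graph=True)

    def hvp(v):                                        # Hessian-vector product; B is never materialized
        (hv,) = torch.autograd.grad(grad, W, grad_outputs=v.reshape(W.shape), retain_graph=True)
        return hv.reshape(-1)

    delta = cg_solve(hvp, -lr * grad.detach().reshape(-1))
    return (W + delta.reshape(W.shape)).detach()


# Toy usage: one frame's tokens update the fast weights (random tokens stand in for encoder outputs).
d = 16
X_t = torch.randn(8, d)
theta_K, theta_V = torch.randn(d, d) / d**0.5, torch.randn(d, d) / d**0.5
W_t = ttt_hf_step(torch.zeros(d, d), X_t, theta_K, theta_V)
```

Conjugate gradient only ever needs B applied to a vector, so the curvature matrix is never formed explicitly; this is what makes a second-order fast-weight update affordable on every incoming frame.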

2. Memory Management and Streaming

To enable continuous processing of long video streams, video-SALMONN S maintains a fixed-size memory:

  • Token Discarding and Combination: For every new frame, output tokens Z_t from the TTT_HF layer are appended to the memory. Cosine similarity is then computed between adjacent tokens, and the K most similar tokens are removed so that at most N tokens are retained, keeping the memory bounded across arbitrarily long sequences (see the sketch after this list).
  • Information Retention: Critically, even when tokens are discarded (e.g., due to high similarity), the dynamic memory W_t has already absorbed information from these tokens through the TTT_HF adaptation. This mitigates the over-smoothing effect of token merging in conventional models, allowing the memory to accumulate long-term contextual signals.
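
A minimal sketch of the fixed-budget memory update follows. It assumes a single flat token memory per stream; the function name, shapes, and the choice to drop (rather than merge) the most redundant tokens are illustrative simplifications.

```python
import torch
import torch.nn.functional as F


def update_token_memory(memory: torch.Tensor, z_t: torch.Tensor, max_tokens: int) -> torch.Tensor:
    """memory: (M, d) retained tokens in stream order; z_t: (m, d) tokens from the newest frame."""
    memory = torch.cat([memory, z_t], dim=0)
    overflow = memory.shape[0] - max_tokens             # K = number of tokens to discard this step
    if overflow <= 0:
        return memory
    # Cosine similarity between each token and its predecessor in the stream.
    sim = F.cosine_similarity(memory[1:], memory[:-1], dim=-1)
    drop = torch.topk(sim, k=overflow).indices + 1       # the K tokens most similar to their neighbour
    keep = torch.ones(memory.shape[0], dtype=torch.bool)
    keep[drop] = False
    return memory[keep]                                   # at most max_tokens rows survive


# Toy usage with random tokens standing in for TTT_HF outputs Z_t.
memory = torch.empty(0, 16)
for _ in range(100):
    memory = update_token_memory(memory, torch.randn(32, 16), max_tokens=512)
print(memory.shape)                                       # torch.Size([512, 16])
```

A fuller implementation might merge near-duplicate tokens instead of dropping them; the discard-only version keeps the sketch short and relies on W_t having already absorbed the dropped content.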

Prompt-dependent reading further reduces computational cost by limiting attention to only those memory tokens most relevant to the user query.
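
Below is a minimal sketch of such a prompt-dependent read for a single layer. Shapes, names, and the use of a max over prompt positions to aggregate relevance are assumptions, not the paper's exact mechanism.

```python
import torch


def read_memory(prompt_q: torch.Tensor, mem_k: torch.Tensor, mem_v: torch.Tensor, k_prime: int):
    """prompt_q: (q, d) prompt query states; mem_k, mem_v: (M, d) cached keys/values for one layer."""
    scores = (prompt_q @ mem_k.T) / mem_k.shape[-1] ** 0.5   # (q, M) scaled dot-product scores
    relevance = scores.max(dim=0).values                     # best score each memory token receives
    top = torch.topk(relevance, k=min(k_prime, mem_k.shape[0])).indices
    return mem_k[top], mem_v[top]                            # only these K' tokens are attended to


# Toy usage: retrieve 64 of 4096 cached tokens for a 12-token prompt.
q, M, d = 12, 4096, 16
k_sel, v_sel = read_memory(torch.randn(q, d), torch.randn(M, d), torch.randn(M, d), k_prime=64)
print(k_sel.shape, v_sel.shape)                              # torch.Size([64, 16]) for both
```

In the full model this selection would run independently at every decoder layer before standard cross-attention, so decoding cost scales with K′ rather than with the total memory size.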

3. Benchmark Performance

video-SALMONN S outperforms both offline and contemporary streaming methods on established long-video understanding benchmarks:

| Benchmark             | Model Size | Overall Accuracy | Long-Split Accuracy |
|-----------------------|------------|------------------|---------------------|
| Video-MME             | 8B         | 74.2%            | 67.8%               |
| LVBench, VideoEvalPro | 8B         | –                | –                   |

The model sustains high-quality performance over streams with up to 10,000 frames and more than 1 million tokens, whereas traditional models must adapt frame rates or restrict analysis to short clips, resulting in substantial information loss for longer videos. On Video-MME, the long split (requiring memory of video content over several hours) reveals that video-SALMONN S provides a 10–20% absolute improvement versus previous offline or streaming baselines.

4. Applications

The streaming and memory-efficient design of video-SALMONN S enables several high-impact applications:

  • Video Surveillance: Continuous anomaly detection and event summarization in multi-hour security footage.
  • Media Analysis: Real-time captioning, summarization, and retrieval in long-form content, including movies and live events.
  • Education: Automated lecture transcript generation and interactive content synthesis from extended video records.
  • Autonomous Agents: Process monitoring and log analysis for systems requiring awareness over many hours, such as autonomous vehicles or manufacturing lines.

Efficient long-term context retention mitigates the Transformer context-size bottleneck, supporting deployment in both resource-rich and memory-constrained environments.

5. Technical and Optimization Advances

The TTT_HF memory module, using Hessian-free conjugate-gradient adaptation, offers both efficiency and fidelity in updating fast weights under streaming conditions. The prompt-dependent retrieval mechanism permits context-sensitive inference without the need to process all stored tokens, reducing inference latency and resource requirements even as the underlying memory grows.

The combined strategy of similarity-based token discarding and continual fast-weight updating keeps the model scalable to high-frame-rate, high-resolution video of nearly unbounded duration, as the toy loop below illustrates.
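
To show how the pieces compose during streaming, here is a toy end-to-end loop under a fixed token budget. The fast-weight update is deliberately simplified to a first-order gradient step (standing in for the Hessian-free solve sketched in Section 1), the encoder is replaced by random tokens, and every size, rate, and constant is an assumption.

```python
import torch
import torch.nn.functional as F

d, budget, lr = 16, 256, 0.05
theta_K, theta_V = torch.randn(d, d) / d**0.5, torch.randn(d, d) / d**0.5
W = torch.zeros(d, d, requires_grad=True)                 # fast weights, adapted at test time
memory = torch.empty(0, d)                                # fixed-budget token memory

for frame in range(100):                                  # one iteration per incoming frame
    x_t = torch.randn(32, d)                              # stand-in for encoded audio-visual tokens
    h = x_t @ theta_K
    z_t = F.layer_norm(h + torch.tanh(h @ W), (d,))       # tokens produced by the fast-weight MLP
    loss = torch.linalg.vector_norm(z_t - x_t @ theta_V)  # same reconstruction loss as in Section 1
    (g,) = torch.autograd.grad(loss, W)
    with torch.no_grad():
        W -= lr * g                                       # first-order stand-in for the TTT_HF solve
    memory = torch.cat([memory, z_t.detach()], dim=0)
    if memory.shape[0] > budget:                          # prune the most redundant adjacent tokens
        sim = F.cosine_similarity(memory[1:], memory[:-1], dim=-1)
        drop = torch.topk(sim, memory.shape[0] - budget).indices + 1
        keep = torch.ones(memory.shape[0], dtype=torch.bool)
        keep[drop] = False
        memory = memory[keep]

print(memory.shape)                                       # stays at (budget, d) regardless of stream length
```

The point of the loop is that both the fast weights and the token memory have constant size, so per-frame compute and memory do not grow with the length of the stream.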

6. Future Perspectives

Further research directions for video-SALMONN S include:

  • Extending to even higher resolutions and frame rates while maintaining fixed memory budgets.
  • Refining the TTT_HF update mechanism, potentially integrating advanced second-order optimizers or hybrid memory architectures.
  • Developing enhanced prompt-dependent retrieval algorithms, and incorporating reinforcement learning for context selection.
  • Adapting the framework to new domains and improving real-time robustness under shifting streaming video distributions.

A plausible implication is that these advances may allow streaming audio-visual LLMs to be deployed for continuous, interactive video analytics and decision support in a broad range of fields from surveillance to education, fundamentally altering expectations for context length and memory limits in foundation video models.

7. Significance and Impact

video-SALMONN S demonstrates that combining adaptive test-time memory and prompt-sensitive retrieval within a streaming architecture enables sustained, high-quality video understanding over extended durations. Its results on benchmarks establish new performance baselines for long-stream video analytics, highlighting the feasibility of fixed-memory processing for practical AI agents and laying conceptual groundwork for future research into streaming, multi-modal LLMs (Sun et al., 13 Oct 2025).
