video-SALMONN S: Streaming Audio-Visual LLM
- video-SALMONN S is a streaming audio-visual large language model that efficiently processes extended high-resolution video streams using an adaptive test-time-training memory.
- It integrates a Hessian-free conjugate-gradient optimizer to update fast weights in real time, preserving long-range dependencies across multi-hour videos.
- The model achieves state-of-the-art performance on long-video benchmarks, showing 10–20% absolute improvements over conventional offline and streaming methods.
video-SALMONN S is a streaming audio-visual LLM designed to process extended video streams under a fixed memory budget, with particular emphasis on high-frame-rate, high-resolution, multi-hour video understanding. Its architecture combines a test-time-training (TTT) memory module with a prompt-dependent memory reader, which together preserve long-range dependencies and efficiently retrieve context-relevant information from memory. The model achieves state-of-the-art performance on long-video benchmarks, demonstrating substantial improvements over conventional offline and streaming approaches.
1. Architectural Foundations
The architecture of video-SALMONN S centers on two key components:
- Test-Time-Training (TTT) Memory Module: Rather than relying on token merging or discarding, TTT continually updates token representations by adapting fast weights at inference time. For each incoming video frame, encoded visual and audio tokens are processed through an MLP parameterized by a memory state ("fast weights"). The adaptation uses a Hessian-free conjugate-gradient optimizer (TTT_HF), which updates the fast weights against a token reconstruction loss (a minimal sketch follows this list):
$$\mathcal{L}(W; x_t) = \left\| f_W(\theta_K x_t) - \theta_V x_t \right\|^2$$
Here, $f_W$ is an MLP with residual connections and layer normalization; $\theta_K$ and $\theta_V$ are learned projections. The parameter update solves
$$W_t = \arg\min_W \mathcal{L}(W; x_t),$$
minimizing the loss with second-order (curvature) information via conjugate-gradient iterations.
- Prompt-Dependent Memory Reader: At decoding, rather than uniformly attending to the entire memory store, the model retrieves context-relevant tokens by computing attention scores between the prompt and stored key-value pairs. The mechanism selects the top-scoring tokens per layer, enabling efficient question answering over a fixed-size yet large memory (up to 128k tokens), which is critical for multi-hour video analytics.
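The following PyTorch sketch illustrates one TTT_HF fast-weight update under stated assumptions: the MLP shape, the MSE form of the reconstruction loss, and the damping and step-count values are illustrative choices, not the paper's settings. Hessian-vector products are computed by double backpropagation and fed into a standard conjugate-gradient solve.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastWeightMLP(nn.Module):
    """Fast-weight memory: an MLP with residual connection and layer norm."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.fc2(F.gelu(self.fc1(self.norm(x))))

def flat_grad(loss, params, create_graph=False, retain_graph=None):
    grads = torch.autograd.grad(loss, params,
                                create_graph=create_graph,
                                retain_graph=retain_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def ttt_hf_update(mlp, theta_k, theta_v, x, cg_steps=8, damping=1e-3):
    """One streaming TTT_HF step: solve (H + damping*I) d = -g with
    conjugate gradient, where g and H are the gradient and Hessian of the
    token reconstruction loss, then apply d to the fast weights in place."""
    params = [p for p in mlp.parameters() if p.requires_grad]
    loss = F.mse_loss(mlp(x @ theta_k), x @ theta_v)  # reconstruction loss
    g = flat_grad(loss, params, create_graph=True)    # keep graph for HVPs

    def hvp(v):
        # Hessian-vector product via double backprop (Pearlmutter trick).
        return flat_grad(g @ v, params, retain_graph=True)

    d = torch.zeros_like(g)   # CG solution, starts at zero
    r = -g.detach()           # residual
    p = r.clone()             # search direction
    rs = r @ r
    for _ in range(cg_steps):
        Hp = hvp(p) + damping * p
        alpha = rs / (p @ Hp + 1e-12)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if rs_new < 1e-10:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new

    with torch.no_grad():     # apply the fast-weight update in place
        offset = 0
        for prm in params:
            n = prm.numel()
            prm.add_(d[offset:offset + n].view_as(prm))
            offset += n
    return loss.item()        # pre-update reconstruction loss
```

A hypothetical call, with 256 tokens per frame and model width 512 (both illustrative):

```python
dim = 512
mlp = FastWeightMLP(dim, 2 * dim)
theta_k = torch.randn(dim, dim) / dim ** 0.5  # stand-ins for learned projections
theta_v = torch.randn(dim, dim) / dim ** 0.5
frame_tokens = torch.randn(256, dim)          # encoded audio-visual tokens
ttt_hf_update(mlp, theta_k, theta_v, frame_tokens)
```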
2. Memory Management and Streaming
To enable continuous processing of long video streams, video-SALMONN S maintains a fixed-size memory:
- Token Discarding and Combination: For every new frame, output tokens from the TTT_HF layer are appended to the memory. Cosine similarity is computed between adjacent tokens, and tokens from the most similar pairs are removed until the fixed token budget is met, maintaining memory constraints across arbitrarily long sequences (see the sketch after this list).
- Information Retention: Critically, even when tokens are discarded (e.g., due to high similarity), the dynamic memory has already absorbed information from these tokens through the TTT_HF adaptation. This mitigates the over-smoothing effect of token merging in conventional models, allowing the memory to accumulate long-term contextual signals.
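A minimal sketch of the budget-enforcement step, under the assumption that the later token of each most-similar adjacent pair is dropped and that removals are batched in one pass; the real system may merge rather than purely drop tokens, and a greedy one-pair-at-a-time variant would be the exact counterpart.

```python
import torch
import torch.nn.functional as F

def enforce_memory_budget(memory, budget):
    """Trim memory to `budget` tokens by dropping the later token of the
    most-similar adjacent pairs; their content has already been absorbed
    into the fast weights by the TTT_HF update, so little is lost.

    memory: (N, d) tokens after appending the newest frame's outputs
    """
    n_drop = memory.shape[0] - budget
    if n_drop <= 0:
        return memory
    # Similarity of each token to its right neighbor: shape (N - 1,).
    sims = F.cosine_similarity(memory[:-1], memory[1:], dim=-1)
    # Drop the second token of the n_drop most redundant pairs in one
    # pass; topk indices are distinct, so exactly n_drop tokens go.
    drop = torch.topk(sims, n_drop).indices + 1
    keep = torch.ones(memory.shape[0], dtype=torch.bool)
    keep[drop] = False
    return memory[keep]
```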
Prompt-dependent reading further reduces computational cost by limiting attention to only those memory tokens most relevant to the user query.
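A sketch of that selection step for a single layer and head. The max-over-prompt-positions relevance score and the per-layer budget `k` are assumptions for illustration; the paper's exact scoring rule is not reproduced here.

```python
import torch
import torch.nn.functional as F

def prompt_dependent_read(prompt_q, mem_k, mem_v, k=1024):
    """Attend only to the k memory tokens most relevant to the prompt,
    rather than to the whole memory store.

    prompt_q: (P, d) prompt query vectors for one layer
    mem_k, mem_v: (M, d) stored key/value pairs (M may be up to ~128k)
    """
    d = prompt_q.shape[-1]
    scores = prompt_q @ mem_k.T / d ** 0.5    # (P, M) prompt-memory scores
    # Rank each memory token by its best score against any prompt position.
    relevance = scores.max(dim=0).values      # (M,)
    k = min(k, mem_k.shape[0])
    idx = torch.topk(relevance, k).indices    # retained token indices
    attn = F.softmax(scores[:, idx], dim=-1)  # attention over the selection
    return attn @ mem_v[idx], idx             # (P, d) read-out
```

Because only k of the M stored tokens enter the softmax and value mix, decoding cost scales with k rather than with the full memory size.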
3. Benchmark Performance
video-SALMONN S outperforms both offline and contemporary streaming methods on established long-video understanding benchmarks:
| Benchmark | Model Size | Overall Accuracy | Long Split Accuracy |
|---|---|---|---|
| Video-MME | 8B | 74.2% | 67.8% |
| LVBench, VideoEvalPro | 8B | - | - |
The model sustains high-quality performance over streams of up to 10,000 frames and more than 1 million tokens (on the order of 100 tokens per frame), whereas traditional models must reduce frame rates or restrict analysis to short clips, incurring substantial information loss on longer videos. On Video-MME, the long split (which requires recalling video content across long durations) shows that video-SALMONN S provides a 10–20% absolute improvement over previous offline and streaming baselines.
4. Applications
The streaming and memory-efficient design of video-SALMONN S enables several high-impact applications:
- Video Surveillance: Continuous anomaly detection and event summarization in multi-hour security footage.
- Media Analysis: Real-time captioning, summarization, and retrieval in long-form content, including movies and live events.
- Education: Automated lecture transcript generation and interactive content synthesis from extended video records.
- Autonomous Agents: Process monitoring and log analysis for systems requiring awareness over many hours, such as autonomous vehicles or manufacturing lines.
Efficient long-term context retention mitigates the Transformer context-size bottleneck, supporting deployment in both resource-rich and memory-constrained environments.
5. Technical and Optimization Advances
The TTT_HF memory module, using Hessian-free conjugate-gradient adaptation, offers both efficiency and fidelity in updating fast weights under streaming conditions. The prompt-dependent retrieval mechanism permits context-sensitive inference without the need to process all stored tokens, reducing inference latency and resource requirements even as the underlying memory grows.
The combined strategy of similarity-based discarding and continual fast-weight updating ensures that the model scales to high-frame-rate, high-resolution video of nearly unbounded duration; the sketch below ties the pieces together.
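A hypothetical glue loop combining the sketches above, reusing `FastWeightMLP`, `ttt_hf_update`, `enforce_memory_budget`, and `prompt_dependent_read` as defined earlier; all shapes, the stream length, and the reuse of memory tokens as both keys and values are simplifying assumptions.

```python
import torch

dim, budget = 512, 4096
mlp = FastWeightMLP(dim, 2 * dim)               # fast-weight memory (Section 1)
theta_k = torch.randn(dim, dim) / dim ** 0.5    # stand-in learned projections
theta_v = torch.randn(dim, dim) / dim ** 0.5
memory = torch.empty(0, dim)                    # fixed-budget token store

for _ in range(100):                            # stand-in for a video stream
    frame = torch.randn(256, dim)               # encoder output for one frame
    ttt_hf_update(mlp, theta_k, theta_v, frame) # absorb frame into fast weights
    with torch.no_grad():                       # append TTT_HF layer outputs
        memory = torch.cat([memory, mlp(frame @ theta_k)], dim=0)
    memory = enforce_memory_budget(memory, budget)  # keep the memory bounded

prompt_q = torch.randn(16, dim)                 # encoded user prompt
# Simplification: the same tokens serve as keys and values here.
readout, _ = prompt_dependent_read(prompt_q, memory, memory, k=1024)
```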
6. Future Perspectives
Further research directions for video-SALMONN S include:
- Extending to even higher resolutions and frame rates while maintaining fixed memory budgets.
- Refining the TTT_HF update mechanism, potentially integrating advanced second-order optimizers or hybrid memory architectures.
- Developing enhanced prompt-dependent retrieval algorithms, and incorporating reinforcement learning for context selection.
- Extending the framework to domain adaptation and real-time robustness under shifting streaming-video distributions.
A plausible implication is that these advances may allow streaming audio-visual LLMs to be deployed for continuous, interactive video analytics and decision support in fields ranging from surveillance to education, fundamentally altering expectations for context length and memory limits in foundation video models.
7. Significance and Impact
video-SALMONN S demonstrates that combining adaptive test-time memory and prompt-sensitive retrieval within a streaming architecture enables sustained, high-quality video understanding over extended durations. Its results on benchmarks establish new performance baselines for long-stream video analytics, highlighting the feasibility of fixed-memory processing for practical AI agents and laying conceptual groundwork for future research into streaming, multi-modal LLMs (Sun et al., 13 Oct 2025).