
VideoLLaMB: Long-Context Video Understanding

Updated 2 June 2026
  • The paper introduces a recurrent memory bridge mechanism that preserves semantic continuity across extensive video contexts.
  • It leverages a frozen ViT-L/14 encoder and a Vicuna-7B language model with minimal parameter tuning to maintain linear GPU memory scaling.
  • Empirical results on VideoQA, planning, and retrieval benchmarks demonstrate significant performance improvements over baseline models.

VideoLLaMB (Long-context Video Understanding with Recurrent Memory Bridges) is a framework for efficient and semantically robust video-language modeling designed to process long video sequences while maintaining tractable computational and memory requirements. It employs a recurrent memory mechanism within bridge layers and introduces a model-free video segmentation algorithm, SceneTilling, to preserve semantic continuity across extensive video contexts. VideoLLaMB demonstrates notable performance improvements on multiple video question answering (VideoQA), planning, and retrieval benchmarks, while maintaining a linear GPU memory scaling profile. The architecture leverages a frozen ViT-L/14 video encoder and a Vicuna-7B-v1.5 LLM with modifications localized to a small set of bridge projection parameters (Wang et al., 2024).

1. Framework Architecture

VideoLLaMB’s architecture consists of modular components configured for minimal computational footprint and maximal extensibility:

  • Video Encoder: Utilizes a frozen ViT-L/14 model to extract per-frame embeddings in $\mathbb{R}^{1024}$.
  • Language Backbone: Vicuna-7B-v1.5, context window 2048 tokens, with all parameters frozen except for the bridge projection.
  • Memory Bridge Layers: Transformer-based modules that take as input a fixed set of temporal memory tokens and segment-level vision tokens, outputting both an updated set of memory tokens and a summary for downstream language reasoning.
  • SceneTilling Segmenter: Algorithm to partition video into $K$ temporally coherent, semantically distinct segments.
  • Memory Cache & Retrieval: Cache maintaining all prior memory states for cross-attentive retrieval, mitigating vanishing gradients across extended contexts.

The architecture operates by sampling video frames, segmenting them using SceneTilling, extracting features for each segment, and iteratively updating recurrent memory tokens via bridge layers. The summarized output is projected for the LLM, which performs tasks such as question answering or caption generation.

2. Recurrent Memory Bridges and Computational Characteristics

The core of VideoLLaMB’s long-context processing is the recurrent memory bridge mechanism:

  • Memory Token Definition: At each time step $t$, $T=32$ memory tokens $M_t \in \mathbb{R}^{32\times 1024}$ summarize the video seen so far.
  • BridgeLayer Update: Operates over the concatenated memory and current-segment tokens,

$$[M_t, O_t] = \mathrm{BridgeLayer}([M_{t-1}; S_t])$$

where $S_t$ contains $C$ frame embeddings. $M_t$ is then re-contextualized via cross-attention over a cache $\mathcal{M}_{t-1}$ of all previous memory tokens, with $M_t$ supplying the queries and the cached tokens supplying the keys and values.

This operation enables selective retrieval from extended history while preventing loss of earlier information.

  • Computational Complexity: Bridge attention is restricted to the $T + C$ memory and segment tokens of one segment at a time, so the per-segment cost is fixed and overall memory and compute grow linearly, $O(K)$, with the number of segments $K$ (more segments are used at test time than in training). This linear scaling is maintained in practice, enabling VideoLLaMB to process 320 frames on a single NVIDIA A100/A800 GPU with a modest increase in memory usage (from 11 GB for 16 frames to 24 GB for 320 frames).
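
The recurrence and cache retrieval described in this section can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the learned projections are replaced by identity maps, the initial memory is assumed to be all zeros, the retrieved memory is added back residually (an assumption), and the names `attend`, `bridge_layer`, and `process_video` are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Toy single-head attention; identity projections stand in for learned weights."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return attended

def bridge_layer(prev_memory, segment):
    """[M_t, O_t] = BridgeLayer([M_{t-1}; S_t]) as one self-attention pass."""
    tokens = prev_memory + segment      # concatenate along the token axis (T + C tokens)
    mixed = attend(tokens, tokens)      # attention never spans more than one segment
    t = len(prev_memory)
    return mixed[:t], mixed[t:]         # updated memory, segment summary

def process_video(segments, num_memory_tokens, dim):
    memory = [[0.0] * dim for _ in range(num_memory_tokens)]  # M_0 (assumed zeros)
    cache = []       # all previous memory tokens
    summaries = []
    for segment in segments:
        memory, summary = bridge_layer(memory, segment)
        if cache:
            # Retrieval: current memory queries the cache; the residual add is an assumption.
            retrieved = attend(memory, cache)
            memory = [[m + r for m, r in zip(m_row, r_row)]
                      for m_row, r_row in zip(memory, retrieved)]
        cache.extend(memory)
        summaries.append(summary)
    return memory, summaries, cache

# Three segments of two 4-dimensional "frame embeddings" each, with T = 2 memory tokens.
segments = [[[float(k + f + j) for j in range(4)] for f in range(2)] for k in range(3)]
memory, summaries, cache = process_video(segments, num_memory_tokens=2, dim=4)
print(len(memory), len(summaries), len(cache))  # → 2 3 6
```

Each call to `bridge_layer` sees only `T + C` tokens, so the work per segment is constant and the total grows with the number of segments. Only the cache grows with video length.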

3. SceneTilling Algorithm: Semantic Video Segmentation

SceneTilling is a model-free segmentation approach developed to preserve semantic structure and minimize information dilution during recurrent summarization:

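  • Algorithmic Procedure: For an input frame sequence with per-frame embeddings, pairwise cosine similarities between frame embeddings are computed. For each position in the sequence, a score is derived from these similarities that measures how pronounced the local drop (valley) in similarity is.

A segmentation threshold is set from the mean and standard deviation of these scores. Frames at valley indices whose score exceeds the threshold are used as split points, dividing the video into $K$ semantic segments.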

  • Integration with Memory: Each segment is independently processed by memory bridges, ensuring self-attention operations remain within semantically homogeneous regions, thereby optimizing recurrent memory efficiency and retaining intra-segment detail.
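
A minimal sketch of this kind of similarity-valley segmentation follows. It is modeled on TextTiling-style depth scoring rather than taken from the paper, so several details are assumptions: similarities are computed between adjacent frames only, the depth score compares each similarity with the highest similarity on either side, the threshold is the mean plus `alpha` times the standard deviation, and the function name `scene_tilling` and parameter `alpha` are hypothetical.

```python
import math
import statistics

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def scene_tilling(frame_embeddings, alpha=0.5):
    """Return lists of frame indices, one list per segment."""
    n = len(frame_embeddings)
    if n < 2:
        return [list(range(n))]
    # Similarity between each adjacent frame pair (i, i + 1).
    sims = [cosine(a, b) for a, b in zip(frame_embeddings, frame_embeddings[1:])]
    # Depth score: how far sims[i] sits below the highest similarity on each side.
    depths = []
    for i, c in enumerate(sims):
        left_peak = max(sims[: i + 1])
        right_peak = max(sims[i:])
        depths.append((left_peak + right_peak - 2 * c) / 2)
    # Threshold from the mean and standard deviation of the depth scores (assumed form).
    threshold = statistics.mean(depths) + alpha * statistics.pstdev(depths)
    boundaries = [i + 1 for i, d in enumerate(depths) if d > threshold]
    # Cut the frame index range at every boundary.
    segments, start = [], 0
    for b in boundaries:
        segments.append(list(range(start, b)))
        start = b
    segments.append(list(range(start, n)))
    return segments

# Two "scenes": three frames pointing one way, then three pointing another.
frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
print(scene_tilling(frames))  # → [[0, 1, 2], [3, 4, 5]]
```

Because the procedure needs only embeddings that the frozen encoder already produces, it adds no trainable parameters, which matches the "model-free" description above.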

4. Training Regime and Memory Scaling

VideoLLaMB is instruction-tuned on a public video dataset (∼1 million frames) split into 16-frame clips, with open QA pairs for supervision:

  • Training Configuration: Batch size 8, 1 epoch, 4×A800 GPUs; only the bridge parameters are updated, with cross-entropy loss over target text tokens.
  • Memory Efficiency: Owing to segment-localized attention, GPU memory usage increases linearly with the number of segments:
    • 16 frames: 11 GB
    • 32 frames: 13 GB
    • 64 frames: 16 GB
    • 320 frames: 24 GB
    • This contrasts with the quadratic scaling typical of global self-attention, enabling practical long-form reasoning on standard hardware.
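
The difference between segment-local and global attention can be illustrated by counting attention-score entries. The sketch below is back-of-the-envelope arithmetic under stated assumptions, not a measurement: one token per frame, 16 frames per segment (the training clip length), 32 memory tokens, and hypothetical function names. It counts score-matrix entries, not bytes of GPU memory.

```python
def segment_local_pairs(num_frames, frames_per_segment=16, memory_tokens=32):
    """Attention-score entries when each segment attends only to itself plus memory."""
    num_segments = -(-num_frames // frames_per_segment)  # ceiling division
    tokens_per_segment = memory_tokens + frames_per_segment
    return num_segments * tokens_per_segment ** 2

def global_pairs(num_frames):
    """Attention-score entries when every frame token attends to every other."""
    return num_frames ** 2

for n in (16, 32, 64, 320):
    print(n, segment_local_pairs(n), global_pairs(n))
# → 16 2304 256
# → 32 4608 1024
# → 64 9216 4096
# → 320 46080 102400
```

Doubling the frame count doubles the segment-local figure but quadruples the global one. For short clips the fixed memory-token overhead makes the segment-local count larger, and the advantage appears only as videos get long.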

5. Empirical Performance and Benchmarks

VideoLLaMB is compared against PLLaVA and other baselines across several evaluation protocols:

  • EgoSchema VideoQA: For 32-frame context, VideoLLaMB achieves 53.8% accuracy vs. PLLaVA's 43.8%, a 10.0-point advantage.
  • NExT-QA: Outperforms PLLaVA in temporal, causal, and overall (71.1 vs. 68.2) accuracy measures.
  • MVBench: Achieves 49.3 (PLLaVA-matched train data) and up to 52.5 (extended MVBench data), compared to 46.6 for the PLLaVA-7B baseline.
  • Egocentric Planning (EgoPlan): Outperforms baselines (32.32% vs. 30.26% for PLLaVA).
  • NIAVH (Needle in a Video Haystack): For frame retrieval in 320-second videos, VideoLLaMB averages a higher score than PLLaVA and maintains its scores even when the event of interest occurs near the video’s end.

Abridged results table:

| Model | VideoQA (EgoSchema, 32f) | NExT-QA (all) | MVBench | EgoPlan |
|---|---|---|---|---|
| PLLaVA (7B, 32f) | 43.8 | 68.2 | 46.6 | 30.26 |
| VideoLLaMB (7B, 32f) | 53.8 | 71.1 | 49.3–52.5 | 32.32 |

  • Streaming Captioning: SceneTilling enables unsupervised discovery of shot boundaries, allowing direct caption generation at segment transition points.
  • Qualitative Reasoning: In complex planning tasks (e.g., “open fridge → pour milk...”), VideoLLaMB maintains continuity, avoiding error patterns where baseline models rely only on first/final frames.

6. Analysis and Prospective Directions

Strengths of VideoLLaMB include:

  • Linear Memory Scaling: Processes up to 320 frames on a single GPU, facilitating long-form video reasoning previously inaccessible to large-scale models.
  • Semantic Segmentation: SceneTilling preserves intra-segment coherence, preventing irrelevant frames from corrupting summary.
  • Generalization: Demonstrates robust zero-shot performance across VideoQA, planning, and retrieval benchmarks.

Limitations are noted:

  • Memory Degradation: Occasional loss of early video events in extremely long contexts.
  • Retrieval Fragility: Overwriting or hallucinations (e.g., object confusion) under pressure from memory cache in tasks like NIAVH.
  • Out-of-domain Generalization: Degraded results when evaluated on video lengths double the training regime in the absence of further fine-tuning.

Proposed directions include integration of LLM-internal memory states into bridge memory, tuning for extended context lengths, and developing richer planning and streaming evaluation benchmarks.

7. Significance and Context

VideoLLaMB combines frozen visual encoders and LLMs with lightweight recurrent memory bridges and a deterministic, similarity-based video segmentation algorithm. By keeping resource demands tractable while preserving semantic continuity, the approach sets new baselines for efficiency and accuracy in long-context video-language modeling. The demonstrated improvements across a range of VideoQA, planning, and retrieval tasks position VideoLLaMB as a foundation for future work on scalable, context-aware video understanding frameworks (Wang et al., 2024).
