VideoLLaMB: Long-Context Video Understanding
- The paper introduces a recurrent memory bridge mechanism that preserves semantic continuity across extensive video contexts.
- It leverages a frozen ViT-L/14 encoder and a Vicuna-7B language model with minimal parameter tuning to maintain linear GPU memory scaling.
- Empirical results on VideoQA, planning, and retrieval benchmarks demonstrate significant performance improvements over baseline models.
VideoLLaMB (Long-context Video Understanding with Recurrent Memory Bridges) is a framework for efficient and semantically robust video-language modeling designed to process long video sequences while maintaining tractable computational and memory requirements. It employs a recurrent memory mechanism within bridge layers and introduces a model-free video segmentation algorithm, SceneTilling, to preserve semantic continuity across extensive video contexts. VideoLLaMB demonstrates notable performance improvements on multiple video question answering (VideoQA), planning, and retrieval benchmarks, while maintaining a linear GPU memory scaling profile. The architecture leverages a frozen ViT-L/14 video encoder and a Vicuna-7B-v1.5 LLM with modifications localized to a small set of bridge projection parameters (Wang et al., 2024).
1. Framework Architecture
VideoLLaMB’s architecture consists of modular components configured for minimal computational footprint and maximal extensibility:
- Video Encoder: Utilizes a frozen ViT-L/14 model to extract per-frame embeddings.
- Language Backbone: Vicuna-7B-v1.5, context window 2048 tokens, with all parameters frozen except for the bridge projection.
- Memory Bridge Layers: Transformer-based modules that take as input a fixed set of temporal memory tokens and segment-level vision tokens, outputting both an updated set of memory tokens and a summary for downstream language reasoning.
- SceneTilling Segmenter: Algorithm to partition video into temporally coherent, semantically distinct segments.
- Memory Cache & Retrieval: Cache maintaining all prior memory states for cross-attentive retrieval, mitigating vanishing gradients across extended contexts.
The architecture operates by sampling video frames, segmenting them using SceneTilling, extracting features for each segment, and iteratively updating recurrent memory tokens via bridge layers. The summarized output is projected for the LLM, which performs tasks such as question answering or caption generation.
2. Recurrent Memory Bridges and Computational Characteristics
The core of VideoLLaMB’s long-context processing is the recurrent memory bridge mechanism:
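- Memory Token Definition: At each time step $t$, a fixed set of memory tokens $m_t$ encapsulates a summary of the video seen so far.
- BridgeLayer Update: Operates over the concatenated memory and current segment tokens,
$$[m_{t+1};\, o_t] = \mathrm{BridgeLayer}([m_t;\, s_t]),$$
where $s_t$ contains the frame embeddings of segment $t$ and $o_t$ is the segment summary passed on for language reasoning. $m_{t+1}$ is then re-contextualized via cross-attention into a cache $M_t = [m_1; \dots; m_t]$ of all previous memory tokens:
$$\tilde{m}_{t+1} = \mathrm{softmax}\!\left(\frac{(m_{t+1} W_Q)(M_t W_K)^{\top}}{\sqrt{d_k}}\right) M_t W_V$$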
This operation enables selective retrieval from extended history while preventing loss of earlier information.
- Computational Complexity: Because bridge attention is restricted to a fixed number of tokens per segment, overall memory and compute cost is $O(N)$ for a video of $N$ segments, where $N$ is typically larger at test time than during training. This linear scaling is maintained in practice, enabling VideoLLaMB to process 320 frames on a single Nvidia A100/A800 GPU with a modest increase in memory usage (from 11 GB for 16 frames to 24 GB for 320 frames).
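The recurrence described above can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the attention here is single-head with identity projections instead of learned weights, and the names `attend`, `bridge_step`, and `process_video` are hypothetical. Storing the retrieved memory state in the cache, and seeding the cache with the initial memory, are also assumptions made to keep the example short.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Single-head dot-product attention with identity projections (toy stand-in)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
                  for kv in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                        for j in range(dim)])
    return outputs

def bridge_step(memory, segment):
    """Self-attend over [memory; segment]; return (new memory, segment summary)."""
    tokens = memory + segment
    updated = attend(tokens, tokens)
    return updated[:len(memory)], updated[len(memory):]

def process_video(segments, init_memory):
    """Run the recurrence over all segments, keeping a cache of memory states."""
    memory = [list(tok) for tok in init_memory]
    cache = [list(tok) for tok in init_memory]  # assumption: cache seeded with initial memory
    summaries = []
    for segment in segments:
        new_memory, summary = bridge_step(memory, segment)
        # Retrieval: the new memory queries every cached memory token.
        memory = attend(new_memory, cache)
        cache.extend(memory)
        summaries.append(summary)
    return memory, summaries, cache

# Three toy segments of four 2-d tokens each, with two memory tokens.
segments = [[[float(s), 1.0]] * 4 for s in range(3)]
memory, summaries, cache = process_video(segments, [[0.0, 0.0], [1.0, 1.0]])
print(len(memory), len(summaries), len(cache))  # 2 3 8
```

Each call to `bridge_step` sees only the memory tokens plus one segment, so its cost does not depend on how many segments came before; only the cache grows, by one memory state per segment.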
3. SceneTilling Algorithm: Semantic Video Segmentation
SceneTilling is a model-free segmentation approach developed to preserve semantic structure and minimize information dilution during recurrent summarization:
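- Algorithmic Procedure: For an input frame sequence $\{v_1, \dots, v_n\}$ with embeddings $\{e_1, \dots, e_n\}$, cosine similarities between adjacent frames, $c_i = \cos(e_i, e_{i+1})$, are computed. For each $c_i$, a depth score is defined as
$$d_i = \frac{c_i^{l} + c_i^{r} - 2c_i}{2},$$
where $c_i^{l}$ and $c_i^{r}$ are the highest similarities to the left and right of $c_i$. A segmentation threshold $\mu + \alpha\sigma$ is imposed, where $\mu, \sigma$ are the mean and std of $\{d_i\}$ and $\alpha$ is a hyperparameter. Frames at valley indices whose depth exceeds the threshold are used as split points, so that $K-1$ split points yield $K$ semantic segments.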
- Integration with Memory: Each segment is independently processed by memory bridges, ensuring self-attention operations remain within semantically homogeneous regions, thereby optimizing recurrent memory efficiency and retaining intra-segment detail.
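A minimal sketch of this kind of similarity-based segmentation follows, assuming a TextTiling-style depth score over adjacent-frame cosine similarities and a threshold of mean plus `alpha` times the standard deviation. The function name `scene_tilling`, the default `alpha=0.5`, and the use of the population standard deviation are illustrative assumptions, not values taken from the paper. Embeddings are assumed to be non-zero vectors.

```python
import math
from statistics import mean, pstdev

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def scene_tilling(frames, alpha=0.5):
    """Split frame embeddings into contiguous (start, end) segments, end exclusive.

    alpha is a hypothetical default for the threshold hyperparameter.
    """
    n = len(frames)
    if n < 3:
        return [(0, n)]
    # c[i]: similarity between frame i and frame i + 1.
    c = [cosine(frames[i], frames[i + 1]) for i in range(n - 1)]
    # d[i]: how deep the valley at c[i] is relative to the highest
    # similarity on each side of it.
    d = []
    for i in range(len(c)):
        left = max(c[: i + 1])
        right = max(c[i:])
        d.append((left + right - 2 * c[i]) / 2)
    threshold = mean(d) + alpha * pstdev(d)
    # A deep valley between frames i and i + 1 starts a new segment at i + 1.
    cuts = [i + 1 for i, depth in enumerate(d) if depth > threshold]
    bounds = [0] + cuts + [n]
    return list(zip(bounds[:-1], bounds[1:]))

# Four frames of one "scene" followed by four frames of another.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
print(scene_tilling(frames))  # [(0, 4), (4, 8)]
```

Because the procedure uses only similarities between precomputed embeddings, it needs no trained segmentation model, and each resulting segment can be fed to the memory bridge independently.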
4. Training Regime and Memory Scaling
VideoLLaMB is instruction-tuned on a public video dataset (∼1 million frames) split into 16-frame clips, with open QA pairs for supervision:
- Training Configuration: Batch size 8, 1 epoch, 4×A800 GPUs; only the bridge parameters are updated, with cross-entropy loss over target text tokens.
- Memory Efficiency: Owing to segment-localized attention, GPU memory usage increases linearly with the number of segments:
- 16 frames: 11 GB
- 32 frames: 13 GB
- 64 frames: 16 GB
- 320 frames: 24 GB
- This contrasts with the quadratic scaling typical of global self-attention, enabling practical long-form reasoning on standard hardware.
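The difference between the two scaling regimes can be made concrete by counting attention score entries. The sketch below is back-of-the-envelope arithmetic, not a measurement: the token counts (256 vision tokens per segment, 32 memory tokens) are hypothetical values chosen for illustration, and the function names are made up for this example.

```python
def segment_local_pairs(num_segments, tokens_per_segment, memory_tokens):
    """Score entries when each segment attends only to itself plus the memory tokens."""
    window = tokens_per_segment + memory_tokens
    return num_segments * window * window

def global_pairs(num_segments, tokens_per_segment):
    """Score entries when every token attends to every token in the video."""
    total = num_segments * tokens_per_segment
    return total * total

# Hypothetical sizes: 256 vision tokens per segment, 32 memory tokens.
for n in (1, 2, 4, 8):
    print(n, segment_local_pairs(n, 256, 32), global_pairs(n, 256))
```

Doubling the number of segments doubles the segment-local count but quadruples the global one, which is why the gap widens quickly for long videos.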
5. Empirical Performance and Benchmarks
VideoLLaMB’s empirical superiority is demonstrated across diverse evaluation protocols:
- EgoSchema VideoQA: For 32-frame context, VideoLLaMB achieves 53.8% accuracy vs. PLLaVA's 43.8%, a 10.0-point advantage.
- NExT-QA: Outperforms PLLaVA on temporal and causal questions and in overall accuracy (71.1 vs. 68.2).
- MVBench: Achieves 49.3 (PLLaVA-matched train data) and up to 52.5 (extended MVBench data), compared to 46.6 for the PLLaVA-7B baseline.
- Egocentric Planning (EgoPlan): Outperforms baselines (32.32% vs. 30.26% for PLLaVA).
- NIAVH (Needle in a Video Haystack): For frame retrieval in 320-second videos, VideoLLaMB achieves a higher average score than PLLaVA and maintains its scores even when the event of interest occurs near the video’s end.
Abridged results table:
| Model | VideoQA (EgoSchema, 32f) | NExT-QA (all) | MVBench | EgoPlan |
|---|---|---|---|---|
| PLLaVA (7B, 32f) | 43.8 | 68.2 | 46.6 | 30.26 |
| VideoLLaMB (7B, 32f) | 53.8 | 71.1 | 49.3–52.5 | 32.32 |
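The absolute gains implied by the table can be checked with a few lines of arithmetic. The numbers are copied from the table above; the dictionary layout and the `gains` helper are illustrative only, and the MVBench range is split into its low and high ends.

```python
# (PLLaVA, VideoLLaMB) scores taken from the abridged results table.
results = {
    "EgoSchema (32f)": (43.8, 53.8),
    "NExT-QA (all)": (68.2, 71.1),
    "MVBench (low end)": (46.6, 49.3),
    "MVBench (high end)": (46.6, 52.5),
    "EgoPlan": (30.26, 32.32),
}

def gains(table):
    """Absolute point difference (VideoLLaMB minus PLLaVA), rounded to 2 decimals."""
    return {name: round(ours - base, 2) for name, (base, ours) in table.items()}

for name, delta in gains(results).items():
    print(f"{name}: +{delta}")
```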
- Streaming Captioning: SceneTilling enables unsupervised discovery of shot boundaries, allowing direct caption generation at segment transition points.
- Qualitative Reasoning: In complex planning tasks (e.g., “open fridge → pour milk...”), VideoLLaMB maintains continuity, avoiding error patterns where baseline models rely only on first/final frames.
6. Analysis and Prospective Directions
Strengths of VideoLLaMB include:
- Linear Memory Scaling: Processes up to 320 frames on a single GPU, facilitating long-form video reasoning previously inaccessible to large-scale models.
- Semantic Segmentation: SceneTilling preserves intra-segment coherence, preventing irrelevant frames from corrupting the segment summary.
- Generalization: Demonstrates robust zero-shot performance across VideoQA, planning, and retrieval benchmarks.
Limitations are noted:
- Memory Degradation: Occasional loss of early video events in extremely long contexts.
- Retrieval Fragility: Overwriting or hallucinations (e.g., object confusion) under pressure from memory cache in tasks like NIAVH.
- Out-of-domain Generalization: Degraded results when evaluated on video lengths double the training regime in the absence of further fine-tuning.
Proposed directions include integration of LLM-internal memory states into bridge memory, tuning for extended context lengths, and developing richer planning and streaming evaluation benchmarks.
7. Significance and Context
VideoLLaMB combines frozen visual encoders and LLMs with lightweight recurrent memory bridges and a deterministic, similarity-based video segmentation algorithm. This approach establishes new baselines for efficiency and efficacy in long-context video-language modeling by ensuring both semantic fidelity and tractable resource demands. The demonstrated improvements across a range of VideoQA, planning, and retrieval tasks position VideoLLaMB as a foundation for future explorations into scalable, context-aware video understanding frameworks (Wang et al., 2024).