
LLaMA-VID: Scalable Video-Language Modeling

Updated 22 November 2025
  • LLaMA-VID is a video-language model framework that employs a dual-token per frame strategy to efficiently process long-form video content.
  • It compresses each video frame into two tokens, significantly reducing computational load while maintaining critical semantic information.
  • The architecture integrates advanced methods like scene tiling and sequential Q-Formers to preserve temporal dependencies and mitigate hallucinations.

LLaMA-VID refers to a class of video-LLMs that implement efficient, large-context video understanding by leveraging LLaMA or LLaMA-compatible LLMs in conjunction with highly compressed visual token representations and advanced cross-modal fusion strategies. The canonical work introducing this concept is "LLaMA-VID: An Image is Worth 2 Tokens in LLMs" (Li et al., 2023), which establishes a dual-token-per-frame paradigm for scalable video processing. Subsequent architectures, such as VideoLLaMB (Wang et al., 2 Sep 2024), Video-LLaMA (Zhang et al., 2023), and related models, extend and refine the LLaMA-VID approach to address long-context processing, multimodal integration, and hallucination mitigation in video-language tasks.

1. Dual-Token Paradigm and Model Structure

The foundational design of LLaMA-VID focuses on efficient tokenization of video frames to enable the processing of hour-scale video content within LLMs' limited context windows. Each frame $V_t \in \mathbb{R}^{H \times W \times 3}$ is passed through a frozen ViT-style visual encoder $f_{\rm vis}$ to yield patch embeddings $X_t \in \mathbb{R}^{N \times C}$, where $N$ is the number of patches and $C$ is the embedding dimension. A Q-Former or instruction-aware transformer $f_{\rm text}$ generates $M$ text-guided query embeddings $Q_t \in \mathbb{R}^{M \times C}$.

The dual-token formulation computes, per frame:

  • Context token: Aggregates semantic content relevant to user instructions via attention,

$$t_{\rm ctx} = E_t W_{\rm ctx}, \quad \text{where} \quad E_t = \mathrm{Mean}_{i=1..M}(A_t X_t), \quad A_t = \mathrm{Softmax}\!\Big(\frac{Q_t X_t^\top}{\sqrt{C}}\Big)$$

  • Content token: Obtained by spatially average pooling and projection,

$$t_{\rm cont} = Z_t W_{\rm vis}, \quad Z_t = \mathrm{Pool}(X_t)$$

These two tokens per frame, $T_t = [t_{\rm ctx}; t_{\rm cont}]$, are appended sequentially across all frames. The visual token stream $[T_1, ..., T_L]$ is concatenated with the text prompt and fed directly to the frozen or lightly tuned LLaMA decoder. No modifications to the transformer backbone are required beyond position encoding interpolation and context window extension for long videos (Li et al., 2023).
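The dual-token computation above can be summarized in a few lines of PyTorch. The sketch below follows the formulas directly; the function name, tensor shapes, and the projection matrices `W_ctx`/`W_vis` are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def dual_tokens(X_t, Q_t, W_ctx, W_vis):
    """Compute the two per-frame tokens described above (illustrative sketch).

    X_t:   (N, C) patch embeddings from the frozen visual encoder
    Q_t:   (M, C) text-guided query embeddings from the Q-Former
    W_ctx: (C, D) projection for the context token
    W_vis: (C, D) projection for the content token
    """
    C = X_t.shape[-1]

    # Context token: attend over patches with text-guided queries, then average over queries.
    A_t = F.softmax(Q_t @ X_t.T / C ** 0.5, dim=-1)   # (M, N)
    E_t = (A_t @ X_t).mean(dim=0)                     # (C,)
    t_ctx = E_t @ W_ctx                               # (D,)

    # Content token: spatial average pooling of patch embeddings, then projection.
    Z_t = X_t.mean(dim=0)                             # (C,)
    t_cont = Z_t @ W_vis                              # (D,)

    # Two tokens per frame, concatenated across frames before feeding the LLM.
    return torch.stack([t_ctx, t_cont], dim=0)        # (2, D)
```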

2. Training Strategies and Data Sources

LLaMA-VID adopts a staged training pipeline:

  1. Modality Alignment: Pretraining with large-scale image-caption and video-caption pairs (e.g., CC3M, WebVid) by minimizing cross-entropy loss on ground-truth captions, optimizing only the token projectors and Q-Former, while visual encoder and LLM weights remain frozen.
  2. Instruction Tuning: Fine-tuning on mixed-modality instruction datasets (ShareGPT, image QA, video QA) using the same objective.
  3. Long-Video Adaptation: For long-context capabilities, further tuning is performed with extended position encodings and additional synthetic long-form video QA, e.g., movie QA with up to 64K tokens (Li et al., 2023).

Data regimes include on the order of 558K image-caption and 232K video-caption pairs, with long-form adaptation using 9K movie QA and 6K LongLoRA samples. Optimizer and hyperparameters are explicitly reported: AdamW with learning rates in the range $[2 \times 10^{-5}, 10^{-3}]$, batch sizes up to 256 for alignment, and reduced batch sizes (e.g., 8) for the long-video phase.
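As a concrete illustration of the staged freezing schedule, the following is a minimal sketch of a stage-1 (modality alignment) optimizer setup. It assumes the model exposes `visual_encoder`, `llm`, `qformer`, and `projector` submodules; these attribute names and the default learning rate are placeholders, not the released training code.

```python
import torch

def build_stage1_optimizer(model, lr=1e-3):
    """Modality-alignment stage: train only the Q-Former and token projectors."""
    # Freeze the visual encoder and the LLaMA decoder, as in the alignment stage.
    for module in (model.visual_encoder, model.llm):
        for p in module.parameters():
            p.requires_grad = False

    # Optimize only the cross-modal components.
    trainable = list(model.qformer.parameters()) + list(model.projector.parameters())
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=0.0)
```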

3. Computational Efficiency and Scalability

By representing each frame with only two tokens, LLaMA-VID drastically reduces memory and compute relative to classical VLMs that use hundreds of tokens per frame. This compression allows models to accommodate thousands of frames within a 64K-token window (20,000 tokens for 10,000 frames, versus roughly 2.5 million tokens under conventional per-patch encoding).
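The token-budget arithmetic behind these figures can be checked directly; the numbers below assume 256 patch tokens per frame for a conventional encoder and the 64K-token window cited above.

```python
# Rough token-budget comparison for a 10,000-frame video (illustrative numbers).
frames = 10_000
conventional = frames * 256   # ~2.56M tokens: far beyond a 64K context window
llama_vid = frames * 2        # 20,000 tokens: fits comfortably within 64K

print(conventional, llama_vid, conventional / llama_vid)  # 2560000 20000 128.0
```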

Empirical results indicate that training and inference remain feasible on standard hardware (e.g., 8×A100 GPUs), and the architecture supports end-to-end processing of three-hour videos in a single pass. Ablations show that collapsing the content tokens from 256 to 1 per frame yields only a 2–6% drop in QA accuracy while providing roughly 128× compression of the visual token stream (Li et al., 2023).

VideoLLaMB further advances scalability using recurrent memory bridge layers and the SceneTiling algorithm, enabling efficient linear scaling of GPU use over video length, supporting up to 320 frames on a single A100 GPU with near-linear memory growth (Wang et al., 2 Sep 2024). BridgeLayers operate as compact single-layer transformers, consuming memory proportional to the number of segments rather than total frames.

4. Extensions: Long-Context, Segmentation, and Memory

SceneTiling, introduced by VideoLLaMB (Wang et al., 2 Sep 2024), partitions videos into semantically coherent segments using frame-to-frame cosine similarity of ViT CLS tokens, capturing content boundaries by identifying abrupt similarity drops. Each segment’s per-frame features are processed alongside memory tokens maintained by single-layer BridgeLayers. Memory tokens $m_i$ are updated per segment, with retrieval mechanisms ensuring long-range dependency tracking:

$$m_{i+1} \leftarrow m_{i+1} + m_i^{\rm retr}$$

where $m_i^{\rm retr}$ is computed via cross-attention against a cache of all prior memory states. This approach preserves context over hundreds of frames without quadratic attention scaling.
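A minimal sketch of SceneTiling-style segmentation and the memory update is given below, assuming per-frame ViT CLS embeddings as input; the fixed similarity threshold and the single cross-attention step are simplifications of the published method.

```python
import torch.nn.functional as F

def scene_tiling(cls_tokens, drop_threshold=0.8):
    """Split a video into segments at abrupt drops in frame-to-frame CLS similarity.

    cls_tokens: (L, C) per-frame ViT CLS embeddings.  The fixed threshold is an
    illustrative simplification of the published boundary criterion.
    """
    sims = F.cosine_similarity(cls_tokens[:-1], cls_tokens[1:], dim=-1)   # (L-1,)
    boundaries = (sims < drop_threshold).nonzero(as_tuple=True)[0] + 1
    starts = [0] + boundaries.tolist()
    ends = boundaries.tolist() + [cls_tokens.shape[0]]
    return list(zip(starts, ends))                     # segment index ranges

def update_memory(m_next, memory_cache):
    """Add a retrieved summary of past memory states to the new memory tokens.

    m_next:       (K, C) memory tokens produced for the current segment
    memory_cache: (S, K, C) stack of all prior memory states
    """
    cache = memory_cache.flatten(0, 1)                 # (S*K, C)
    attn = F.softmax(m_next @ cache.T / cache.shape[-1] ** 0.5, dim=-1)
    m_retr = attn @ cache                              # cross-attention retrieval
    return m_next + m_retr                             # m_{i+1} <- m_{i+1} + m_i^retr
```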

The combination of semantic segmentation and recurrent memory has empirically been shown to improve zero-shot VideoQA, egocentric planning, and frame retrieval by 2–8 points over prior LLaMA-VID methods, with robust performance observed on benchmarks such as EgoSchema, NExT-QA, MVBench, and Needle-in-a-Video-Haystack (Wang et al., 2 Sep 2024).

5. Hallucination Mitigation and Temporal Modeling

Vista-LLaMA (Ma et al., 2023) identifies that standard rotary position encoding (RoPE) in cross-modal transformers causes visual signal attenuation in long answer sequences, leading to model hallucination. To remedy this, the Equal-Distance-to-Visual-Tokens (EDVT) attention removes RoPE from attention weights between visual and text tokens while retaining it for text–text pairs:

$$\mathrm{Attention}_{\rm edvt}(Q,K,V)_j = \frac{\sum_{i \in T} \mathrm{sim}(R_j q_j, R_i k_i)\, v_i + \sum_{i \in V} \mathrm{sim}(q_j, k_i)\, v_i}{\sum_{i \in T} \mathrm{sim}(R_j q_j, R_i k_i) + \sum_{i \in V} \mathrm{sim}(q_j, k_i)}$$

where $T$ and $V$ index the text and visual tokens, $R_j$ denotes the rotary rotation at position $j$, and $\mathrm{sim}(\cdot,\cdot)$ is the attention similarity function.

This sustains visual token influence across extended text, decreasing irrelevant generations.
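A simplified single-head sketch of EDVT-style attention is shown below: RoPE is applied only when the key is a text token, while visual keys are scored without positional rotation. The `rope` callable and the boolean `is_visual` mask are assumed interfaces, not Vista-LLaMA's actual code.

```python
import torch

def edvt_attention(q, k, v, is_visual, rope):
    """EDVT attention sketch: apply RoPE only to text-text query/key pairs.

    q, k, v:   (L, d) query/key/value vectors for one attention head
    is_visual: (L,) boolean mask marking visual tokens among the keys
    rope:      callable applying rotary position embedding per position
               (assumed interface; not tied to a specific library)
    """
    d = q.shape[-1]
    q_rot, k_rot = rope(q), rope(k)

    # Text keys see position-rotated queries/keys; visual keys see the raw ones,
    # so every query sits at an "equal distance" from all visual tokens.
    scores_text = (q_rot @ k_rot.T) / d ** 0.5
    scores_vis = (q @ k.T) / d ** 0.5
    scores = torch.where(is_visual.unsqueeze(0), scores_vis, scores_text)

    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```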

Additionally, Vista-LLaMA introduces a sequential Q-Former, where each frame's visual tokens are projected using the previous frame's output as the next query, preserving local temporal dependencies without inflating token count. These enhancements yield improved grounding and state-of-the-art zero-shot VideoQA accuracy on NExT-QA and MSRVTT-QA (Ma et al., 2023).
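The following sketch illustrates the sequential Q-Former idea, with each frame's query output reused as the query set for the next frame; the use of `nn.TransformerDecoder` as the per-frame block and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SequentialQFormer(nn.Module):
    """Sketch of a sequential Q-Former: each frame's output queries the next frame."""

    def __init__(self, dim=768, num_queries=32, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.block = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):
        """frame_feats: (L, N, C) per-frame patch embeddings for L frames."""
        q = self.queries.unsqueeze(0)                  # (1, M, C) initial queries
        outputs = []
        for feats in frame_feats:                      # iterate over frames in order
            q = self.block(q, feats.unsqueeze(0))      # previous output becomes the query
            outputs.append(q.squeeze(0))               # (M, C) tokens for this frame
        return outputs
```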

6. Comparative Results and Practical Impact

| Model | NExT-QA (%) | MSVD-QA (%) | MSRVTT-QA (%) | ActivityNet-QA (%) | Efficiency (Tokens/Frame) |
|---|---|---|---|---|---|
| Video-ChatGPT | 54.6 | 64.9 | 49.3 | 35.2 | >256 |
| LLaMA-VID | – | 69.7 | 57.7 | 47.4 | 2 |
| Vista-LLaMA | 60.7 | 65.3 | 60.5 | 48.3 | ≤32 |
| VideoLLaMB | 71.1 | – | – | – | Segmented (linear memory) |

LLaMA-VID and its derivatives outperform prior video-language baselines on standard QA and text generation evaluations. LLaMA-VID (Vicuna-7B) achieves 69.7% MSVD-QA accuracy, 57.7% on MSRVTT-QA, and 47.4% on ActivityNet-QA, surpassing Video-ChatGPT and BT-Adapter (Li et al., 2023). Vista-LLaMA demonstrates state-of-the-art results on zero-shot NExT-QA (60.7%), and VideoLLaMB further extends performance with robust long-context retention and linear memory scaling (Wang et al., 2 Sep 2024).

7. Limitations and Research Directions

Key limitations of LLaMA-VID include possible loss of fine-grained detail due to aggressive token compression (especially when using a single content token per frame), and the inherent context window limit (e.g., 64K tokens for the LLaMA backbone), which constrains ultra-long or multi-day video inputs (Li et al., 2023). Static pooling does not account for temporal salience or scene transitions, although methods such as SceneTiling and sequential Q-Formers mitigate some of these issues.

Planned extensions of LLaMA-VID variants include adaptive tokenization, scene/keyframe detection, and richer multimodal fusion (e.g., audio, subtitles), drawing on methods such as Video-LLaMA's audio branch, which leverages ImageBind for cross-modal alignment (Zhang et al., 2023). Extensions to interactive, multi-camera, and VR content are under consideration, with prospects for improved cross-modal adapters, temporal modeling, and further efficiency gains across increasingly long-form, multimodal video-language reasoning tasks.
