LongVLM: Hierarchical Video Language Model
- LongVLM is a video-based large language model that decomposes long videos into sequential segments, preserving fine-grained temporal details.
- It employs hierarchical token merging and global semantic integration to balance local details with overall context, improving performance in VideoQA tasks.
- The architecture achieves state-of-the-art results on the VideoChatGPT benchmark and zero-shot VideoQA datasets while remaining computationally efficient on long videos.
LongVLM is a Video-based LLM (VideoLLM) architecture designed for efficient understanding of long videos, where even a modest number of sampled frames yields tens of thousands of visual tokens. Unlike previous VideoLLMs that aggregate visual features into a single global representation, LongVLM decomposes videos into sequential short-term segments, applies hierarchical token merging to retain fine-grained local information, and integrates global semantic cues to enhance context comprehension. This approach balances computational efficiency with detailed temporal and spatial analysis, enabling significant gains over prior methods on both quantitative benchmarks and qualitative evaluations (Weng et al., 2024).
1. Motivation and Problem Formulation
Long video understanding poses challenges due to the high token count arising from many frames. For example, sampling 100 frames with 256 patch tokens per frame yields 25,600 tokens for a CLIP-ViT-L/14 backbone, exceeding the practical input constraints of off-the-shelf LLMs. Most existing VideoLLMs address this by compressing all frame features into a single vector, typically via pooling or learned queries, which captures overall context but fails to model fine-grained, temporally localized events. Further, there is often inadequate modeling of sequential sub-events and insufficient propagation of global semantics into localized representations. The formal goal is, given a video $V$ and user query $Q$, to generate a comprehensive answer
$$A = \mathrm{LLM}\big(\mathcal{E}(V),\, Q\big),$$
where the video encoding $\mathcal{E}(V)$ must preserve temporal order, retain both local and global information, and limit the total number of visual tokens (Weng et al., 2024).
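As a quick sanity check on these numbers, a minimal Python sketch of the arithmetic; the 4,096-token context budget used for comparison is an illustrative assumption about a typical off-the-shelf 7B LLM, not a figure from the paper.

```python
# Back-of-the-envelope token count for long-video input to an LLM.
# The 4096-token context budget below is an assumed, illustrative limit
# for an off-the-shelf 7B LLM; it is not a figure from the paper.
frames = 100              # uniformly sampled frames
patches_per_frame = 256   # 224x224 input, patch size 14 -> 16 x 16 = 256 tokens
raw_visual_tokens = frames * patches_per_frame
print(raw_visual_tokens)            # 25600
print(raw_visual_tokens > 4096)     # True: far exceeds a ~4k context window
print(305)                          # LongVLM's visual token budget after merging
```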
2. Model Architecture
LongVLM’s pipeline comprises several distinct steps:
- Frame-level Feature Extraction: Uniformly sample $T$ video frames, extract per-frame patch features from a frozen vision encoder (e.g., CLIP-ViT-L/14), as well as [CLS] tokens from the last $N_g$ layers.
- Segmentation and Hierarchical Token Merging: Divide the $T$ frames into $S$ contiguous segments of length $T_s = T/S$. Within each segment $s$, collect all patch tokens to form $X_s \in \mathbb{R}^{(T_s \cdot P) \times C}$, then reduce $X_s$ to $r$ tokens, yielding $\hat{X}_s \in \mathbb{R}^{r \times C}$, using a hierarchical bipartite matching mechanism inspired by ToMe.
- Global Semantic Integration: Average the [CLS] tokens temporally for each selected layer to obtain global feature vectors $g_1, \dots, g_{N_g}$, then stack them as $G \in \mathbb{R}^{N_g \times C}$ and concatenate them with the temporally ordered local segment tokens $[\hat{X}_1, \dots, \hat{X}_S]$. This ensures every local token can attend to the global context through self-attention.
- Projection and Generation: Project the concatenated video token sequence $Z$ using a learned linear mapping $W$ into the LLM's input space. The frozen LLM (Vicuna-7B v1.1) receives both the visually encoded tokens and the text query, autoregressively generating the answer (Weng et al., 2024). A sketch of the end-to-end forward pass is given below.
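The list above maps onto a short forward-pass sketch. This is a hedged reconstruction from the paper's description, not the released code: random tensors stand in for frozen CLIP features, the segment reduction is approximated by adaptive average pooling rather than ToMe-style bipartite matching (sketched in Section 3), and the values $S = 10$, $r = 30$, and $N_g = 5$ are assumptions chosen only so that the 305-token budget reported later is reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, P, C = 100, 256, 1024   # frames, patch tokens per frame, CLIP feature dim
S = 10                     # number of contiguous segments (assumed value)
r = 30                     # merged tokens kept per segment (assumed value)
N_g = 5                    # number of global [CLS] tokens (assumed value)
D_llm = 4096               # LLM hidden size (Vicuna-7B)

# 1. Frame-level features; random tensors stand in for frozen CLIP outputs.
patch_tokens = torch.randn(T, P, C)   # per-frame patch features
cls_tokens = torch.randn(N_g, T, C)   # [CLS] tokens from the last N_g layers

# 2. Segment-wise reduction (placeholder: adaptive average pooling instead of
#    the ToMe-style bipartite matching used by the actual model).
local_tokens = []
for s in range(S):
    seg = patch_tokens[s * (T // S):(s + 1) * (T // S)].reshape(-1, C)  # (T/S * P, C)
    pooled = F.adaptive_avg_pool1d(seg.t().unsqueeze(0), r)             # (1, C, r)
    local_tokens.append(pooled.squeeze(0).t())                          # (r, C)
local_tokens = torch.cat(local_tokens, dim=0)                           # (S*r, C) = (300, C)

# 3. Global semantics: average each selected layer's [CLS] token over time.
global_tokens = cls_tokens.mean(dim=1)                                  # (N_g, C)

# 4. Concatenate temporally ordered local tokens with global tokens, project.
video_tokens = torch.cat([local_tokens, global_tokens], dim=0)          # (305, C)
projector = nn.Linear(C, D_llm)       # the only trainable module in LongVLM
llm_inputs = projector(video_tokens)  # (305, D_llm); prepended to the text query
print(llm_inputs.shape)               # torch.Size([305, 4096])
```

The placeholder pooling only keeps the shapes honest; the actual per-segment reduction is the bipartite matching procedure formalized in Section 3.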
3. Mathematical Formulation
Key notation and computational steps are as follows:
- $T$: number of sampled frames; $P$: patch tokens per frame; $C$: channel dimension; $N_g$: number of encoder layers whose [CLS] tokens supply global features.
- Hierarchical merging reduces each segment's patch tokens from $T_s \cdot P$ to $r$ via the following steps (a code sketch follows at the end of this section):
  - Partition the tokens into two disjoint sets $\mathcal{A}$ and $\mathcal{B}$,
  - Compute multi-head cosine similarities across all cross-set pairs,
  - Pool and merge the top-scoring pairs, repeating until only $r$ tokens remain.
- Concatenated token sequence: $Z = [\hat{X}_1, \dots, \hat{X}_S, G] \in \mathbb{R}^{(S \cdot r + N_g) \times C}$.
- Only the projection layer $W$ is fine-tuned, optimizing the autoregressive cross-entropy loss
  $$\mathcal{L} = -\sum_{i=1}^{L} \log p_{\theta}\big(a_i \mid Z, Q, a_{<i}\big),$$
  where $a_i$ is the $i$-th answer token, $L$ the answer length, and the trainable parameters $\theta$ contain only $W$ (Weng et al., 2024).
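The bipartite merging step can be sketched as follows. This is an illustrative reconstruction, not the official implementation: the function name is hypothetical, plain (single-head) cosine similarity is used in place of the multi-head similarity described above, and the halve-per-pass schedule is an assumption about how the hierarchical reduction proceeds.

```python
import torch

def bipartite_merge(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Iteratively merge the most similar token pairs until `keep` remain.

    tokens: (N, C) patch tokens of one segment.
    keep:   target number of tokens r.
    """
    while tokens.shape[0] > keep:
        n = tokens.shape[0]
        # Partition into two sets (here: alternating indices, as in ToMe).
        a, b = tokens[0::2], tokens[1::2]
        # Cosine similarity between every token in A and every token in B.
        a_n = torch.nn.functional.normalize(a, dim=-1)
        b_n = torch.nn.functional.normalize(b, dim=-1)
        sim = a_n @ b_n.t()                      # (|A|, |B|)
        best_sim, best_idx = sim.max(dim=-1)     # best partner in B for each A token
        # Merge at most |A| tokens per pass, but never drop below `keep`.
        num_merge = min(a.shape[0], n - keep)
        merge_order = best_sim.argsort(descending=True)[:num_merge]
        keep_mask = torch.ones(a.shape[0], dtype=torch.bool)
        keep_mask[merge_order] = False
        b = b.clone()
        for i in merge_order.tolist():
            j = best_idx[i].item()
            b[j] = (b[j] + a[i]) / 2             # average-pool the merged pair
        tokens = torch.cat([a[keep_mask], b], dim=0)
    return tokens

# Example: reduce one segment's tokens (T_s * P = 2560) down to r = 30.
segment_tokens = torch.randn(2560, 1024)
merged = bipartite_merge(segment_tokens, keep=30)
print(merged.shape)   # torch.Size([30, 1024])
```

Each pass merges roughly half of the tokens, which is what makes the reduction hierarchical; a few passes bring a 2,560-token segment down to the target $r$.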
4. Training Protocol and Experimental Design
The model is evaluated on both VideoChatGPT benchmarks and zero-shot video question-answering (VideoQA) datasets. Training specifics:
- Vision encoder: CLIP-ViT-L/14 (frozen).
- LLM: Vicuna-7B v1.1 (frozen).
- Fine-tune only the projection layer $W$, using 3 epochs, learning rate 2e-5, batch size 32, and 4 × A100-80GB GPUs (a minimal configuration sketch follows this list).
- Frame sampling: $T = 100$ frames per video; segments: $S$ contiguous segments of $T/S$ frames each; merged tokens: $r$ per segment; global tokens: $N_g$; total visual tokens: $S \cdot r + N_g = 305$.
- Baselines: VideoChat, LLaMA Adapter v2, Video LLaMA, BT-Adapter, Valley, Video-ChatGPT (Weng et al., 2024).
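As a minimal configuration sketch of this fine-tuning setup (assuming PyTorch; `vision_encoder`, `llm`, and `projector` are hypothetical stand-in modules with illustrative shapes, and AdamW is an assumed optimizer choice, not stated in the source):

```python
import torch
import torch.nn as nn

# Stand-in modules; shapes are illustrative, not taken from the released code.
vision_encoder = nn.Linear(1024, 1024)   # stands in for frozen CLIP-ViT-L/14
llm = nn.Linear(4096, 4096)              # stands in for frozen Vicuna-7B v1.1
projector = nn.Linear(1024, 4096)        # the only trainable component

# Freeze everything except the projection layer.
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

# Optimizer over the projector only; AdamW is an assumed choice, while the
# learning rate, epoch count, and batch size are those reported above.
optimizer = torch.optim.AdamW(projector.parameters(), lr=2e-5)
num_epochs, batch_size = 3, 32
```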
5. Quantitative and Qualitative Results
LongVLM achieves state-of-the-art results on key long-video understanding benchmarks:
- VideoChatGPT benchmark (mean across five criteria): LongVLM 2.89 vs. BT-Adapter 2.69 and Video-ChatGPT 2.42.
- Zero-shot VideoQA:
| Dataset | BT-Adapter (Acc. %) | LongVLM (Acc. %) | LongVLM (Gen. Score) |
|---|---|---|---|
| ANET-QA | 45.7 | 47.6 | 3.3 |
| MSRVTT-QA | 57.0 | 59.8 | 3.3 |
| MSVD-QA | 67.5 | 70.0 | 3.8 |
- Ablation and Sensitivity:
- Global semantic integration yields the best mean score (2.89).
- Increasing the number of merged tokens per segment from 10 to 30 improves accuracy and mean score without significant memory overhead.
- Ablations confirm that both hierarchical merging and global integration are essential for optimal performance.
Qualitative comparisons demonstrate that LongVLM avoids the common errors of previous models (e.g., misclassifying object colors or actions) by preserving segment-level detail and incorporating global context. For example, on a bike repair video, LongVLM correctly identifies "brown clothes" and "bicycle chain," while VideoChatGPT outputs "gray clothes" and "wheel" (Weng et al., 2024).
6. Limitations and Future Directions
Current limitations include:
- Output restricted to video-to-text generation.
- Fixed frame sampling (100 frames) may be suboptimal for much longer videos.
- Tokens are merged only within segments; there is no cross-segment token reduction.
Possible future extensions include support for multimodal generation (e.g., video and audio), pretraining the hierarchical merging and global semantic integration on large-scale video datasets, and more adaptive segmentation schemes for variable-length videos (Weng et al., 2024).
7. Code and Accessibility
The implementation and pretrained models for LongVLM are available at https://github.com/ziplab/LongVLM (Weng et al., 2024). This enables reproducibility and further research into efficient, high-fidelity video understanding with LLMs.