ViLaMP: Hierarchical Video-Language Model
- ViLaMP is a hierarchical video-language model that employs differential distillation to prioritize query-relevant, temporally distinctive video regions.
- It combines query-driven keyframe selection and patch-level pooling to compress ultra-long videos (up to 10,000 frames) while optimizing computational and memory efficiency.
- Empirical results show ViLaMP achieves state-of-the-art performance on long-form benchmarks, reducing FLOPs and GPU memory usage compared to competing models.
ViLaMP is a hierarchical video-LLM developed to address the complexity and efficiency barriers of long-form video processing for vision-language understanding. Leveraging the principle of differential distillation, ViLaMP systematically assigns higher representational "precision" to video regions that are most relevant to a given textual query and least redundant within their temporal context. It combines query-driven keyframe selection with salient feature pooling to efficiently encode ultra-long videos (up to 10,000 frames) for downstream video-language tasks while optimizing computational and memory efficiency. ViLaMP has demonstrated state-of-the-art performance across multiple long-form video understanding benchmarks, supporting practical hour-scale inference on a single NVIDIA A100 GPU (Cheng et al., 3 Apr 2025).
1. Differential Distillation Principle
ViLaMP operationalizes differential distillation, which prioritizes the retention of information in proportion to its utility for a downstream query and its novelty within the temporal context. Given a video component $x$ (such as a frame or patch), the differential saliency score is:

$$s(x) = \mathrm{rel}(x, q) - \lambda \cdot \mathrm{red}(x, C)$$

where $\mathrm{rel}(x, q)$ quantifies the query relevance (e.g., as cosine similarity between the component embedding and the query embedding), and $\mathrm{red}(x, C)$ quantifies its redundancy with respect to a set of context features $C$.
The global selection and compression objective under a computational budget seeks to maximize the total differential information carried by the selected keyframes $\mathcal{K}$ and the merged non-keyframe features:

$$\max_{\mathcal{K} \subseteq V} \; \sum_{x \in \mathcal{K}} s(x) \quad \text{s.t.} \quad |\mathcal{K}| \le K, \;\; \lambda \ge 0$$

This principle enables ViLaMP to allocate more tokens to salient, query-critical content while aggressively compressing redundant video segments (Cheng et al., 3 Apr 2025).
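The saliency score above can be sketched in a few lines of numpy. The cosine-similarity relevance, max-similarity redundancy, and the weight `lam` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def saliency(e_x: np.ndarray, e_q: np.ndarray, context: list, lam: float = 0.5) -> float:
    """Differential saliency: query relevance minus weighted redundancy.

    e_x:     embedding of the video component (frame or patch)
    e_q:     embedding of the text query
    context: embeddings of already-retained context features
    lam:     redundancy weight (illustrative value)
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    rel = cos(e_x, e_q)                                      # query relevance
    red = max((cos(e_x, c) for c in context), default=0.0)   # redundancy vs. context
    return rel - lam * red
```

A component aligned with the query but orthogonal to its context scores near 1; a component that merely repeats the context scores negatively.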
2. Hierarchical Model Architecture
ViLaMP employs a two-tier hierarchy to compress and encode long videos efficiently:
- Frame Level (Mixed Precision): A subset of $K$ keyframes preserves full tokenized patch information ($N_p$ tokens per keyframe). The remaining $T - K$ non-keyframes are reduced to a single token each, reflecting mixed-precision treatment.
- Patch Level: Each non-keyframe undergoes a learnable softmax pooling over its $N_p$ patch embeddings to retain those features that are both maximally relevant to the query and minimally redundant with the temporally neighboring keyframes.
The processing pipeline is as follows:
- Video → CLIP-based frame encodings (via SigLIP-so400m).
- Differential Keyframe Selection (DKS).
- Differential Feature Merging (DFM) for non-keyframes.
- Vision–Language Connector (multi-layer perceptrons).
- LLM prompt construction and answer generation (using Qwen2-7B) (Cheng et al., 3 Apr 2025).
3. Differential Keyframe Selection Mechanism
Given a sequence of frames $V = \{f_1, \dots, f_T\}$ and a query $q$, each frame embedding $e_i$ and the query embedding $e_q$ are obtained via a CLIP encoder:

$$e_i = \mathrm{CLIP}_v(f_i), \qquad e_q = \mathrm{CLIP}_t(q)$$

Frame redundancy relative to a context set $C$ is defined as:

$$\mathrm{red}(f_i, C) = \max_{c \in C} \cos(e_i, e_c)$$

A greedy selection procedure sorts frames by query relevance $\cos(e_i, e_q)$, iteratively admitting frames whose maximum similarity with already selected keyframes does not exceed a threshold $\tau$, until the quota $K$ is reached. This produces a set $\mathcal{K}$ of query-relevant, temporally distinctive keyframes. The procedure's cost is dominated by the sort over $T$ frames and the pairwise checks against selected keyframes (Cheng et al., 3 Apr 2025).
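The greedy procedure can be sketched as follows; the default threshold `tau` and the tie-breaking order are assumptions for illustration, not the paper's published values:

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_keyframes(frame_embs: np.ndarray, query_emb: np.ndarray,
                     K: int, tau: float = 0.8) -> list:
    """Greedy differential keyframe selection (sketch).

    frame_embs: (T, d) CLIP frame embeddings
    query_emb:  (d,) CLIP query embedding
    K:          keyframe quota
    tau:        max allowed similarity to already chosen keyframes (assumed value)
    """
    relevance = np.array([cos_sim(e, query_emb) for e in frame_embs])
    order = np.argsort(-relevance)            # most query-relevant frames first
    selected = []
    for i in order:
        if len(selected) == K:
            break
        # admit only frames that are temporally distinctive w.r.t. chosen keyframes
        if all(cos_sim(frame_embs[i], frame_embs[j]) <= tau for j in selected):
            selected.append(int(i))
    return sorted(selected)
```

A near-duplicate of an already-selected keyframe is rejected even if it is highly query-relevant, which is what makes the selection temporally distinctive rather than purely relevance-ranked.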
4. Differential Feature Merging for Non-Keyframes
For each non-keyframe $f_i$, with $N_p$ spatial patches $\{p_{i,1}, \dots, p_{i,N_p}\}$ and nearest preceding keyframe $f_k$, per-patch relevance and redundancy are computed as:

$$\mathrm{rel}(p_{i,j}) = \cos(e_{i,j}, e_q)$$

$$\mathrm{red}(p_{i,j}) = \max_{m} \cos(e_{i,j}, e_{k,m})$$

Softmax pooling with sharpness $\alpha$ yields weights $w_{i,j}$:

$$w_{i,j} = \frac{\exp\!\big(\alpha \, [\mathrm{rel}(p_{i,j}) - \lambda \, \mathrm{red}(p_{i,j})]\big)}{\sum_{j'} \exp\!\big(\alpha \, [\mathrm{rel}(p_{i,j'}) - \lambda \, \mathrm{red}(p_{i,j'})]\big)}$$

The aggregated token $\bar{e}_i$ for $f_i$ is then:

$$\bar{e}_i = \sum_{j=1}^{N_p} w_{i,j} \, e_{i,j}$$

This mechanism ensures retention of spatially and temporally novel features for each compressed non-keyframe (Cheng et al., 3 Apr 2025).
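A minimal numpy sketch of this merging step, assuming illustrative values for the redundancy weight `lam` and sharpness `alpha`:

```python
import numpy as np

def merge_non_keyframe(patch_embs: np.ndarray, query_emb: np.ndarray,
                       keyframe_patch_embs: np.ndarray,
                       lam: float = 0.5, alpha: float = 10.0) -> np.ndarray:
    """Differential feature merging for one non-keyframe (sketch).

    patch_embs:          (N, d) patch embeddings of the non-keyframe
    query_emb:           (d,) query embedding
    keyframe_patch_embs: (M, d) patches of the nearest preceding keyframe
    lam, alpha:          assumed hyperparameters, not the paper's values
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    rel = np.array([cos(p, query_emb) for p in patch_embs])
    red = np.array([max(cos(p, kp) for kp in keyframe_patch_embs) for p in patch_embs])
    score = rel - lam * red                  # per-patch differential saliency
    w = np.exp(alpha * score)
    w /= w.sum()                             # softmax pooling weights
    return w @ patch_embs                    # single aggregated token
```

With a large sharpness, the pooled token concentrates almost entirely on patches that are query-relevant yet absent from the neighboring keyframe.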
5. Training Regime, Optimization, and Inference
The vision–language connector consists of two separate two-layer MLPs: one for keyframe patch embeddings and one for single-token non-keyframe features. Inputs are interleaved in temporal order with the query text appended, then passed to an LLM (e.g., Qwen2-7B). Training minimizes the autoregressive cross-entropy loss:

$$\mathcal{L} = -\sum_{t} \log P_\theta\big(y_t \mid y_{<t}, V, q\big)$$
The data schedule spans three phases (approx. 9.2 million samples):
- Pretraining on WebVid, InternVid (7.4M video–caption pairs)
- Short-video QA tuning (1.3M MC & OE samples)
- Long-video fine-tuning (0.5M, e.g., FineVideo, CinePile)
Optimization uses AdamW with cosine-decay learning rates (a smaller rate for the vision encoder than for the remaining parameters), per-device batch size 1 with gradient accumulation, and mixed precision (FP16/FP8) on 32 A100 GPUs, completing one epoch in approximately two weeks.
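The cross-entropy objective above can be written as a small numpy function (a numerical illustration, not the training code):

```python
import numpy as np

def cross_entropy_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean next-token cross-entropy over an answer sequence.

    logits:  (T, V) unnormalized scores for T answer positions over vocab size V
    targets: (T,) gold token ids
    """
    shifted = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```

For uniform logits over a vocabulary of size $V$, the loss reduces to $\ln V$, the usual sanity check for an untrained model.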
ViLaMP reduces token count from $\mathcal{O}(T \cdot N_p)$ in naive encodings to $\mathcal{O}(K \cdot N_p + T)$, utilizing approximately 16.3K tokens for a 10K-frame video. Empirical measurements show ≈50% lower GPU memory and ≈18% of the FLOPs relative to VideoChat-Flash at 10K frames (Cheng et al., 3 Apr 2025).
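The arithmetic behind the compression can be made concrete. The patch count 729 used below is an assumption about the encoder grid (a 27×27 layout), not a figure from the paper:

```python
def vilamp_token_count(T: int, K: int, patches_per_keyframe: int):
    """Token budget under ViLaMP's mixed-precision scheme vs. naive encoding.

    T: total frames; K: keyframes;
    patches_per_keyframe: full patch tokens kept per keyframe (assumed grid size)
    """
    naive = T * patches_per_keyframe              # every frame at full precision
    vilamp = K * patches_per_keyframe + (T - K)   # keyframes full, others 1 token
    return naive, vilamp
```

For example, `vilamp_token_count(10_000, 8, 729)` yields millions of tokens for the naive encoding against a few tens of thousands for the mixed-precision scheme, a reduction of more than two orders of magnitude.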
6. Empirical Performance and Benchmarking
A summary of ViLaMP’s comparative evaluation (7B parameters) against contemporaneous open-source video-LLMs (7–9B scale) is presented below.
| Model | LVBench | EgoSchema | LongVideoBench | MLVU | Video-MME Overall / Long |
|---|---|---|---|---|---|
| LLaVA-Video (7B) | – | 65.6 | 58.2 | 70.8 | 63.3 / 69.7 |
| NVILA (7B) | – | – | – | 70.1 | 64.2 / 70.0 |
| ViLaMP (7B, 1 FPS) | 45.2 | 70.2 | 61.2 | 72.6 | 67.5 / 73.5 |
On the VideoNIAH “needle-in-a-haystack” benchmark (2K → 10K frames), ViLaMP sustains ≈58% accuracy at 10K frames, while VideoChat-Flash drops to ≈47%. At 8K frames, ViLaMP achieves FLOPs = 2.56T and memory = 45GB, compared to VideoChat-Flash's 13.9T FLOPs and 92GB memory usage (Cheng et al., 3 Apr 2025).
7. Implementation Details and Usage
ViLaMP is implemented in PyTorch with Accelerate for mixed precision. Major components include:
- Vision Encoder: SigLIP-so400m-patch14-384 (HuggingFace)
- Embedding Backbone: CLIP-ViT-B-32 for frames and queries
- LLM: Qwen2-7B
- Image Resolution: 384×384 px
- Default Hyperparameters: keyframe quota $K$, temporal similarity threshold $\tau$, patch redundancy weight $\lambda$, pooling sharpness $\alpha$
- Tokenization: Keyframes retain all $N_p$ patch tokens; non-keyframes produce a single pooled token each
- Memory/Throughput: Designed for efficient batch-1 inference on single A100 GPUs (10K-frame videos)
- Code & Weights: Public release at https://github.com/steven-ccq/ViLAMP
ViLaMP extends the differential distillation approach through a lightweight hierarchical scheme, enabling end-to-end modeling of ultra-long video sequences while maintaining competitive accuracy and efficiency within its parameter regime (Cheng et al., 3 Apr 2025).