ViLaMP: Hierarchical Video-Language Model

Updated 2 April 2026
  • ViLaMP is a hierarchical video-language model that employs differential distillation to prioritize query-relevant, temporally distinctive video regions.
  • It combines query-driven keyframe selection and patch-level pooling to compress ultra-long videos (up to 10,000 frames) while optimizing computational and memory efficiency.
  • Empirical results show ViLaMP achieves state-of-the-art performance on long-form benchmarks, reducing FLOPs and GPU memory usage compared to competing models.

ViLaMP is a hierarchical video-LLM developed to address the complexity and efficiency barriers of long-form video processing for vision-language understanding. Leveraging the principle of differential distillation, ViLaMP systematically assigns higher representational "precision" to video regions that are most relevant to a given textual query and least redundant within their temporal context. It combines query-driven keyframe selection with salient feature pooling to efficiently encode ultra-long videos (up to 10,000 frames) for downstream video-language tasks while optimizing computational and memory efficiency. ViLaMP has demonstrated state-of-the-art performance across multiple long-form video understanding benchmarks, supporting practical hour-scale inference on a single NVIDIA A100 GPU (Cheng et al., 3 Apr 2025).

1. Differential Distillation Principle

ViLaMP operationalizes differential distillation, which prioritizes the retention of information in proportion to its utility for a downstream query $Q$ and its novelty within the temporal context. Given a video component $v$ (such as a frame or patch), the differential saliency score is:

D(v) = R(v, Q) - T(v, \mathcal{C}(v))

where $R(v, Q) \in [-1, 1]$ quantifies the query relevance (e.g., as cosine similarity between the component and query embeddings), and $T(v, \mathcal{C}(v)) \in [0, 1]$ quantifies its redundancy with respect to a set of context features $\mathcal{C}(v)$.

The global selection and compression objective under computational budget $B$ seeks to maximize the sum of differential information among selected keyframes $\mathcal{K}$ and merged non-keyframe features:

\max_{\mathcal{K},\,\{w\}} \Bigg[ \sum_{f_n \in \mathcal{K}} D_f(f_n) + \sum_{f_n \notin \mathcal{K}} \sum_{m=1}^{M} w_n^m D_p(p_n^m) \Bigg]

with constraints $|\mathcal{K}| \le K$, $\sum_m w_n^m = 1$, and $w_n^m \ge 0$. This principle enables ViLaMP to allocate more tokens to salient, query-critical content while aggressively compressing redundant video segments (Cheng et al., 3 Apr 2025).
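The saliency score above can be sketched directly from its definition. This is a minimal illustration, not the released implementation: `cosine`, `differential_score`, and the clipping of redundancy into $[0, 1]$ are assumptions made for the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def differential_score(v_emb, q_emb, context_embs):
    """D(v) = R(v, Q) - T(v, C(v)): query relevance minus redundancy
    with respect to the component's temporal context features."""
    relevance = cosine(v_emb, q_emb)                  # R(v, Q) in [-1, 1]
    redundancy = max((cosine(v_emb, c) for c in context_embs), default=0.0)
    redundancy = max(0.0, redundancy)                 # T(v, C(v)) kept in [0, 1]
    return relevance - redundancy
```

A component aligned with the query but absent from the context scores near +1; one that duplicates its context scores near or below 0, steering the token budget toward novel, query-relevant material.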

2. Hierarchical Model Architecture

ViLaMP employs a two-tier hierarchy to compress and encode long videos efficiently:

  • Frame Level (Mixed Precision): A subset of $K$ keyframes preserves full tokenized patch information ($M$ tokens per keyframe). The remaining $N - K$ non-keyframes are reduced to a single token each, reflecting mixed-precision treatment.
  • Patch Level: Each non-keyframe undergoes a learnable softmax pooling over its $M$ patch embeddings to retain those features that are both maximally relevant to the query and minimally redundant with the temporally neighboring keyframes.

The processing pipeline is as follows:

  1. Video → CLIP-based frame encodings (via SigLIP-so400m).
  2. Differential Keyframe Selection (dks).
  3. Differential Feature Merging (dfm) for non-keyframes.
  4. Vision–Language Connector (multi-layer perceptrons).
  5. LLM prompt construction and answer generation (using Qwen2-7B) (Cheng et al., 3 Apr 2025).

3. Differential Keyframe Selection Mechanism

Given a sequence of frames $f_1, \dots, f_N$ and a query $Q$, each frame embedding $\mathbf{e}_{f_n}$ and the query embedding $\mathbf{e}_Q$ are obtained via a CLIP encoder. Frame-level relevance is the cosine similarity between the two:

R(f_n, Q) = \cos(\mathbf{e}_{f_n}, \mathbf{e}_Q)

The frame's differential score takes the general form

D_f(f_n) = R(f_n, Q) - T(f_n, \mathcal{C}(f_n))

Frame redundancy relative to the context $\mathcal{C}(f_n)$ is defined as the maximum similarity to any context frame:

T(f_n, \mathcal{C}(f_n)) = \max_{f_j \in \mathcal{C}(f_n)} \cos(\mathbf{e}_{f_n}, \mathbf{e}_{f_j})

A greedy selection procedure sorts frames by $D_f(f_n)$, iteratively admitting frames whose maximum similarity with already selected keyframes does not exceed a threshold $\tau$, until the quota $K$ is reached. This produces a set $\mathcal{K}$ of query-relevant, temporally distinctive keyframes; the procedure costs an $O(N \log N)$ sort plus at most $O(NK)$ similarity checks (Cheng et al., 3 Apr 2025).
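The greedy procedure can be sketched as follows. This is an illustrative reading of the steps described above, not the released code; the function name, the tie-breaking order of the sort, and the use of raw query relevance as the sort key are assumptions.

```python
import numpy as np

def select_keyframes(frame_embs, query_emb, K, tau):
    """Greedy differential keyframe selection (sketch):
    sort frames by query relevance, admit a frame only if its maximum
    cosine similarity to already-selected keyframes does not exceed tau,
    and stop once K keyframes are chosen."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Rank candidates by relevance to the query (most relevant first).
    order = sorted(range(len(frame_embs)),
                   key=lambda n: cos(frame_embs[n], query_emb), reverse=True)
    selected = []
    for n in order:
        if len(selected) >= K:
            break
        # Temporal-distinctiveness check against already-selected keyframes.
        if all(cos(frame_embs[n], frame_embs[j]) <= tau for j in selected):
            selected.append(n)
    return sorted(selected)
```

Near-duplicate frames are rejected even when highly query-relevant, which is what pushes the selected set toward coverage of distinct moments rather than one repeated scene.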

4. Differential Feature Merging for Non-Keyframes

For each non-keyframe $f_n$, with $M$ spatial patches $\{p_n^m\}_{m=1}^{M}$ and nearest preceding keyframe $f_k$, per-patch relevance and redundancy are computed as:

R(p_n^m, Q) = \cos(\mathbf{e}_{p_n^m}, \mathbf{e}_Q)

T(p_n^m, f_k) = \max_{m'} \cos(\mathbf{e}_{p_n^m}, \mathbf{e}_{p_k^{m'}})

giving the patch-level differential score $D_p(p_n^m) = R(p_n^m, Q) - T(p_n^m, f_k)$. Softmax pooling with sharpness $\alpha$ yields weights $w_n^m$:

w_n^m = \frac{\exp\big(\alpha\, D_p(p_n^m)\big)}{\sum_{m'=1}^{M} \exp\big(\alpha\, D_p(p_n^{m'})\big)}

The aggregated token $\bar{p}_n$ for $f_n$ is the weighted sum

\bar{p}_n = \sum_{m=1}^{M} w_n^m\, \mathbf{e}_{p_n^m}

This mechanism ensures retention of spatially and temporally novel features for each compressed non-keyframe (Cheng et al., 3 Apr 2025).
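The merging step can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions: the function name is hypothetical, and redundancy is taken as the maximum similarity of each patch to any patch of the nearest keyframe.

```python
import numpy as np

def merge_non_keyframe(patch_embs, query_emb, keyframe_patch_embs, sharpness=1.0):
    """Differential feature merging (sketch): pool a non-keyframe's M patch
    embeddings (M, d) into one token, weighting each patch by query relevance
    minus redundancy with the nearest keyframe's patches, via a softmax."""
    # Row-normalize patch embeddings for cosine similarities.
    Pn = patch_embs / (np.linalg.norm(patch_embs, axis=1, keepdims=True) + 1e-8)
    Pk = keyframe_patch_embs / (np.linalg.norm(keyframe_patch_embs, axis=1,
                                               keepdims=True) + 1e-8)
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)

    relevance = Pn @ q                     # R(p, Q), shape (M,)
    redundancy = (Pn @ Pk.T).max(axis=1)   # T(p, f_k): max sim. to keyframe patches
    scores = relevance - redundancy        # D_p(p)

    w = np.exp(sharpness * scores)
    w = w / w.sum()                        # softmax weights, sum to 1
    return w @ patch_embs                  # single pooled token, shape (d,)
```

Patches already represented in the neighboring keyframe receive low weight, so the single pooled token is dominated by what is new in the non-keyframe.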

5. Training Regime, Optimization, and Inference

The vision–language connector consists of two separate two-layer MLPs: one for keyframe patch embeddings and one for pooled non-keyframe tokens. Inputs are interleaved in temporal order with the query text appended, then passed to an LLM (e.g., Qwen2-7B). Training minimizes the standard autoregressive cross-entropy loss:

\mathcal{L} = -\sum_{t} \log P_\theta\big(y_t \mid y_{<t}, V, Q\big)
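The loss above is ordinary next-token cross-entropy over the answer tokens. A minimal sketch (the function name is hypothetical, and a log-sum-exp is used for numerical stability):

```python
import numpy as np

def autoregressive_ce(logits, targets):
    """Mean token-level cross-entropy for next-token prediction.
    logits: (T, V) unnormalized scores; targets: (T,) integer token ids."""
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize exp
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```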

The data schedule spans three phases (approx. 9.2 million samples):

  1. Pretraining on WebVid, InternVid (7.4M video–caption pairs)
  2. Short-video QA tuning (1.3M MC & OE samples)
  3. Long-video fine-tuning (0.5M, e.g., FineVideo, CinePile)

Optimization uses AdamW with cosine-decay learning rates (a smaller rate for the vision encoder than for the rest of the model), batch size 1 per device with gradient accumulation, and mixed precision (FP16/FP8) on 32× A100 GPUs, completing one epoch in approximately two weeks.

ViLaMP reduces token complexity from $O(NM)$ under naive full-patch encoding to $O(KM + (N - K))$, utilizing approximately 16.3K tokens for a 10K-frame video. Empirical measurements show ≈50% lower GPU memory and ≈18% of the FLOPs relative to VideoChat-Flash at 10K frames (Cheng et al., 3 Apr 2025).
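The token budget follows directly from the mixed-precision scheme: $K$ keyframes keep all $M$ patch tokens, and each remaining frame contributes one pooled token. The exact $K$ and $M$ are not stated here, so the values below are illustrative assumptions only:

```python
def vilamp_token_count(N, K, M):
    """Token budget under the mixed-precision scheme: K keyframes keep all
    M patch tokens; the remaining N - K frames contribute one token each."""
    return K * M + (N - K)

# Illustrative only: N = 10,000 frames, K = 9 keyframes, M = 729 patches/frame
# (K and M are assumed values, not confirmed by the source).
naive = 10_000 * 729                                   # every frame at full precision
mixed = vilamp_token_count(N=10_000, K=9, M=729)       # ViLaMP-style budget
compression = naive / mixed                            # ~440x fewer tokens
```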

6. Empirical Performance and Benchmarking

A summary of ViLaMP’s comparative evaluation (7B parameters) against contemporaneous open-source video-LLMs (7–9B scale) is presented below.

| Model | LVBench | EgoSchema | LongVideoBench | MLVU | Video-MME (Overall / Long) |
|---|---|---|---|---|---|
| LLaVA-Video (7B) | – | 65.6 | 58.2 | 70.8 | 63.3 / 69.7 |
| NVILA (7B) | – | – | – | 70.1 | 64.2 / 70.0 |
| ViLaMP (7B, 1 FPS) | 45.2 | 70.2 | 61.2 | 72.6 | 67.5 / 73.5 |

On the VideoNIAH “needle-in-a-haystack” benchmark (2K → 10K frames), ViLaMP sustains ≈58% accuracy at 10K frames, while VideoChat-Flash drops to ≈47%. At 8K frames, ViLaMP requires 2.56T FLOPs and 45 GB of memory, compared with VideoChat-Flash's 13.9T FLOPs and 92 GB (Cheng et al., 3 Apr 2025).

7. Implementation Details and Usage

ViLaMP is implemented in PyTorch with Accelerate for mixed precision. Major components include:

  • Vision Encoder: SigLIP-so400m-patch14-384 (HuggingFace)
  • Embedding Backbone: CLIP-ViT-B-32 for frames and queries
  • LLM: Qwen2-7B
  • Image Resolution: 384×384 px
  • Default Hyperparameters: keyframe quota $K$, temporal similarity threshold $\tau$, patch redundancy weight, and pooling sharpness (see the released configuration for values)
  • Tokenization: Keyframes retain their full grid of patch tokens; non-keyframes produce a single pooled token each
  • Memory/Throughput: Designed for efficient batch-1 inference on single A100 GPUs (10K-frame videos)
  • Code & Weights: Public release at https://github.com/steven-ccq/ViLAMP

ViLaMP extends the differential distillation approach across a lightweight hierarchical scheme, enabling end-to-end modeling on ultra-long video sequences while maintaining competitive accuracy and efficiency within its parameter regime (Cheng et al., 3 Apr 2025).
