NVILA-Video-15B: Efficient High-Res Video VLM
- The paper introduces NVILA-Video-15B, a VLM that uses a scale-then-compress strategy to drastically reduce spatial and temporal token counts while preserving high-resolution detail.
- It integrates a SigLIP-based vision encoder with a Qwen2-15B language backbone, achieving near state-of-the-art performance in video question answering benchmarks.
- High efficiency is achieved through aggressive token compression and optimized training (e.g., FP8 arithmetic, dataset pruning), reducing compute cost by up to 4.5×.
NVILA-Video-15B is a large-scale open vision-language model (VLM) designed to maximize both efficiency and accuracy in high-resolution video and image understanding. Built upon the architectural foundation of VILA, it employs a "scale-then-compress" paradigm for spatial and temporal input, coupled with a suite of optimization techniques spanning model design, training, inference, and fine-tuning. It integrates a SigLIP-based vision encoder with Qwen2-15B as its language backbone and is systematically engineered to deliver near state-of-the-art performance on video question answering (QA) benchmarks while drastically reducing computational requirements compared to previous VLMs (Liu et al., 5 Dec 2024).
1. Model Architecture
NVILA-Video-15B uses a multi-stage pipeline for visual input processing, combining a visual encoder, a projector, and an LLM token processor.
- Vision Encoder: SigLIP with a ViT-L backbone, 24 Transformer layers, 1,024 hidden dimension, and 16 attention heads, pretrained at a fixed input resolution.
- Projector: A 2-layer MLP that maps the visual encoder's 1,024-dimensional output to the 2,048-dimensional embedding space of the LLM.
- LLM (Token Processor): Qwen2-15B, comprising 48 Transformer layers, 12,288 hidden dimension, 96 attention heads, and accommodating a context length of up to 3,072 tokens (aggregating image, video, and text tokens).
The architecture enables high-resolution and long-context input via scale-then-compress mechanisms for both spatial and temporal domains:
| Component | Details | Token/Param Counts |
|---|---|---|
| Vision Encoder | SigLIP (ViT-L, 24L, 1,024d, 16h) | ≈307 M params |
| Projector | 2-layer MLP (1,024d→2,048d) | ≈8 M params |
| LLM Processor | Qwen2-15B (48L, 12,288d, 96h, 3,072 ctx window) | 15 B params |
| Total Parameters | — | ≈15.3 B |
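A minimal PyTorch-style sketch of how the three components listed above compose is shown below. Module and class names are hypothetical; only the projector dimensions (1,024 → 2,048) are taken from the description above, and this is not the reference implementation.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """2-layer MLP mapping 1,024-d vision features to the 2,048-d LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_visual_tokens, vision_dim)
        return self.mlp(visual_tokens)


class VLMPipeline(nn.Module):
    """Vision encoder -> projector -> LLM token processor (all passed in as modules)."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.vision_encoder(pixel_values)   # (B, N_vis, 1024)
        visual_embeds = self.projector(visual_tokens)        # (B, N_vis, 2048)
        # Concatenate visual and text embeddings into a single token sequence for the LLM.
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(inputs)
```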
Spatial Scale-then-Compress: Multi-scale dynamic tiling extracts overlapping tiles at three scales, which multiplies the number of visual tokens per image before compression while preserving high-resolution detail. A 3×3 spatial-to-channel (STC) pooling step then merges each 3×3 block of patches into a single token, so each tile's token count shrinks by a factor of 9 (the 3×3 pooling window).
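A minimal sketch of what such a 3×3 spatial-to-channel fold could look like, assuming per-tile feature maps whose height and width are divisible by 3; the tensor sizes in the example are illustrative only.

```python
import torch


def spatial_to_channel(x: torch.Tensor, window: int = 3) -> torch.Tensor:
    """Fold each `window x window` spatial block into the channel dimension.

    x: (batch, height, width, channels) visual features for one tile.
    Returns (batch, height//window, width//window, channels * window**2),
    i.e. window**2 = 9x fewer spatial tokens per tile.
    """
    b, h, w, c = x.shape
    assert h % window == 0 and w % window == 0, "feature map must be divisible by the window"
    x = x.reshape(b, h // window, window, w // window, window, c)
    x = x.permute(0, 1, 3, 2, 4, 5)   # group each 3x3 block of patches together
    return x.reshape(b, h // window, w // window, c * window * window)


# Example: a 27x27 patch grid with 1,024-d features -> 9x9 grid with 9,216-d features.
tokens = spatial_to_channel(torch.randn(1, 27, 27, 1024))
print(tokens.shape)  # torch.Size([1, 9, 9, 9216])
```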
Temporal Scale-then-Compress: Inputs comprise 256 uniformly sampled frames per video (substantially extended from the 8 or 32 frames used in VILA-1.5). Each frame is tokenized by the vision encoder, and the resulting per-frame tokens are then average-pooled across every group of 8 consecutive frames, providing an 8× temporal compression ratio.
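A similarly hedged sketch of the temporal path, assuming 256 uniformly sampled frames and average pooling over groups of 8 consecutive frames; the per-frame token count (196) and video length are placeholders, not values from the paper.

```python
import torch


def sample_frame_indices(num_video_frames: int, num_samples: int = 256) -> torch.Tensor:
    """Uniformly spaced frame indices over the full video."""
    return torch.linspace(0, num_video_frames - 1, num_samples).long()


def temporal_average_pool(frame_tokens: torch.Tensor, group: int = 8) -> torch.Tensor:
    """frame_tokens: (num_frames, tokens_per_frame, dim); averages every `group` frames."""
    t, n, d = frame_tokens.shape
    assert t % group == 0, "frame count must be divisible by the pooling group"
    return frame_tokens.reshape(t // group, group, n, d).mean(dim=1)


idx = sample_frame_indices(num_video_frames=9_000)        # e.g. a 5-minute, 30 fps video
pooled = temporal_average_pool(torch.randn(256, 196, 1024))
print(pooled.shape)  # torch.Size([32, 196, 1024]) -> 8x fewer temporal tokens
```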
Final spatiotemporal processing feeds these tokens into the LLM for multimodal reasoning.
2. Training and Fine-tuning Regimen
The training procedure is staged to align the multi-modal components and maximize transfer:
- Projector alignment: Pretraining with LLaVA-CC3M.
- Visual encoder pretraining: Leveraging ALLaVA and document/OCR corpora.
- Joint pretraining: Using datasets such as COYO, ShareGPT4V, Docmatix, and MMC4.
- Image instruction-tuning: Approximately 10M mixed vision-text samples.
- Video instruction-tuning: LLaVA-Video-178K and a subset of LLaVA-OneVision.
Key training parameters:
- Optimizer: AdamW (no weight decay).
- Learning rate schedule: Cosine decay with 3% warmup; the peak learning rate is set per training stage.
- Batch size: 2,048 (distributed across 128× H100 GPUs).
- Mixed precision: FP8 (COAT) for weights and activations.
Compute cost is reduced by 4.5× compared to a baseline 7B VILA model, requiring approximately 90 H100 GPU-days (2,160 GPU-hours) for full pretraining, with video instruction-tuning consuming roughly 10,000 GPU-hours.
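The optimizer and schedule described above (AdamW without weight decay, cosine decay with 3% warmup) can be sketched as follows; the peak learning rate, step count, and stand-in model are placeholders rather than the paper's exact values.

```python
import math
import torch


def cosine_with_warmup(step: int, total_steps: int, warmup_frac: float = 0.03) -> float:
    """Multiplicative LR factor: linear warmup for 3% of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


model = torch.nn.Linear(16, 16)  # stand-in for the full VLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)
total_steps = 10_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: cosine_with_warmup(step, total_steps)
)

for step in range(total_steps):
    optimizer.zero_grad()
    # ... forward / backward on a multimodal batch would go here ...
    optimizer.step()
    scheduler.step()
```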
3. Computational Efficiency and Memory Footprint
NVILA-Video-15B incorporates multiple strategies to minimize compute and memory requirements:
- Token Compression: Aggressive spatiotemporal compression reduces the number of visual tokens by up to 25× spatially and up to 8× temporally.
- Training Acceleration: FP8 arithmetic and large batch sizes result in a 2.0–2.9× increase in throughput compared to BF16.
- Dataset Pruning: DeltaLoss pruning enables 50% data reduction with negligible accuracy loss, effectively doubling training speed.
- Fine-tuning Memory Reduction: By employing PEFT techniques (LoRA or LN+LoRA), peak fine-tuning memory decreases from 63.5 GB to 19–21 GB, a 3.4× reduction.
- Inference Quantization: Quantizing the vision tower (W8A8) and LLM (W4A16+FP16 accumulate) accelerates inference—time-to-first-token (TTFT) drops from 0.90s to 0.65s (1.4×), and decoding kernel throughput increases by 1.7×.
At inference, the prefilling stage (vision encoding plus LLM key/value cache) requires 8–12 GB on an NVIDIA RTX 4090; decoding needs only 4–6 GB.
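To illustrate the kind of transform behind the W8A8/W4A16 settings mentioned above, the sketch below performs generic per-output-channel symmetric int8 weight quantization. It is an assumption-laden illustration, not the quantization kernel or calibration procedure used by the paper.

```python
import torch


def quantize_weight_int8(weight: torch.Tensor):
    """weight: (out_features, in_features). Returns int8 weights plus one scale per output channel."""
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0                                   # symmetric, per-row scale
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize_weight(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an fp16 approximation of the original weight for matmul."""
    return q.to(torch.float16) * scale.to(torch.float16)


w = torch.randn(2048, 2048)
q, scale = quantize_weight_int8(w)
w_hat = dequantize_weight(q, scale)
print((w - w_hat.float()).abs().max())   # small per-element quantization error
```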
4. Inference Performance and Latency
Optimizations yield substantial improvements in real-world efficiency:
- Prefilling (vision processing + LLM initial state): 1.6–2.2× faster than Qwen2-VL-7B under vLLM serving.
- Decoding throughput: 1.2–2.8× higher with the quantized LLM weights.
This translates to lower operational costs for deployment and enables interactive large-scale multimodal applications with long video contexts.
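For reference, the two latency metrics quoted above, TTFT and decoding throughput, can be measured against any streaming generation interface with a small harness like the one below; `stream_generate` is a hypothetical callable used purely for illustration.

```python
import time


def measure_latency(stream_generate, prompt):
    """Returns (time-to-first-token in seconds, decode throughput in tokens/s).

    Assumes `stream_generate(prompt)` yields one decoded token at a time and
    produces at least one token.
    """
    start = time.perf_counter()
    first_token_time = None
    num_tokens = 0
    for _token in stream_generate(prompt):
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now          # first token marks the end of prefilling
        num_tokens += 1
    end = time.perf_counter()
    ttft = first_token_time - start
    decode_tps = (num_tokens - 1) / max(end - first_token_time, 1e-9)
    return ttft, decode_tps
```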
5. Video Understanding Benchmarks and Accuracy
Evaluation covers a wide spectrum of video QA and reasoning datasets. For NVILA-8B (the closest analog with full numbers reported), results include:
| Dataset (metric) | NVILA-8B (256 fr) | Qwen2-VL-7B | Evaluation Type |
|---|---|---|---|
| Video-MME (w/o subtitles) | 64.2% | 63.3% | top-1 acc. |
| Video-MME (with subtitles) | 70.0% | 69.0% | top-1 acc. |
Other evaluation datasets include ActivityNet-QA (accuracy), LongVideoBench (val/test score), MLVU/MVBench (m-avg/test), and NExT-QA (multiple-choice accuracy).
For NVILA-15B, the projected accuracy improvements are +1–2% absolute over the 8B variant across all tasks, suggesting parity with or marginal improvement over state-of-the-art proprietary models (including GPT-4o mini) (Liu et al., 5 Dec 2024).
6. Design Significance and Implications
The NVILA-Video-15B design demonstrates that token-level spatiotemporal compression, combined with compute- and memory-optimized training and inference paradigms, can deliver high-accuracy VLMs with an order-of-magnitude greater resource efficiency. The scale-then-compress strategy allows for high-resolution and long-context video understanding without a linear explosion in computational requirements. Dataset pruning and mixed-precision arithmetic further substantiate that carefully curated and processed data, along with hardware-aware model design, are critical for sustainable scaling in future multimodal AI systems.
A plausible implication is that similar architectural and algorithmic advances could generalize to other large-scale multimodal domains where token explosion constrains practical deployment. The explicit separation of design for training, fine-tuning, and inference efficiency foregrounds the increasing importance of model lifecycle optimization for large backbone VLMs.