NVILA-Video-15B: Efficient High-Res Video VLM
- The paper introduces NVILA-Video-15B, a VLM that uses a scale-then-compress strategy to drastically reduce spatial and temporal token counts while preserving high-resolution detail.
- It integrates a SigLIP-based vision encoder with a Qwen2-15B language backbone, achieving near state-of-the-art performance in video question answering benchmarks.
- High efficiency is achieved through aggressive token compression and optimized training (e.g., FP8 arithmetic, dataset pruning), reducing compute cost by up to 4.5×.
NVILA-Video-15B is a large-scale open vision-language model (VLM) designed to maximize both efficiency and accuracy in high-resolution video and image understanding. Built upon the architectural foundation of VILA, it employs a "scale-then-compress" paradigm for spatial and temporal input, coupled with a suite of optimization techniques spanning model design, training, inference, and fine-tuning. It integrates a SigLIP-based vision encoder with Qwen2-15B as its language backbone and is systematically engineered to deliver near state-of-the-art performance on video question answering (QA) benchmarks while drastically reducing computational requirements compared to previous VLMs (Liu et al., 5 Dec 2024).
1. Model Architecture
NVILA-Video-15B uses a multi-stage pipeline for visual input processing, combining a visual encoder, a projector, and an LLM token processor.
- Vision Encoder: SigLIP with a ViT-L backbone, 24 Transformer layers, 1,024 hidden dimension, and 16 attention heads, pretrained at a fixed input resolution.
- Projector: A 2-layer MLP that maps the visual encoder's 1,024-dimensional output to the 2,048-dimensional embedding space of the LLM.
- LLM (Token Processor): Qwen2-15B, comprising 48 Transformer layers, 12,288 hidden dimension, 96 attention heads, and accommodating a context length of up to 3,072 tokens (aggregating image, video, and text tokens).
The architecture enables high-resolution and long-context input via scale-then-compress mechanisms for both spatial and temporal domains:
| Component | Details | Token/Param Counts |
|---|---|---|
| Vision Encoder | SigLIP (ViT-L, 24L, 1,024d, 16h) | ≈307 M params |
| Projector | 2-layer MLP (1,024d→2,048d) | ≈8 M params |
| LLM Processor | Qwen2-15B (48L, 12,288d, 96h, 3,072 ctx window) | 15 B params |
| Total Parameters | — | ≈15.3 B |
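A minimal PyTorch-style sketch of how the three components listed above compose is shown below. Module and class names are hypothetical; only the projector dimensions (1,024 → 2,048) are taken from the description above, and this is not the reference implementation.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """2-layer MLP mapping 1,024-d vision features to the 2,048-d LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_visual_tokens, vision_dim)
        return self.mlp(visual_tokens)


class VLMPipeline(nn.Module):
    """Vision encoder -> projector -> LLM token processor (all passed in as modules)."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.vision_encoder(pixel_values)   # (B, N_vis, 1024)
        visual_embeds = self.projector(visual_tokens)        # (B, N_vis, 2048)
        # Concatenate visual and text embeddings into a single token sequence for the LLM.
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(inputs)
```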
Spatial Scale-then-Compress: Multi-scale dynamic tiling extracts overlapping tiles at three scales, which multiplies the number of visual tokens per image before compression while preserving high-resolution detail. A 3×3 spatial-to-channel (STC) pooling step then merges each 3×3 block of patches into a single token, so each tile's token count shrinks by a factor of 9 (the 3×3 pooling window).
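A minimal sketch of what such a 3×3 spatial-to-channel fold could look like, assuming per-tile feature maps whose height and width are divisible by 3; the tensor sizes in the example are illustrative only.

```python
import torch


def spatial_to_channel(x: torch.Tensor, window: int = 3) -> torch.Tensor:
    """Fold each `window x window` spatial block into the channel dimension.

    x: (batch, height, width, channels) visual features for one tile.
    Returns (batch, height//window, width//window, channels * window**2),
    i.e. window**2 = 9x fewer spatial tokens per tile.
    """
    b, h, w, c = x.shape
    assert h % window == 0 and w % window == 0, "feature map must be divisible by the window"
    x = x.reshape(b, h // window, window, w // window, window, c)
    x = x.permute(0, 1, 3, 2, 4, 5)   # group each 3x3 block of patches together
    return x.reshape(b, h // window, w // window, c * window * window)


# Example: a 27x27 patch grid with 1,024-d features -> 9x9 grid with 9,216-d features.
tokens = spatial_to_channel(torch.randn(1, 27, 27, 1024))
print(tokens.shape)  # torch.Size([1, 9, 9, 9216])
```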
Temporal Scale-then-Compress: Inputs comprise 256 uniformly sampled frames per video (substantially extended from the 8 or 32 frames used in VILA-1.5). Each frame is tokenized by the vision encoder, and the resulting per-frame tokens are then average-pooled across every group of 8 consecutive frames, providing an 8× temporal compression ratio.
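A similarly hedged sketch of the temporal path, assuming 256 uniformly sampled frames and average pooling over groups of 8 consecutive frames; the per-frame token count (196) and video length are placeholders, not values from the paper.

```python
import torch


def sample_frame_indices(num_video_frames: int, num_samples: int = 256) -> torch.Tensor:
    """Uniformly spaced frame indices over the full video."""
    return torch.linspace(0, num_video_frames - 1, num_samples).long()


def temporal_average_pool(frame_tokens: torch.Tensor, group: int = 8) -> torch.Tensor:
    """frame_tokens: (num_frames, tokens_per_frame, dim); averages every `group` frames."""
    t, n, d = frame_tokens.shape
    assert t % group == 0, "frame count must be divisible by the pooling group"
    return frame_tokens.reshape(t // group, group, n, d).mean(dim=1)


idx = sample_frame_indices(num_video_frames=9_000)        # e.g. a 5-minute, 30 fps video
pooled = temporal_average_pool(torch.randn(256, 196, 1024))
print(pooled.shape)  # torch.Size([32, 196, 1024]) -> 8x fewer temporal tokens
```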
Final spatiotemporal processing feeds these tokens into the LLM for multimodal reasoning.
2. Training and Fine-tuning Regimen
The training procedure is staged to align the multi-modal components and maximize transfer:
- Projector alignment: Pretraining with LLaVA-CC3M.
- Visual encoder pretraining: Leveraging ALLaVA and document/OCR corpora.
- Joint pretraining: Using datasets such as COYO, ShareGPT4V, Docmatix, and MMC4.
- Image instruction-tuning: Approximately 10M mixed vision-text samples.
- Video instruction-tuning: LLaVA-Video-178K and a subset of LLaVA-OneVision.
Key training parameters:
- Optimizer: AdamW (no weight decay).
- Learning rate schedule: Cosine decay with 3% warmup; the peak learning rate is set per training stage.
- Batch size: 2,048 (distributed across 128× H100 GPUs).
- Mixed precision: FP8 (COAT) for weights and activations.
Compute cost is reduced by 4.5× compared to a baseline 7B VILA model, requiring approximately 90 H100 GPU-days (2,160 GPU-hours) for full pretraining, with video instruction-tuning consuming roughly 10,000 GPU-hours.
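The optimizer and schedule described above (AdamW without weight decay, cosine decay with 3% warmup) can be sketched as follows; the peak learning rate, step count, and stand-in model are placeholders rather than the paper's exact values.

```python
import math
import torch


def cosine_with_warmup(step: int, total_steps: int, warmup_frac: float = 0.03) -> float:
    """Multiplicative LR factor: linear warmup for 3% of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


model = torch.nn.Linear(16, 16)  # stand-in for the full VLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)
total_steps = 10_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: cosine_with_warmup(step, total_steps)
)

for step in range(total_steps):
    optimizer.zero_grad()
    # ... forward / backward on a multimodal batch would go here ...
    optimizer.step()
    scheduler.step()
```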
3. Computational Efficiency and Memory Footprint
NVILA-Video-15B incorporates multiple strategies to minimize compute and memory requirements:
- Token Compression: Aggressive spatiotemporal compression reduces the number of visual tokens by up to 25× spatially and up to 8× temporally.
- Training Acceleration: FP8 arithmetic and large batch sizes result in a 2.0–2.9× increase in throughput compared to BF16.
- Dataset Pruning: DeltaLoss pruning enables 50% data reduction with negligible accuracy loss, effectively doubling training speed.
- Fine-tuning Memory Reduction: By employing PEFT techniques (LoRA or LN+LoRA), peak fine-tuning memory decreases from 63.5 GB to 19–21 GB, a 3.4× reduction.
- Inference Quantization: Quantizing the vision tower (W8A8) and LLM (W4A16+FP16 accumulate) accelerates inference—time-to-first-token (TTFT) drops from 0.90s to 0.65s (1.4×), and decoding kernel throughput increases by 1.7×.
At inference, the prefilling stage (vision encoding plus LLM key/value cache) requires 8–12 GB on an NVIDIA RTX 4090; decoding needs only 4–6 GB.
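To illustrate the kind of transform behind the W8A8/W4A16 settings mentioned above, the sketch below performs generic per-output-channel symmetric int8 weight quantization. It is an assumption-laden illustration, not the quantization kernel or calibration procedure used by the paper.

```python
import torch


def quantize_weight_int8(weight: torch.Tensor):
    """weight: (out_features, in_features). Returns int8 weights plus one scale per output channel."""
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0                                   # symmetric, per-row scale
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize_weight(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an fp16 approximation of the original weight for matmul."""
    return q.to(torch.float16) * scale.to(torch.float16)


w = torch.randn(2048, 2048)
q, scale = quantize_weight_int8(w)
w_hat = dequantize_weight(q, scale)
print((w - w_hat.float()).abs().max())   # small per-element quantization error
```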
4. Inference Performance and Latency
Optimizations yield substantial improvements in real-world efficiency:
- Prefilling (vision processing + LLM initial state): 1.6–2.2× faster than Qwen2-VL-7B under vLLM serving.
- Decoding throughput: 1.2–2.8× higher with the quantized LLM weights.
This translates to lower operational costs for deployment and enables interactive large-scale multimodal applications with long video contexts.
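For reference, the two latency metrics quoted above, TTFT and decoding throughput, can be measured against any streaming generation interface with a small harness like the one below; `stream_generate` is a hypothetical callable used purely for illustration.

```python
import time


def measure_latency(stream_generate, prompt):
    """Returns (time-to-first-token in seconds, decode throughput in tokens/s).

    Assumes `stream_generate(prompt)` yields one decoded token at a time and
    produces at least one token.
    """
    start = time.perf_counter()
    first_token_time = None
    num_tokens = 0
    for _token in stream_generate(prompt):
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now          # first token marks the end of prefilling
        num_tokens += 1
    end = time.perf_counter()
    ttft = first_token_time - start
    decode_tps = (num_tokens - 1) / max(end - first_token_time, 1e-9)
    return ttft, decode_tps
```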
5. Video Understanding Benchmarks and Accuracy
Evaluation covers a wide spectrum of video QA and reasoning datasets. For NVILA-8B (the closest analog with full numbers reported), results include:
| Dataset (metric) | NVILA-8B (256 fr) | Qwen2-VL-7B | Evaluation Type |
|---|---|---|---|
| Video-MME (w/o subtitles) | 64.2% | 63.3% | top-1 acc. |
| Video-MME (with subtitles) | 70.0% | 69.0% | top-1 acc. |
Other evaluation datasets include ActivityNet-QA (accuracy), LongVideoBench (val/test score), MLVU/MVBench (m-avg/test), and NExT-QA (multiple-choice accuracy).
For NVILA-15B, the projected accuracy improvements are +1–2% absolute over the 8B variant across all tasks, suggesting parity with or marginal improvement over state-of-the-art proprietary models (including GPT-4o mini) (Liu et al., 5 Dec 2024).
6. Design Significance and Implications
The NVILA-Video-15B design demonstrates that token-level spatiotemporal compression, combined with compute- and memory-optimized training and inference paradigms, can deliver high-accuracy VLMs with an order-of-magnitude greater resource efficiency. The scale-then-compress strategy allows for high-resolution and long-context video understanding without a linear explosion in computational requirements. Dataset pruning and mixed-precision arithmetic further substantiate that carefully curated and processed data, along with hardware-aware model design, are critical for sustainable scaling in future multimodal AI systems.
A plausible implication is that similar architectural and algorithmic advances could generalize to other large-scale multimodal domains where token explosion constrains practical deployment. The explicit separation of design for training, fine-tuning, and inference efficiency foregrounds the increasing importance of model lifecycle optimization for large backbone VLMs.