
NVILA-Video-15B: Efficient High-Res Video VLM

Updated 13 December 2025
  • The paper introduces NVILA-Video-15B, a VLM that uses a scale-then-compress strategy to drastically reduce spatial and temporal token counts while preserving high-resolution detail.
  • It integrates a SigLIP-based vision encoder with a Qwen2-15B language backbone, achieving near state-of-the-art performance on video question answering benchmarks.
  • High efficiency is achieved through aggressive token compression and optimized training (e.g., FP8 arithmetic, dataset pruning), reducing compute cost by up to 4.5×.

NVILA-Video-15B is a large-scale open vision-language model (VLM) designed to maximize both efficiency and accuracy in high-resolution video and image understanding. Built on the architectural foundation of VILA, it employs a "scale-then-compress" paradigm for spatial and temporal input, coupled with a suite of optimization techniques spanning model design, training, inference, and fine-tuning. It integrates a SigLIP-based vision encoder with Qwen2-15B as its language backbone and is systematically engineered to deliver near state-of-the-art performance on video question answering (QA) benchmarks while drastically reducing computational requirements compared to previous VLMs (Liu et al., 5 Dec 2024).

1. Model Architecture

NVILA-Video-15B uses a multi-stage pipeline for visual input processing, combining a vision encoder, a projector, and an LLM token processor.

  • Vision Encoder: SigLIP with a ViT-L backbone, 24 Transformer layers, 1,024 hidden dimension, and 16 attention heads, pretrained at $448^2$ resolution.
  • Projector: A 2-layer MLP that maps the visual encoder's 1,024-dimensional output to the 2,048-dimensional embedding space of the LLM.
  • LLM (Token Processor): Qwen2-15B, comprising 48 Transformer layers, 12,288 hidden dimension, 96 attention heads, and accommodating a context length of up to 3,072 tokens (aggregating image, video, and text tokens).

The architecture enables high-resolution and long-context input via scale-then-compress mechanisms for both spatial and temporal domains:

| Component | Details | Token/Param Counts |
|---|---|---|
| Vision Encoder | SigLIP (ViT-L, 24 layers, 1,024-d, 16 heads) | ≈307 M params |
| Projector | 2-layer MLP (1,024-d → 2,048-d) | ≈8 M params |
| LLM Processor | Qwen2-15B (48 layers, 12,288-d, 96 heads, 3,072-token context window) | 15 B params |
| Total Parameters | — | ≈15.3 B |
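
For concreteness, the following minimal PyTorch sketch shows the projector stage, the simplest of the three components. Only the 1,024 → 2,048 mapping and the 2-layer MLP structure come from the description above; the class name, the GELU activation, and the toy token count are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """2-layer MLP mapping 1,024-d vision features into the LLM embedding space.

    The 2,048-d output follows the figures quoted in this article; the actual
    checkpoint may differ in width and activation choice.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_visual_tokens, vision_dim)
        return self.net(vision_tokens)

# Toy usage: 121 compressed visual tokens per image, batch of 2.
projector = MultimodalProjector()
visual_features = torch.randn(2, 121, 1024)
llm_inputs = projector(visual_features)
print(llm_inputs.shape)  # torch.Size([2, 121, 2048])
```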

Spatial Scale-then-Compress: Multi-scale dynamic tiling extracts overlapping $448^2$ tiles at three scales (e.g., $448^2$, $896^2$, $1{,}344^2$). This yields $N_{\mathrm{in}} \approx 2{,}304$–$3{,}072$ tokens per image before compression. A 3×3 spatial-to-channel (STC) pool then reduces each $16 \times 16$ grid of visual patches to $11 \times 11$, giving $N_{\mathrm{out}} \approx 121$ tokens per image. The spatial compression ratio is:

$$R_s = \frac{N_{\mathrm{vis\_in}}}{N_{\mathrm{vis\_out}}} \approx \frac{16^2 \cdot S}{11^2} \approx 19\text{--}25\times,$$

where $S$ denotes the number of extracted tiles per image.
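
As a concrete illustration of the spatial arithmetic above, the sketch below implements a generic spatial-to-channel pooling in PyTorch and reproduces the quoted token counts. The helper name is ours, and the pooling kernel k = 2 in the usage example is illustrative; the exact NVILA pooling configuration is not reproduced here.

```python
import torch

def spatial_to_channel(tokens: torch.Tensor, grid: int, k: int) -> torch.Tensor:
    """Fold k x k neighborhoods of patch tokens into the channel dimension.

    tokens: (batch, grid*grid, dim) patch tokens in row-major order.
    Returns (batch, (grid//k)**2, dim*k*k): k**2 fewer tokens, wider channels.
    """
    b, n, d = tokens.shape
    assert n == grid * grid and grid % k == 0
    x = tokens.view(b, grid // k, k, grid // k, k, d)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid // k) ** 2, d * k * k)
    return x

# Illustrative pooling with k = 2 on a 16x16 patch grid.
patch_tokens = torch.randn(1, 16 * 16, 1024)
pooled = spatial_to_channel(patch_tokens, grid=16, k=2)
print(patch_tokens.shape, "->", pooled.shape)  # (1, 256, 1024) -> (1, 64, 4096)

# Token-count arithmetic quoted above: S tiles of 16x16 patches in, ~121 tokens out.
for s in (9, 12):
    n_in, n_out = 16 * 16 * s, 11 * 11
    print(f"S={s}: {n_in} -> {n_out} tokens (~{n_in / n_out:.0f}x spatial compression)")
```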

Temporal Scale-then-Compress: Inputs comprise 256 uniformly sampled frames per video (substantially extended from the 8 or 32 frames used in VILA-1.5). Initial tokenization ($16^2$ patches per frame) yields $N_{\mathrm{vid\_in}} = 16^2 \times 256 = 65{,}536$ tokens. Average-pooling every 8 frames compresses this to $N_{\mathrm{vid\_out}} = 8{,}192$ tokens, providing:

$$R_t = \frac{F_{\mathrm{in}}}{F_{\mathrm{out}}} = \frac{256}{32} = 8\times$$
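
A minimal sketch of this temporal compression step follows, assuming plain average pooling of token embeddings over consecutive groups of 8 frames; the tensor shapes mirror the frame and token counts quoted above, and the function name is ours.

```python
import torch

def temporal_average_pool(frame_tokens: torch.Tensor, group: int = 8) -> torch.Tensor:
    """Average token embeddings over consecutive groups of `group` frames.

    frame_tokens: (batch, frames, tokens_per_frame, dim)
    Returns (batch, frames // group, tokens_per_frame, dim).
    """
    b, f, t, d = frame_tokens.shape
    assert f % group == 0, "frame count must be divisible by the pooling group"
    return frame_tokens.view(b, f // group, group, t, d).mean(dim=2)

# 256 sampled frames with 16*16 = 256 patch tokens each -> 65,536 tokens in;
# 256 / 8 = 32 effective frames -> 8,192 tokens out, an 8x temporal reduction.
video = torch.randn(1, 256, 16 * 16, 1024)
pooled = temporal_average_pool(video, group=8)
print(video.shape[1] * video.shape[2], "->", pooled.shape[1] * pooled.shape[2])  # 65536 -> 8192
```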

Final spatiotemporal processing feeds these tokens into the LLM for multimodal reasoning.

2. Training and Fine-tuning Regimen

The training procedure is staged to align the multi-modal components and maximize transfer:

  1. Projector alignment: Pretraining with LLaVA-CC3M.
  2. Visual encoder pretraining: Leveraging ALLAVA and document/OCR corpora.
  3. Joint pretraining: Using datasets such as COYO, ShareGPT4V, Docmatix, and MMC4.
  4. Image instruction-tuning: Approximately 10M mixed vision-text samples.
  5. Video instruction-tuning: LLaVA-Video-178K and a subset of LLaVA-OneVision.

Key training parameters:

  • Optimizer: AdamW (no weight decay).
  • Learning rate schedule: Cosine decay with 3% warmup; peak LR is stage-specific (e.g., Stage 2 at $5 \times 10^{-5}$, Stage 5 at $2 \times 10^{-5}$); a configuration sketch follows this list.
  • Batch size: 2,048 (distributed across 128× H100 GPUs).
  • Mixed precision: FP8 (COAT) for weights and activations.
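
The staged recipe and hyperparameters above can be collected into a small configuration sketch. The field names are ours, and learning rates not quoted in the text are deliberately left unset rather than guessed.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class StageConfig:
    name: str
    datasets: tuple[str, ...]
    peak_lr: float | None  # None where the article does not quote a value

# Five-stage recipe as described in the text.
STAGES = (
    StageConfig("projector_alignment", ("LLaVA-CC3M",), None),
    StageConfig("vision_encoder_pretraining", ("ALLAVA", "document/OCR corpora"), 5e-5),
    StageConfig("joint_pretraining", ("COYO", "ShareGPT4V", "Docmatix", "MMC4"), None),
    StageConfig("image_instruction_tuning", ("~10M mixed vision-text samples",), None),
    StageConfig("video_instruction_tuning", ("LLaVA-Video-178K", "LLaVA-OneVision subset"), 2e-5),
)

OPTIMIZER = {"type": "AdamW", "weight_decay": 0.0}
SCHEDULE = {"type": "cosine", "warmup_ratio": 0.03}
GLOBAL_BATCH_SIZE = 2048   # distributed across 128 H100 GPUs
PRECISION = "fp8"          # COAT-style FP8 for weights and activations

if __name__ == "__main__":
    for i, stage in enumerate(STAGES, start=1):
        lr = "unspecified" if stage.peak_lr is None else f"{stage.peak_lr:g}"
        print(f"Stage {i}: {stage.name}, peak LR {lr}, data: {', '.join(stage.datasets)}")
```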

Compute cost is reduced by 4.5× compared to a baseline 7B VILA model, requiring approximately 90 H100 GPU-days (2,160 GPU-hrs) for full pretraining, with video instruction-tuning consuming ~10,000 GPU-hrs.

3. Computational Efficiency and Memory Footprint

NVILA-Video-15B incorporates multiple strategies to minimize compute and memory requirements:

  • Token Compression: Aggressive spatiotemporal compression reduces the number of visual tokens by up to 25× spatially and up to 8× temporally.
  • Training Acceleration: FP8 arithmetic and large batch sizes result in a 2.0–2.9× increase in throughput compared to BF16.
  • Dataset Pruning: DeltaLoss pruning enables 50% data reduction with negligible accuracy loss, effectively doubling training speed.
  • Fine-tuning Memory Reduction: By employing PEFT techniques (LoRA or LN+LoRA; see the sketch after this list), peak fine-tuning memory decreases from 63.5 GB to 19–21 GB, a 3.4× reduction.
  • Inference Quantization: Quantizing the vision tower (W8A8) and LLM (W4A16+FP16 accumulate) accelerates inference—time-to-first-token (TTFT) drops from 0.90s to 0.65s (1.4×), and decoding kernel throughput increases by 1.7×.
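
Below is a minimal LoRA / LN+LoRA fine-tuning setup using the Hugging Face PEFT library, as referenced in the fine-tuning bullet above. The base checkpoint is a small public Qwen2 stand-in rather than the NVILA-Video-15B backbone, and the rank, alpha, and target modules are illustrative choices, not the paper's settings.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

BASE_MODEL = "Qwen/Qwen2-0.5B-Instruct"  # small stand-in for the (much larger) Qwen2 backbone

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# "LN + LoRA" variant: additionally unfreeze the (cheap) LayerNorm/RMSNorm weights.
for name, param in model.named_parameters():
    if "norm" in name:
        param.requires_grad = True

model.print_trainable_parameters()  # only a small fraction of the backbone weights train
```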

At inference, the prefilling stage (vision encoding plus LLM key/value cache) requires 8–12 GB on an NVIDIA RTX 4090; decoding needs only 4–6 GB.

4. Inference Performance and Latency

Optimizations yield substantial improvements in real-world efficiency:

  • Prefilling (vision processing + LLM initial state): 1.6–2.2× faster than Qwen2-VL-7B under vLLM serving.
  • Decoding throughput: 1.2–2.8× higher with the quantized LLM weights.

This translates to lower operational costs for deployment and enables interactive large-scale multimodal applications with long video contexts.
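
A small harness for measuring the two latency quantities discussed in this section, time-to-first-token and decode throughput, is sketched below. The streaming `generate_stream` callable is a hypothetical stand-in for whichever serving stack (e.g., vLLM) is actually used; the fake generator exists only so the sketch runs standalone.

```python
import time
from typing import Callable, Iterable

def measure_latency(generate_stream: Callable[[str], Iterable[str]], prompt: str):
    """Return (TTFT seconds, decode tokens/second) for one streamed generation."""
    start = time.perf_counter()
    ttft, n_tokens = None, 0
    for _ in generate_stream(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # prefill + first decode step
        n_tokens += 1
    total = time.perf_counter() - start
    decode_tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else float("nan")
    return ttft, decode_tps

# Toy usage with a fake streaming generator.
def fake_stream(prompt: str):
    time.sleep(0.05)           # pretend prefill
    for tok in prompt.split():
        time.sleep(0.005)      # pretend per-token decode
        yield tok

ttft, tps = measure_latency(fake_stream, "how many people appear in the video ?")
print(f"TTFT: {ttft * 1000:.1f} ms, decode: {tps:.1f} tok/s")
```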

5. Video Understanding Benchmarks and Accuracy

Evaluation covers a wide spectrum of video QA and reasoning datasets. For NVILA-8B (the closest analog with full numbers reported), results include:

| Dataset (metric) | NVILA-8B (256 frames) | Qwen2-VL-7B | Evaluation Type |
|---|---|---|---|
| Video-MME (without subtitles) | 64.2% | 63.3% | top-1 acc. |
| Video-MME (with subtitles) | 70.0% | 69.0% | top-1 acc. |

Other evaluation datasets include ActivityNet-QA (accuracy), LongVideoBench (val/test score), MLVU/MVBench (m-avg/test), and NExT-QA (multiple-choice accuracy).
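
For reference, the top-1 accuracy used by the multiple-choice benchmarks above reduces to a simple exact-match rate over option letters; the prediction/answer format assumed here is illustrative rather than any benchmark's official scorer.

```python
def top1_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the predicted option letter matches the key."""
    assert predictions and len(predictions) == len(answers)
    correct = sum(p.strip().upper() == a.strip().upper() for p, a in zip(predictions, answers))
    return correct / len(predictions)

print(top1_accuracy(["A", "c", "B"], ["A", "C", "D"]))  # 0.666...
```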

For NVILA-15B, the projected accuracy improvements are +1–2% absolute over the 8B variant across all tasks, suggesting parity with or marginal improvement over state-of-the-art proprietary models (including GPT-4o mini) (Liu et al., 5 Dec 2024).

6. Design Significance and Implications

The NVILA-Video-15B design demonstrates that token-level spatiotemporal compression, combined with compute- and memory-optimized training and inference paradigms, can deliver high-accuracy VLMs with an order-of-magnitude greater resource efficiency. The scale-then-compress strategy allows for high-resolution and long-context video understanding without a linear explosion in computational requirements. Dataset pruning and mixed-precision arithmetic further substantiate that carefully curated and processed data, along with hardware-aware model design, are critical for sustainable scaling in future multimodal AI systems.

A plausible implication is that similar architectural and algorithmic advances could generalize to other large-scale multimodal domains where token explosion constrains practical deployment. The explicit separation of design for training, fine-tuning, and inference efficiency foregrounds the increasing importance of model lifecycle optimization for large backbone VLMs.

References

  1. Liu et al. "NVILA: Efficient Frontier Visual Language Models." arXiv preprint, 5 Dec 2024.
