LLaVA-OneVision-2: Unified Vision-Language Model

Updated 26 May 2026
  • LLaVA-OneVision-2 is a unified vision-language model introducing codec-stream tokenization and adaptive windowed attention for enhanced spatiotemporal precision.
  • It leverages a OneVision-Encoder with 3D rotary positional encoding, efficiently processing static images, video frames, and codec-stream videos within a single Transformer architecture.
  • Empirical evaluations, including the JumpScore benchmark, show gains over uniform frame sampling under matched visual-token budgets (e.g., +22.4 mAP on JumpScore at a 128-frame budget).

LLaVA-OneVision-2 (LLaVA-OV-2) is a next-generation codec-aligned vision-LLM that unifies high-resolution spatial reasoning, long-horizon video understanding, and fine-grained temporal grounding within a single Transformer-based architecture. Building on the LLaVA-OneVision series, LLaVA-OV-2 introduces codec-stream tokenization, adaptive windowed attention, and a unified spatiotemporal positional encoding, demonstrating state-of-the-art performance across a broad array of multimodal benchmarks, including the newly proposed JumpScore for high-frequency temporal grounding (An et al., 25 May 2026).

1. Model Architecture

The core of LLaVA-OV-2 is the OneVision-Encoder, a dynamic-resolution Vision Transformer supporting three input types via a unified patch-token interface:

  • Static images (single temporal group)
  • Uniformly sampled frame sequences (fixed four-slot IPPP groups)
  • Codec-stream videos (variable-length "GOP" groups of I- and P-canvases)

All inputs are partitioned into $14 \times 14$ pixel patches, which are merged into $2 \times 2$ blocks within the encoder. Each patch embedding is augmented with 3D rotary positional encodings (3D RoPE), and a group-visible attention mask restricts attention to tokens within the same group, isolating distinct temporal segments from one another.
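As a rough illustration of the two ideas in this paragraph, the sketch below counts merged tokens for one canvas and builds a group-visible mask from per-token group IDs. The function names and the assumption that image sides are exact multiples of 28 pixels are hypothetical; the source does not say how other sizes are padded or resized.

```python
def merged_token_count(height, width, patch=14, merge=2):
    """Tokens left after 14x14 patching and 2x2 merging (assumes exact divisibility)."""
    step = patch * merge
    if height % step or width % step:
        raise ValueError("this sketch assumes sides divisible by patch * merge")
    return (height // step) * (width // step)


def group_visible_mask(group_ids):
    """mask[i][j] is True when token i may attend to token j (same group only)."""
    return [[gi == gj for gj in group_ids] for gi in group_ids]


# Hypothetical input: two tokens in group 0, three tokens in group 1.
mask = group_visible_mask([0, 0, 1, 1, 1])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))

print(merged_token_count(448, 448))  # 32 x 32 patches -> 16 x 16 merged tokens
```

The printed mask is block-diagonal: each group forms its own square of allowed connections and nothing crosses between groups.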

Most encoder layers employ local window self-attention, limiting each patch to attend to a $W \times W$ spatial neighborhood. Layerwise window shifts (analogous to Swin Transformers) enable global spatial communication without incurring quadratic computational cost in spatial resolution. Attention for position $i$ and head/layer $\ell$ takes the form:

$$A_\ell^{ij} = \text{softmax}\left(\frac{Q_\ell^i K_\ell^j + B_\ell^{ij}}{\sqrt{d}}\right), \quad j \in \text{Window}(i)$$

where $B_\ell^{ij}$ is a relative position bias and $\text{Window}(i)$ selects spatially local indices.
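The windowed attention formula can be sketched for a single query position as follows. This is a simplified illustration, not the model's implementation: it assumes non-overlapping square tiles, a plain (non-cyclic) shift, a single head, and hypothetical helper names.

```python
import math


def window_indices(row, col, grid_h, grid_w, w, shift=0):
    """Grid positions sharing a w x w tile with (row, col); `shift` offsets the tiling."""
    tile_r, tile_c = (row + shift) // w, (col + shift) // w
    return [
        (r, c)
        for r in range(grid_h)
        for c in range(grid_w)
        if (r + shift) // w == tile_r and (c + shift) // w == tile_c
    ]


def window_attention_weights(q, keys, bias, d):
    """softmax over j of (q . k_j + b_j) / sqrt(d), restricted to the window."""
    logits = [
        (sum(a * b for a, b in zip(q, k)) + b_j) / math.sqrt(d)
        for k, b_j in zip(keys, bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


plain = window_indices(1, 1, 4, 4, 2)             # tile covering rows 0-1, cols 0-1
shifted = window_indices(1, 1, 4, 4, 2, shift=1)  # tile covering rows 1-2, cols 1-2

# Toy 2-dimensional query and keys for the four positions in the window, zero bias.
weights = window_attention_weights(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]], [0.0] * 4, d=2
)
```

Alternating `shift` between layers is what lets position (1, 1) exchange information with neighbours that fall in a different tile under the unshifted partition.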

2. Codec-Stream Tokenization

LLaVA-OV-2's major innovation is codec-stream tokenization, which leverages video codec signals to partition the temporal axis and to select spatial evidence patches, centering token allocation around eventful content.

The workflow consists of:

  • Continuous bit-cost stream: The compressed video timeline is split into $B$ equal bins; the aggregate P/B-frame bit-cost per bin, $e_b$, is computed as a proxy for motion/event intensity. From these per-bin costs, an average bit-cost quota per adaptive group of pictures (GOP) is defined.

  • Adaptive GOP grouping: GOP boundaries are set dynamically. A group closes either when it reaches a maximal temporal span, or once it exceeds a minimum span and its cumulative bit-cost passes a threshold. Additional refinement via "valley search" ensures boundaries align with local bit-cost minima.
  • Motion-residual spatial scoring: For each P-frame, motion magnitude and luma residual yield a normalized spatial saliency map.
  • 2×2 block selection and scoring: Each $2 \times 2$ block's score aggregates saliency and bit-cost, ranking blocks for inclusion in the visual token set.
  • Stratified P-canvas allocation: Within each adaptive GOP, candidate blocks are ranked. Frame- and block-wise ranks, together with attenuation for repeated frames, produce a weight for each frame. P-canvas slots then sample from frames via a cumulative allocation curve, which prevents single frames from monopolizing all token slots.
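The adaptive GOP grouping step can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' algorithm: the quota is taken to be total bit-cost divided by a hypothetical target group count, the minimum-span test uses `>=`, and the "valley search" refinement is omitted entirely.

```python
def adaptive_gop_boundaries(bit_costs, quota, min_span, max_span):
    """Group consecutive bins; returns (start, end) pairs with end exclusive."""
    groups = []
    start, acc = 0, 0.0
    for b, cost in enumerate(bit_costs):
        acc += cost
        span = b - start + 1
        # Close on the hard cap, or once the group is long enough AND costly enough.
        if span >= max_span or (span >= min_span and acc >= quota):
            groups.append((start, b + 1))
            start, acc = b + 1, 0.0
    if start < len(bit_costs):
        groups.append((start, len(bit_costs)))  # trailing partial group
    return groups


# Hypothetical per-bin bit-costs: a quiet stretch, a burst of motion, then quiet again.
bit_costs = [1, 1, 1, 1, 1, 1, 8, 8, 1, 1, 1, 1]
target_groups = 4                         # assumed knob, not from the source
quota = sum(bit_costs) / target_groups    # 26 / 4 = 6.5
groups = adaptive_gop_boundaries(bit_costs, quota, min_span=2, max_span=6)
print(groups)  # [(0, 6), (6, 8), (8, 12)]
```

In this toy input the quiet opening runs until the span cap, while the two high-cost bins form a short group of their own, which is the content-aligned behaviour the list describes.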

A unified visual-token interface (canvas index, source-frame, coordinates, and group ID) standardizes multi-modality input, allowing seamless handling by shared encoder parameters.
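One way to picture this interface is as a single record type that every input path emits. The field names and constructor helpers below are hypothetical; only the four kinds of information (canvas index, source frame, coordinates, group ID) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualToken:
    canvas_index: int  # which canvas (image, frame, or I/P-canvas) holds the token
    source_frame: int  # video frame the patch was taken from
    row: int           # patch-grid coordinates on the canvas
    col: int
    group_id: int      # temporal group used by the group-visible attention mask


def image_tokens(grid_h, grid_w):
    """A static image: one canvas, one temporal group."""
    return [VisualToken(0, 0, r, c, 0) for r in range(grid_h) for c in range(grid_w)]


def uniform_frame_tokens(num_frames, grid_h, grid_w, frames_per_group=4):
    """Uniformly sampled frames packed into fixed four-slot groups."""
    return [
        VisualToken(f, f, r, c, f // frames_per_group)
        for f in range(num_frames)
        for r in range(grid_h)
        for c in range(grid_w)
    ]


image = image_tokens(2, 2)
video = uniform_frame_tokens(8, 2, 2)
```

A codec-stream path would emit the same record, with `source_frame` and `canvas_index` diverging because several frames can contribute patches to one P-canvas.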

3. Spatiotemporal Positional Encoding

The 3D Rotary Positional Encoding (3D RoPE) mechanism generalizes standard 1D RoPE to a three-axis (temporal, height, width) coordinate grid. The embedding dimension is split equally across the three axes, and each sub-vector is rotated independently by a sinusoidal rotation determined by the token's position along the corresponding axis. As with 1D RoPE, this makes attention sensitive to spatiotemporal offsets between tokens, across all modalities. This shared encoding harmonizes sampling regimes (e.g., images, frames, codec canvases) within a unified coordinate system.
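A minimal sketch of a three-axis rotary encoding follows, assuming the usual RoPE frequency schedule with base 10000 and adjacent-pair rotation. Neither choice is confirmed by the source, and the function names are illustrative.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angle pos * base**(-k / num_pairs)."""
    num_pairs = len(vec) // 2
    out = []
    for k in range(num_pairs):
        theta = pos * base ** (-k / num_pairs)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def rope_3d(vec, t, h, w):
    """Split `vec` into three equal chunks; rotate them by t, h, and w respectively."""
    if len(vec) % 6:
        raise ValueError("dimension must split into three even-sized chunks")
    n = len(vec) // 3
    return rope_1d(vec[:n], t) + rope_1d(vec[n:2 * n], h) + rope_1d(vec[2 * n:], w)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.1, 0.9, -0.4, 1.1, 0.2, -0.6, 0.5, 0.8, -0.3]
k = [1.0, 0.4, -0.2, 0.6, 0.3, 0.9, -0.7, 0.1, 0.2, -0.5, 0.4, 1.3]

# Same (t, h, w) offset of (1, 2, 3) at two different absolute positions.
near = dot(rope_3d(q, 1, 2, 3), rope_3d(k, 0, 0, 0))
far = dot(rope_3d(q, 5, 7, 9), rope_3d(k, 4, 5, 6))
```

`near` and `far` agree up to floating-point error, showing that the query-key score depends only on the relative offset along each axis, which is the property that lets one encoding serve images, sampled frames, and codec canvases.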

4. Training Strategy and Data

LLaVA-OV-2's data pipeline comprises large-scale open supervision spanning:

  • Pretraining (8M video-caption pairs, ~104B tokens): Clips are stratified by length: under 30s (4.2M), 30–60s (2.7M), 60–180s (0.7M), and 10–15 min (0.35M, consumed with both 384- and 768-frame budgets under codec-stream).
  • Spatial fine-tuning (4M samples): Supervision comprises structured 2D/3D QA (counting, direction, distances, ordering), embodied trajectories, web frame labels, and point-and-track tasks from Molmo2.
  • Curriculum stages: Training advances through four stages that progressively increase context length and instruction complexity while shifting from frame-sampled to codec-stream-dominant batches. Stage 4 introduces the codec-stream paradigm, adds in-depth spatial QA, and scales context to hour-length video.

Batch composition is dynamically balanced (approx. 50% codec-stream, 37.5% uniform video, 12.5% images), and the OneVision-Encoder's parameter sharing guarantees seamless multi-modality integration.
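To make the stated proportions concrete, the sketch below converts them into per-batch sample counts. The rounding rule (any remainder goes to the largest share) and the names are assumptions; the source describes the mix only as approximate and dynamically balanced.

```python
# Approximate shares quoted above.
MIX = {"codec_stream": 0.5, "uniform_video": 0.375, "image": 0.125}


def batch_counts(batch_size, mix=MIX):
    """Integer sample counts per input type that sum to batch_size."""
    counts = {name: int(batch_size * share) for name, share in mix.items()}
    # Hand any rounding remainder to the largest share.
    counts[max(mix, key=mix.get)] += batch_size - sum(counts.values())
    return counts


print(batch_counts(16))  # {'codec_stream': 8, 'uniform_video': 6, 'image': 2}
```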

5. JumpScore Benchmark

JumpScore is introduced to evaluate fine-grained grounding in repetitive, high-frequency motion, exemplified by "jump rope" events:

  • 189 annotated in-the-wild videos, 30–90s each, median motion cycle 0.4s.
  • Decimal-second annotations of each rope-behind-legs event.
  • Prompt input: "List the start timestamps... rope behind legs."
  • Metric: mean average precision (mAP), computed over a set of temporal tolerances measured in seconds.
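The core of a tolerance-based timestamp metric is deciding which predicted timestamps count as hits. The sketch below shows greedy one-to-one matching plus precision and recall at a given tolerance. It is a simplification and not the benchmark's official mAP protocol; the tolerance values and timestamps are made up for illustration.

```python
def match_within_tolerance(predicted, truth, tol):
    """Greedy one-to-one matching of sorted timestamps; returns the hit count."""
    predicted, truth = sorted(predicted), sorted(truth)
    hits, j = 0, 0
    for p in predicted:
        # Skip ground-truth events too early to match this or any later prediction.
        while j < len(truth) and truth[j] < p - tol:
            j += 1
        if j < len(truth) and abs(truth[j] - p) <= tol:
            hits += 1
            j += 1  # each ground-truth event can be matched only once
    return hits


def precision_recall(predicted, truth, tol):
    hits = match_within_tolerance(predicted, truth, tol)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical events roughly 0.4s apart, echoing the median cycle length above.
truth = [1.0, 1.4, 1.8, 2.2]
predicted = [1.05, 1.47, 2.6]
tight = precision_recall(predicted, truth, tol=0.1)
loose = precision_recall(predicted, truth, tol=0.5)
```

With the tight tolerance two of the three predictions match; with the loose one all three do. Averaging a precision-style score over several tolerances rewards models whose timestamps are accurate at the sub-cycle level rather than merely in the right neighbourhood.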

Empirical results under matched visual-token budgets:

  • Uniform 128 frames: mAP ≈ 52.5
  • Codec-stream 128 GOP-aligned canvases: mAP ≈ 74.9 (Δ +22.4)
  • Average gain across budgets: +17.3 points

This demonstrates the codec-stream approach's effectiveness at allocating tokens to dynamic events with precise temporal boundaries, surpassing fixed-slot uniform frame sampling.

6. Empirical Performance

LLaVA-OV-2-8B outperforms Qwen3-VL-8B in each reported category (LLaVA-OV-2-8B score listed first):

  • Video understanding (18 tasks): 62.5 vs 58.2 (+4.3)
  • Spatial reasoning (11 tasks): 63.5 vs 58.2 (+5.3)
  • Tracking (J∩F avg, 4 tasks): 48.0 vs 32.4 (+15.6)
  • JumpScore mAP: 74.9 vs 30.1 (+44.8)

On temporal grounding benchmarks (Charades, ANet, QVHighlights) with matched token budgets, codec-stream input achieves +9.7 absolute mAP over uniform sampling.

| Benchmark Category | LLaVA-OV-2-8B | Qwen3-VL-8B | Advantage |
|---|---|---|---|
| Video Understanding | 62.5 | 58.2 | +4.3 |
| Spatial Reasoning | 63.5 | 58.2 | +5.3 |
| Tracking (J∩F avg) | 48.0 | 32.4 | +15.6 |
| JumpScore mAP | 74.9 | 30.1 | +44.8 |

7. Innovations and Future Directions

Key drivers of LLaVA-OV-2’s empirical gains include:

  • Codec-stream tokenization: Aligns visual tokens to perceptual transitions, supporting stable long-video compression without complex routing or inference overhead.
  • Adaptive, content-sensitive GOP partitioning: Utilizes bit-cost quotas and valley search to form temporally variable, content-aligned groups.
  • Motion-residual block selection: Balances intra-frame saliency with temporal event coverage.
  • Unified processing: Shared group-visible attention masks and 3D RoPE enable joint handling of images, sampled frames, and codec canvases.

Data-centric practices (length-stratified video-caption pretraining, extensive spatial-task fine-tuning, and progressive, modality-mixed curricula) are critical to both long-horizon understanding and high-fidelity spatial competency.

Anticipated research extensions include streaming perception (incremental inference with codec alignment), hierarchical memory models for ultra-long context retention, reinforcement-style finetuning for precise temporal grounding, 3D scene-graph and physics integration, and cross-modal retrieval/multimodal dialog leveraging the compressed stream.

LLaVA-OneVision-2 establishes that treating compressed video as a continuous, bitstream-driven perceptual signal, rather than as a sequence of uniformly sampled frames, is fundamental to advancing robust, scalable multimodal perceptual intelligence (An et al., 25 May 2026).
