LLaVA-OneVision-2: Unified Vision-Language Model

Updated 26 May 2026

LLaVA-OneVision-2 is a unified vision-language model introducing codec-stream tokenization and adaptive windowed attention for enhanced spatiotemporal precision.
It leverages a OneVision-Encoder with 3D rotary positional encoding, efficiently processing static images, video frames, and codec-stream videos within a single Transformer architecture.
Empirical evaluations, including the JumpScore benchmark, show consistent gains over uniform frame sampling under matched visual-token budgets.

LLaVA-OneVision-2 (LLaVA-OV-2) is a next-generation codec-aligned vision-LLM that unifies high-resolution spatial reasoning, long-horizon video understanding, and fine-grained temporal grounding within a single Transformer-based architecture. Building on the LLaVA-OneVision series, LLaVA-OV-2 introduces codec-stream tokenization, adaptive windowed attention, and a unified spatiotemporal positional encoding, demonstrating state-of-the-art performance across a broad array of multimodal benchmarks, including the newly proposed JumpScore for high-frequency temporal grounding (An et al., 25 May 2026).

1. Model Architecture

The core of LLaVA-OV-2 is the OneVision-Encoder, a dynamic-resolution Vision Transformer supporting three input types via a unified patch-token interface:

Static images (single temporal group)
Uniformly sampled frame sequences (fixed four-slot IPPP groups)
Codec-stream videos (variable-length "GOP" groups of I- and P-canvases)

All inputs are partitioned into $14 \times 14$-pixel patches, which are merged into $2 \times 2$ blocks within the encoder. Each patch embedding is augmented with a 3D rotary positional encoding (3D RoPE), and a group-visible attention mask restricts attention to tokens of the same group, isolating distinct temporal segments from one another.
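The group-visible mask can be illustrated with a minimal sketch. The function name `group_visible_mask` and the toy token layout are assumptions made for illustration; the source does not describe the actual implementation.

```python
def group_visible_mask(group_ids):
    """Return a boolean matrix where mask[i][j] is True iff tokens i and j
    carry the same group ID (hypothetical helper, not from the paper)."""
    return [[gi == gj for gj in group_ids] for gi in group_ids]


# Toy layout (assumed): two tokens from a static image (group 0) followed by
# three tokens from one temporal group of a video (group 1).
group_ids = [0, 0, 1, 1, 1]
mask = group_visible_mask(group_ids)

for row in mask:
    # "x" marks an allowed attention edge, "." a blocked one.
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is block-diagonal: tokens attend only within their own group, which is the isolation property described above.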

Most encoder layers employ local window self-attention, limiting each patch to attend to a $W \times W$ spatial neighborhood. Layerwise window shifts (analogous to Swin Transformers) enable global spatial communication without incurring quadratic computational cost in spatial resolution. Attention for position $i$ and head/layer $\ell$ takes the form:

$A_\ell^{ij} = \text{softmax}_{j}\left(\frac{Q_\ell^i \cdot K_\ell^j + B_\ell^{ij}}{\sqrt{d}}\right), \quad j \in \text{Window}(i)$

where $B_\ell^{ij}$ is a relative position bias and $\text{Window}(i)$ selects spatially local indices.
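As a rough sketch of the two ingredients above, the following code computes which tokens share a window (with an optional cyclic shift, in the spirit of Swin-style shifted windows) and evaluates the attention weights for one query over its window. The helper names, the cyclic-shift convention, and the requirement that the grid size be divisible by the window size are all assumptions for illustration, not details confirmed by the source.

```python
import math


def window_indices(i, height, width, win, shift=0):
    """Row-major indices of tokens sharing a win x win window with token i
    on a height x width grid, after a cyclic shift of `shift` positions.
    Assumes height and width are divisible by win."""
    r, c = divmod(i, width)
    wr = ((r + shift) % height) // win
    wc = ((c + shift) % width) // win
    members = []
    for j in range(height * width):
        rj, cj = divmod(j, width)
        same_row_band = ((rj + shift) % height) // win == wr
        same_col_band = ((cj + shift) % width) // win == wc
        if same_row_band and same_col_band:
            members.append(j)
    return members


def attention_weights(q, keys, bias):
    """softmax over j of (q . k_j + b_j) / sqrt(d), restricted to one window."""
    d = len(q)
    logits = [
        (sum(a * b for a, b in zip(q, k)) + bj) / math.sqrt(d)
        for k, bj in zip(keys, bias)
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(window_indices(0, 4, 4, 2))           # [0, 1, 4, 5]
print(window_indices(0, 4, 4, 2, shift=1))  # [0, 3, 12, 15]
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With the shift applied, token 0 is grouped with different neighbors than in the unshifted layer, which is how alternating layers let information cross window borders while each layer's cost stays proportional to the window size.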

2. Codec-Stream Tokenization

LLaVA-OV-2's major innovation is codec-stream tokenization, which leverages video codec signals to partition the temporal axis and to select spatial evidence patches, centering token allocation around eventful content.

The workflow consists of:

Continuous bit-cost stream: The compressed video timeline is split into $B$ equal bins, and the aggregate P/B-frame bit-cost $e_b$ of each bin $b$ is computed as a proxy for motion/event intensity. An average bit-cost quota per adaptive group of pictures (GOP) is then derived from these per-bin costs.
Adaptive GOP grouping: GOP boundaries are set dynamically, either when a group reaches a maximal temporal span or when it exceeds a minimum span and its cumulative bit-cost passes a threshold. A further "valley search" refinement aligns boundaries with local bit-cost minima.
Motion-residual spatial scoring: For each P-frame, the motion magnitude and luma residual yield a normalized spatial saliency map.
2×2 block selection and scoring: Each $2 \times 2$ block's score aggregates saliency and bit-cost, ranking blocks for inclusion in the visual token set.
Stratified P-canvas allocation: Within each adaptive GOP, candidate blocks are ranked. Frame- and block-wise ranks, together with attenuation for repeated frames, produce a weight for each frame. P-canvas slots sample from frames via a cumulative allocation curve, which prevents any single frame from monopolizing the token slots.
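The adaptive GOP grouping step can be sketched as a single pass over the per-bin bit-costs. This is a minimal illustration under stated assumptions: the function name, the parameter names (`quota`, `min_span`, `max_span`), and the choice of quota as total cost divided by a target group count are hypothetical, and the valley-search refinement is omitted.

```python
def adaptive_gops(bin_costs, quota, min_span, max_span):
    """Split bins into contiguous groups, returned as half-open (start, end)
    index pairs. A group closes when it reaches max_span bins, or when it has
    at least min_span bins and its cumulative cost reaches the quota."""
    groups, start, acc = [], 0, 0.0
    for b, cost in enumerate(bin_costs):
        acc += cost
        span = b - start + 1
        if span >= max_span or (span >= min_span and acc >= quota):
            groups.append((start, b + 1))
            start, acc = b + 1, 0.0
    if start < len(bin_costs):
        groups.append((start, len(bin_costs)))  # trailing partial group
    return groups


# Toy bit-cost stream: quiet bins around a burst of activity.
costs = [1, 1, 1, 1, 8, 9, 7, 1, 1, 1, 1, 1]
quota = sum(costs) / 4  # assumed quota: average cost for 4 target groups
groups = adaptive_gops(costs, quota, min_span=1, max_span=4)
print(groups)  # [(0, 4), (4, 6), (6, 9), (9, 12)]
```

In this toy run the quiet opening is closed by the maximum-span rule, while the high-cost bins trigger the quota rule after only two or three bins, so group boundaries cluster around the eventful part of the timeline.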

A unified visual-token interface (canvas index, source-frame, coordinates, and group ID) standardizes multi-modality input, allowing seamless handling by shared encoder parameters.

3. Spatiotemporal Positional Encoding

The 3D Rotary Positional Encoding (3D RoPE) mechanism generalizes standard 1D RoPE to a $(t, h, w)$ coordinate grid. For embedding dimension $d$ split equally by axis, each vector $x = [x^{(t)}; x^{(h)}; x^{(w)}]$ at position $(p_t, p_h, p_w)$ is independently rotated along temporal, height, and width axes:

$\text{RoPE}_{3D}(x) = \left[ R_t(p_t)\, x^{(t)};\; R_h(p_h)\, x^{(h)};\; R_w(p_w)\, x^{(w)} \right], \quad x^{(a)} \in \mathbb{R}^{d/3}$

where $R_a(p)$ applies sinusoidal rotation for axis $a \in \{t, h, w\}$ and offset $p$, ensuring attention is sensitive to spatiotemporal offsets across all modalities. This shared encoding harmonizes sampling regimes (e.g., images, frames, codec canvases) within a unified coordinate system.
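A minimal sketch of a three-axis rotary encoding follows, assuming the common RoPE convention of rotating consecutive coordinate pairs with frequencies `base ** (-k / half)` and an equal three-way split of the embedding. The function names, the base of 10000, and the pairing convention are assumptions; the source does not specify them.

```python
import math


def rotate_1d(chunk, pos, base=10000.0):
    """Standard 1D RoPE on one chunk: rotate consecutive pairs of
    coordinates by the angle pos * base ** (-k / half)."""
    half = len(chunk) // 2
    out = []
    for k in range(half):
        theta = pos * base ** (-k / half)
        x0, x1 = chunk[2 * k], chunk[2 * k + 1]
        out.append(x0 * math.cos(theta) - x1 * math.sin(theta))
        out.append(x0 * math.sin(theta) + x1 * math.cos(theta))
    return out


def rope_3d(x, t, h, w):
    """Split x into three equal chunks and rotate each by its own coordinate."""
    n = len(x) // 3
    assert len(x) == 3 * n and n % 2 == 0, "dimension must be divisible by 6"
    return rotate_1d(x[:n], t) + rotate_1d(x[n:2 * n], h) + rotate_1d(x[2 * n:], w)


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9, -0.3, 0.7, 1.1, -1.2]
k = [1.0, 0.2, -0.4, 0.8, 0.3, 0.6, -1.1, 0.05, 0.9, -0.5, 0.4, 0.7]

# Same (t, h, w) offset of (1, 2, 3) between query and key in both cases.
score_a = dot(rope_3d(q, 1, 2, 3), rope_3d(k, 0, 0, 0))
score_b = dot(rope_3d(q, 4, 6, 8), rope_3d(k, 3, 4, 5))
```

The two scores agree up to floating-point error because each rotation depends only on the coordinate difference along its axis, which is the property that makes attention sensitive to relative spatiotemporal offsets rather than absolute positions.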

4. Training Strategy and Data

LLaVA-OV-2's data pipeline comprises large-scale open supervision spanning:

Pretraining (8M video-caption pairs, ~104B tokens): Clips are stratified by length—30s (4.2M), 30–60s (2.7M), 60–180s (0.7M), and 10–15 min (0.35M, consumed with both 384- and 768-frame budgets under codec-stream).
Spatial fine-tuning (4M samples): Supervision comprises structured 2D/3D QA (counting, direction, distances, ordering), embodied trajectories, web frame labels, and point-and-track tasks from Molmo2.
Curriculum stages: Training advances through four stages that progressively increase context length and instruction complexity while shifting from frame-sampled to codec-stream-dominant batches. Stage 4 introduces the codec-stream paradigm, adds in-depth spatial QA, and extends context to hour-length videos.

Batch composition is dynamically balanced (approx. 50% codec-stream, 37.5% uniform video, 12.5% images), and the OneVision-Encoder's shared parameters process all three input types.
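To make the stated mixture concrete, the sketch below turns the approximate ratios into per-batch sample counts using a largest-remainder rule. The function name, the dictionary keys, and the rounding rule are assumptions for illustration; the source gives only the approximate proportions.

```python
def batch_counts(batch_size, ratios):
    """Allocate batch slots to input types with the largest-remainder rule."""
    raw = {name: batch_size * r for name, r in ratios.items()}
    counts = {name: int(v) for name, v in raw.items()}
    leftover = batch_size - sum(counts.values())
    # Hand the remaining slots to the types with the largest fractional parts.
    by_fraction = sorted(raw, key=lambda n: raw[n] - counts[n], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts


ratios = {"codec_stream": 0.5, "uniform_video": 0.375, "image": 0.125}
print(batch_counts(16, ratios))  # {'codec_stream': 8, 'uniform_video': 6, 'image': 2}
```

With a batch of 16 the ratios divide evenly into 8, 6, and 2 samples; for batch sizes that do not divide evenly, the remainder rule keeps the total equal to the batch size.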

5. JumpScore Benchmark

JumpScore is introduced to evaluate fine-grained grounding in repetitive, high-frequency motion, exemplified by "jump rope" events:

189 annotated in-the-wild videos, 30–90s each, median motion cycle 0.4s.
Decimal-second annotations of each rope-behind-legs event.
Prompt input: "List the start timestamps... rope behind legs."
Metric: mean average precision (mAP) evaluated at a set of temporal tolerances, measured in seconds.
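The idea of scoring timestamps under a temporal tolerance can be sketched with a simple one-to-one matcher. This is a simplified illustration, not the benchmark's official mAP computation, which the source does not spell out; the function names, the greedy matching rule, and the toy timestamps are assumptions.

```python
def match_within_tolerance(pred, gt, tol):
    """Greedy one-to-one matching of sorted timestamps.
    Returns the number of predictions within tol seconds of a distinct
    ground-truth event."""
    pred, gt = sorted(pred), sorted(gt)
    i = j = hits = 0
    while i < len(pred) and j < len(gt):
        if abs(pred[i] - gt[j]) <= tol:
            hits += 1
            i += 1
            j += 1
        elif pred[i] < gt[j]:
            i += 1  # prediction too early to match any remaining event
        else:
            j += 1  # ground-truth event was missed
    return hits


def precision_recall(pred, gt, tol):
    hits = match_within_tolerance(pred, gt, tol)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gt) if gt else 0.0
    return precision, recall


gt = [1.0, 1.4, 1.8, 2.2]       # toy annotated event times (seconds)
pred = [1.05, 1.55, 1.8, 3.0]   # toy predicted event times (seconds)
print(precision_recall(pred, gt, tol=0.1))  # (0.5, 0.5)
print(precision_recall(pred, gt, tol=0.2))  # (0.75, 0.75)
```

Loosening the tolerance from 0.1 s to 0.2 s turns the 1.55 s prediction into a match for the 1.4 s event, which shows why reporting results at several tolerances separates coarse localization from precise localization.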

Empirical results under matched visual-token budgets:

Uniform 128 frames: mAP ≈ 52.5
Codec-stream 128 GOP-aligned canvases: mAP ≈ 74.9 (Δ +22.4)
Average gain across budgets: +17.3 points

This demonstrates the codec-stream approach’s effectiveness at allocating tokens to dynamic events with precise temporal boundaries, surpassing fixed-slot uniform frame sampling.
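A back-of-the-envelope calculation suggests one plausible reason uniform sampling struggles here; this reasoning is an illustration and is not stated in the source. It compares the spacing between 128 uniformly sampled frames with the 0.4 s median motion cycle quoted for JumpScore videos of 30 to 90 s.

```python
MEDIAN_CYCLE_S = 0.4  # median motion cycle reported for JumpScore videos
NUM_FRAMES = 128      # uniform frame budget used in the comparison


def frame_spacing(duration_s, num_frames):
    """Seconds between consecutive frames under uniform sampling."""
    return duration_s / num_frames


spacings = {d: frame_spacing(d, NUM_FRAMES) for d in (30, 60, 90)}
for duration, spacing in spacings.items():
    coarser = spacing > MEDIAN_CYCLE_S
    print(f"{duration}s video: {spacing:.3f}s between frames, "
          f"coarser than one cycle: {coarser}")
```

For the longer videos in this range, the gap between sampled frames exceeds a full motion cycle, so some events fall entirely between frames regardless of how well the model reads each frame.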

6. Empirical Performance

LLaVA-OV-2-8B improves over Qwen3-VL-8B in every reported category (scores listed as LLaVA-OV-2-8B vs Qwen3-VL-8B):

Video understanding (18 tasks): 62.5 vs 58.2 (+4.3)
Spatial reasoning (11 tasks): 63.5 vs 58.2 (+5.3)
Tracking (J∩F avg, 4 tasks): 48.0 vs 32.4 (+15.6)
JumpScore mAP: 74.9 vs 30.1 (+44.8)

On temporal grounding benchmarks (Charades, ANet, QVHighlights) with matched token budgets, codec-stream input achieves +9.7 absolute mAP over uniform sampling.

Benchmark Category	LLaVA-OV-2-8B	Qwen3-VL-8B	Advantage
Video Understanding	62.5	58.2	+4.3
Spatial Reasoning	63.5	58.2	+5.3
Tracking (J∩F avg)	48.0	32.4	+15.6
JumpScore mAP	74.9	30.1	+44.8

7. Innovations and Future Directions

Key drivers of LLaVA-OV-2’s empirical gains include:

Codec-stream tokenization: Aligns visual tokens to perceptual transitions, supporting stable long-video compression without complex routing or inference overhead.
Adaptive, content-sensitive GOP partitioning: Utilizes bit-cost quotas and valley search to form temporally variable, content-aligned groups.
Motion-residual block selection: Balances intra-frame saliency with temporal event coverage.
Unified processing: Shared group-visible attention masks and 3D RoPE enable joint handling of images, sampled frames, and codec canvases.

Data-centric practices—length-stratified video-caption pretraining, extensive spatial-task fine-tuning, and progressive, modality-mixed curricula—are critical to both long-horizon understanding and high-fidelity spatial competency.

Anticipated research extensions include streaming perception (incremental inference with codec alignment), hierarchical memory models for ultra-long context retention, reinforcement-style finetuning for precise temporal grounding, 3D scene-graph and physics integration, and cross-modal retrieval/multimodal dialog leveraging the compressed stream.

LLaVA-OneVision-2 establishes that treating compressed video as a continuous, bitstream-driven perceptual signal, rather than as a sequence of uniformly sampled frames, is fundamental to advancing robust, scalable multimodal perceptual intelligence (An et al., 25 May 2026).


References (1)

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence (2026)
