LLaVA-OneVision-2: Unified Vision-Language Model
- LLaVA-OneVision-2 is a unified vision-language model introducing codec-stream tokenization and adaptive windowed attention for enhanced spatiotemporal precision.
- It leverages a OneVision-Encoder with 3D rotary positional encoding, efficiently processing static images, video frames, and codec-stream videos within a single Transformer architecture.
- On the JumpScore benchmark, codec-stream input reaches mAP ≈ 74.9 versus ≈ 52.5 for uniform sampling at a matched 128-canvas budget, with an average gain of +17.3 points across budgets.
LLaVA-OneVision-2 (LLaVA-OV-2) is a next-generation codec-aligned vision-LLM that unifies high-resolution spatial reasoning, long-horizon video understanding, and fine-grained temporal grounding within a single Transformer-based architecture. Building on the LLaVA-OneVision series, LLaVA-OV-2 introduces codec-stream tokenization, adaptive windowed attention, and a unified spatiotemporal positional encoding, demonstrating state-of-the-art performance across a broad array of multimodal benchmarks, including the newly proposed JumpScore for high-frequency temporal grounding (An et al., 25 May 2026).
1. Model Architecture
The core of LLaVA-OV-2 is the OneVision-Encoder, a dynamic-resolution Vision Transformer supporting three input types via a unified patch-token interface:
- Static images (single temporal group)
- Uniformly sampled frame sequences (fixed four-slot IPPP groups)
- Codec-stream videos (variable-length "GOP" groups of I- and P-canvases)
All inputs are partitioned into pixel patches, which are then merged into blocks within the encoder. Each patch embedding is augmented with 3D rotary positional encodings (3D RoPE), and a group-visible attention mask restricts attention to tokens of the same group, isolating distinct temporal segments from one another.
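As a minimal sketch of the group-visible mask described above, the snippet below builds a boolean matrix from per-token group IDs. The function name `group_visible_mask` and the example IDs are hypothetical and chosen only for illustration; the source does not specify how the mask is implemented.

```python
import numpy as np

def group_visible_mask(group_ids):
    """Return a boolean matrix where entry [i, j] is True
    when token i and token j belong to the same group."""
    g = np.asarray(group_ids)
    return g[:, None] == g[None, :]

# Hypothetical example: two tokens in group 0, three tokens in group 1.
mask = group_visible_mask([0, 0, 1, 1, 1])
print(mask.astype(int))
```

The result is block-diagonal when tokens are ordered by group, so no token attends across a group boundary.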
Most encoder layers employ local window self-attention, limiting each patch to attend to a spatial neighborhood. Layerwise window shifts (analogous to Swin Transformers) enable global spatial communication without incurring quadratic computational cost in spatial resolution. Attention for position $i$ and head/layer $(h, \ell)$ takes the form:
$$\mathrm{Attn}^{(h,\ell)}(i) = \sum_{j \in \mathcal{W}(i)} \operatorname{softmax}_{j}\!\left(\frac{q_i^{\top} k_j}{\sqrt{d_k}} + B_{ij}\right) v_j$$
where $B_{ij}$ is a relative position bias and $\mathcal{W}(i)$ selects spatially local indices.
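The following NumPy sketch illustrates local-window attention with an additive relative position bias for a single head. It is a simplification under stated assumptions: tokens lie on a 1D row rather than a 2D grid, the window is a fixed radius, window shifts are omitted, and the name `window_attention` is hypothetical rather than taken from the model's code.

```python
import numpy as np

def window_attention(q, k, v, bias, window):
    """Single-head attention where query i only sees keys with |i - j| <= window.

    q, k, v: float arrays of shape (n, d); bias: float array of shape (n, n).
    Returns the attended values and the attention weights.
    """
    n, d = q.shape
    idx = np.arange(n)
    local = np.abs(idx[:, None] - idx[None, :]) <= window  # local index set per query
    logits = q @ k.T / np.sqrt(d) + bias                    # scaled scores plus bias
    logits = np.where(local, logits, -np.inf)               # drop non-local keys
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out, w = window_attention(q, k, v, bias=np.zeros((n, n)), window=1)
```

Each row of `w` sums to one and is zero outside the window, so cost grows with the window size rather than with the full sequence length.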
2. Codec-Stream Tokenization
LLaVA-OV-2's major innovation is codec-stream tokenization, which leverages video codec signals to partition the temporal axis and to select spatial evidence patches, centering token allocation around eventful content.
The workflow consists of:
- Continuous bit-cost stream: The compressed video timeline is split into $N$ equal bins, and the aggregate P/B-frame bit-cost $c_n$ of each bin is computed as a proxy for motion/event intensity. For a target of $K$ adaptive groups of pictures (GOPs), the average bit-cost quota per GOP is $\bar{c} = \frac{1}{K} \sum_{n=1}^{N} c_n$.
- Adaptive GOP grouping: GOP boundaries are set dynamically, either when a group reaches a maximal temporal span $T_{\max}$, or when it exceeds a minimum span $T_{\min}$ and a cumulative bit-cost threshold $\tau$. Additional refinement via "valley search" ensures boundaries align with local bit-cost minima.
- Motion-residual spatial scoring: For each P-frame $t$, motion magnitude $M_t$ and luma residual $R_t$ yield a normalized spatial saliency map $S_t$.
- 2×2 block selection and scoring: Each $2 \times 2$ block's score $s_b$ aggregates saliency and bit-cost, ranking blocks for inclusion in the visual token set.
- Stratified P-canvas allocation: Within each adaptive GOP, candidate blocks $\mathcal{B}$ are ranked. Frame- and block-wise ranks, together with attenuation for repeated frames, produce a weight $w_f$ for each frame $f$. P-canvas slots sample from frames via a cumulative allocation curve $F$. This prevents single frames from monopolizing all token slots.
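Two of the steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the function names, the `>=` comparisons, and the toy inputs are hypothetical; valley search and stratified allocation are omitted; and the block score uses summed saliency only, whereas the source also folds in bit-cost.

```python
import numpy as np

def adaptive_gops(bit_costs, t_min, t_max, threshold):
    """Group consecutive bins into (start, end) ranges, end exclusive.

    A group closes when it reaches t_max bins, or when it has at least
    t_min bins and its cumulative bit-cost has reached the threshold.
    """
    groups, start, acc = [], 0, 0.0
    for i, cost in enumerate(bit_costs):
        acc += cost
        length = i - start + 1
        if length >= t_max or (length >= t_min and acc >= threshold):
            groups.append((start, i + 1))
            start, acc = i + 1, 0.0
    if start < len(bit_costs):
        groups.append((start, len(bit_costs)))  # trailing partial group
    return groups

def top_blocks(saliency, k):
    """Sum a saliency map over non-overlapping 2x2 blocks and return the
    (block_row, block_col) indices of the k highest-scoring blocks."""
    h, w = saliency.shape
    cropped = saliency[: h - h % 2, : w - w % 2]
    blocks = cropped.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))
    order = np.argsort(blocks, axis=None)[::-1][:k]
    return [tuple(int(x) for x in np.unravel_index(i, blocks.shape)) for i in order]

costs = [1, 1, 8, 1, 1, 1, 1, 9, 9, 1]
groups = adaptive_gops(costs, t_min=2, t_max=4, threshold=8)
print(groups)  # [(0, 3), (3, 7), (7, 9), (9, 10)]

saliency = np.arange(16, dtype=float).reshape(4, 4)
selected = top_blocks(saliency, k=2)
print(selected)  # [(1, 1), (1, 0)]
```

In the toy bit-cost stream, bursts close a group early (bins 0 to 2 and 7 to 8), while the quiet stretch in between runs until the maximal span is hit.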
A unified visual-token interface (canvas index, source-frame, coordinates, and group ID) standardizes multi-modality input, allowing seamless handling by shared encoder parameters.
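One way to picture the unified interface is as a small record carrying the four fields named above. The class name `VisualToken`, the field names, and the example values are hypothetical; the source lists the fields but not their concrete representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VisualToken:
    """Metadata attached to one visual token, independent of input type."""
    canvas_index: int   # which canvas (image, frame slot, or I/P-canvas) holds the token
    source_frame: int   # index of the video frame the patch was taken from
    row: int            # vertical patch coordinate
    col: int            # horizontal patch coordinate
    group_id: int       # temporal group used by the group-visible mask

# A static image is a single temporal group; a codec-stream token carries
# the frame it was sampled from and the adaptive GOP it belongs to.
image_token = VisualToken(canvas_index=0, source_frame=0, row=3, col=5, group_id=0)
codec_token = VisualToken(canvas_index=2, source_frame=47, row=1, col=8, group_id=4)
```

Because every input type reduces to the same record, the encoder does not need separate code paths per modality.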
3. Spatiotemporal Positional Encoding
The 3D Rotary Positional Encoding (3D RoPE) mechanism generalizes standard 1D RoPE to a $(t, h, w)$ coordinate grid. For embedding dimension $d$ split equally by axis, each sub-vector is independently rotated along the temporal, height, and width axes:
$$\mathrm{RoPE}_{3\mathrm{D}}(x;\, t, h, w) = \big[\, R_{T}(t)\, x_{T};\; R_{H}(h)\, x_{H};\; R_{W}(w)\, x_{W} \,\big]$$
where $R_{a}(p)$ applies sinusoidal rotation for axis $a \in \{T, H, W\}$ and offset $p$, ensuring attention is sensitive to spatiotemporal offsets across all modalities. This shared encoding harmonizes sampling regimes (e.g., images, frames, codec canvases) within a unified coordinate system.
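A minimal NumPy sketch of this axis-wise rotation follows. It assumes the common RoPE convention of rotating consecutive channel pairs with a frequency base of 10000; the source does not state these details, and the names `rope_1d` and `rope_3d` are hypothetical.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x (shape (n, d), d even) by pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per channel pair
    angles = pos[..., None] * freqs             # shape (n, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_3d(x, t, h, w):
    """Split channels into three equal parts and rotate each by one coordinate.

    x: float array of shape (n, d) with d divisible by 6; t, h, w: arrays of shape (n,).
    """
    x_t, x_h, x_w = np.split(x, 3, axis=-1)
    return np.concatenate([rope_1d(x_t, t), rope_1d(x_h, h), rope_1d(x_w, w)], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 12))
k = rng.normal(size=(1, 12))

def score(pos_q, pos_k):
    """Dot product of q and k after rotating each to its (t, h, w) position."""
    rq = rope_3d(q, *[np.array([c]) for c in pos_q])
    rk = rope_3d(k, *[np.array([c]) for c in pos_k])
    return float((rq * rk).sum())

# Shifting both positions by the same offset leaves the score unchanged.
a = score((1, 2, 3), (4, 6, 8))
b = score((6, 7, 8), (9, 11, 13))
```

The two scores agree up to floating-point error, which is the property that makes attention depend on relative spatiotemporal offsets rather than absolute positions.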
4. Training Strategy and Data
LLaVA-OV-2's data pipeline comprises large-scale open supervision spanning:
- Pretraining (8M video-caption pairs, ~104B tokens): Clips are stratified by length: under 30s (4.2M), 30–60s (2.7M), 60–180s (0.7M), and 10–15 min (0.35M, consumed with both 384- and 768-frame budgets under codec-stream).
- Spatial fine-tuning (4M samples): Supervision comprises structured 2D/3D QA (counting, direction, distances, ordering), embodied trajectories, web frame labels, and point-and-track tasks from Molmo2.
- Curriculum stages: Training advances through four stages, progressively increasing context length and instruction complexity, shifting from frame-sampled to codec-stream-dominant batches. Stage 4 introduces the codec-stream paradigm, mixes in the in-depth spatial QA data, and extends context scaling to hour-length video.
Batch composition is dynamically balanced (approx. 50% codec-stream, 37.5% uniform video, 12.5% images), and because the OneVision-Encoder shares its parameters across input types, all three modalities are processed by the same weights within a batch.
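To make the approximate mixing ratios concrete, the sketch below converts them into per-batch sample counts. The function `batch_counts`, the modality keys, and the rule of giving any rounding remainder to the largest share are assumptions made for illustration; the source only gives the approximate percentages and says the balance is dynamic.

```python
def batch_counts(batch_size, ratios):
    """Split a batch into per-modality counts, flooring each share and
    assigning any remainder to the modality with the largest ratio."""
    counts = {name: int(batch_size * share) for name, share in ratios.items()}
    leftover = batch_size - sum(counts.values())
    counts[max(ratios, key=ratios.get)] += leftover
    return counts

ratios = {"codec_stream": 0.5, "uniform_video": 0.375, "image": 0.125}
counts = batch_counts(16, ratios)
print(counts)  # {'codec_stream': 8, 'uniform_video': 6, 'image': 2}
```

With a batch of 16 the ratios divide evenly; for other sizes the remainder rule keeps the total equal to the batch size.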
5. JumpScore Benchmark
JumpScore is introduced to evaluate fine-grained grounding in repetitive, high-frequency motion, exemplified by "jump rope" events:
- 189 annotated in-the-wild videos, 30–90s each, median motion cycle 0.4s.
- Decimal-second annotations of each rope-behind-legs event.
- Prompt input: "List the start timestamps... rope behind legs."
- Metric: mean average precision (mAP) over a set of temporal tolerances, measured in seconds.
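The core of any tolerance-based timestamp metric is deciding which predictions count as hits. The sketch below shows greedy one-to-one matching of predicted timestamps to annotations within a tolerance, reporting precision and recall. It is a simplified stand-in and not the benchmark's official mAP scoring; the function name, the tolerance values, and the example timestamps are all hypothetical.

```python
def match_within_tolerance(pred, truth, tol):
    """Greedily match sorted predicted and annotated timestamps one-to-one.

    A prediction is a hit when it lies within `tol` seconds of an unmatched
    annotation. Returns (precision, recall).
    """
    pred, truth = sorted(pred), sorted(truth)
    i = j = hits = 0
    while i < len(pred) and j < len(truth):
        if abs(pred[i] - truth[j]) <= tol:
            hits += 1
            i += 1
            j += 1
        elif pred[i] < truth[j]:
            i += 1   # prediction too early to match any remaining annotation
        else:
            j += 1   # annotation missed by all remaining predictions
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical events spaced 0.4 s apart, with four predicted timestamps.
truth = [0.4, 0.8, 1.2, 1.6]
pred = [0.42, 0.85, 1.55, 2.3]
loose = match_within_tolerance(pred, truth, tol=0.1)
strict = match_within_tolerance(pred, truth, tol=0.03)
print(loose, strict)  # (0.75, 0.75) (0.25, 0.25)
```

Tightening the tolerance from 0.1 s to 0.03 s drops three hits to one, which is why scores are reported across several tolerances rather than a single one.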
Empirical results under matched visual-token budgets:
- Uniform 128 frames: mAP ≈ 52.5
- Codec-stream 128 GOP-aligned canvases: mAP ≈ 74.9 (Δ +22.4)
- Average gain across budgets: +17.3 points
This shows that the codec-stream approach allocates tokens effectively to dynamic events with precise temporal boundaries, outperforming fixed-slot uniform frame sampling.
6. Empirical Performance
Compared with Qwen3-VL-8B, LLaVA-OV-2-8B improves in every benchmark category:
- Video understanding (18 tasks): 62.5 vs 58.2 (+4.3)
- Spatial reasoning (11 tasks): 63.5 vs 58.2 (+5.3)
- Tracking (J∩F avg, 4 tasks): 48.0 vs 32.4 (+15.6)
- JumpScore mAP: 74.9 vs 30.1 (+44.8)
On temporal grounding benchmarks (Charades, ANet, QVHighlights) with matched token budgets, codec-stream input achieves +9.7 absolute mAP over uniform sampling.
| Benchmark Category | LLaVA-OV-2-8B | Qwen3-VL-8B | Advantage |
|---|---|---|---|
| Video Understanding | 62.5 | 58.2 | +4.3 |
| Spatial Reasoning | 63.5 | 58.2 | +5.3 |
| Tracking (J∩F avg) | 48.0 | 32.4 | +15.6 |
| JumpScore mAP | 74.9 | 30.1 | +44.8 |
7. Innovations and Future Directions
Key drivers of LLaVA-OV-2’s empirical gains include:
- Codec-stream tokenization: Aligns visual tokens to perceptual transitions, supporting stable long-video compression without complex routing or inference overhead.
- Adaptive, content-sensitive GOP partitioning: Utilizes bit-cost quotas and valley search to form temporally variable, content-aligned groups.
- Motion-residual block selection: Balances intra-frame saliency with temporal event coverage.
- Unified processing: Shared group-visible attention masks and 3D RoPE enable joint handling of images, sampled frames, and codec canvases.
Data-centric practices—length-stratified video-caption pretraining, extensive spatial-task fine-tuning, and progressive, modality-mixed curricula—are critical to both long-horizon understanding and high-fidelity spatial competency.
Anticipated research extensions include streaming perception (incremental inference with codec alignment), hierarchical memory models for ultra-long context retention, reinforcement-style finetuning for precise temporal grounding, 3D scene-graph and physics integration, and cross-modal retrieval/multimodal dialog leveraging the compressed stream.
LLaVA-OneVision-2 establishes that treating compressed video as a continuous, bitstream-driven perceptual signal, rather than as a sequence of uniformly sampled frames, is fundamental to advancing robust, scalable multimodal perceptual intelligence (An et al., 25 May 2026).