OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
Abstract: Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. Yet modern vision architectures have strayed from these principles: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., codecs. Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics. Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, OV-Encoder consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.
Explain it Like I'm 14
Overview
This paper introduces a new way for computers to “watch” and understand videos and images more efficiently. The idea is called OneVision-Encoder (OV-Encoder). It focuses on the parts of a video that actually change (like motion) instead of wasting time on background areas that stay the same. The authors argue that smart visual understanding is really about compression: keep the important bits, skip the boring ones.
What questions does the paper try to answer?
- How can we design a vision system that pays attention to only the most informative parts of a video?
- Can focusing on motion and changes (instead of every pixel) make models both faster and more accurate?
- Will this approach work well for many tasks, like reading documents, understanding charts, and answering questions about videos?
How did they do it? (Methods explained simply)
Think of a video like a flipbook. Most pages look almost the same; only small parts change from one page to the next. Traditional models read every single page in detail. OV-Encoder reads the whole first page to get the full scene, then on later pages it mainly checks the small parts that move or change.
Here are the key ideas, explained with everyday language:
- Codec Patchification: A “codec” is what compresses videos by storing only changes between frames. OV-Encoder uses the codec’s signals (motion arrows and “residuals,” which are the leftover differences the motion doesn’t explain) to decide which small tiles (patches) of the image are worth keeping.
- Dense Video-Codec Patchification: Split each frame into small tiles. Keep all tiles for the first frame in a group (the “I-frame,” full scene). For later frames (called “P-frames”), only keep the tiles where something moved or changed. This often reduces the number of tiles by around 75–97%, while still capturing the action.
- Chunk-wise Patchification: If the video is long, split it into chunks and sample one frame per chunk. This keeps the timeline covered without reading every frame.
- Single-Image Spatial Patchification: For a single picture, just split it into tiles in a fixed order (top-to-bottom, left-to-right) so the model knows where everything is.
- 3D RoPE (positional encoding): Every tile gets a “label” that says where it is in space (x, y) and when it appears (time t). This helps the model understand how things move and where they are.
- Cluster discrimination training: Instead of using captions or labels for every image, they group millions of images and videos into “concept clusters” (like folders of similar objects or actions). The model learns to place new pictures and clips near the right cluster centers. This teaches it both object understanding (what things are) and motion understanding (what’s happening).
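The codec-guided tile picking described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the patch size, equal weighting of motion and residual energy, and function names are all assumptions made for the sketch. It scores each tile by summing codec motion magnitude and residual energy inside it, then keeps only the top fraction of tiles.

```python
import numpy as np

def patch_saliency(motion_mag, residual, patch=16):
    """Aggregate codec signals into one saliency score per tile.

    motion_mag, residual: (H, W) per-pixel motion magnitude and residual
    energy decoded from the bitstream (illustrative inputs).
    """
    H, W = motion_mag.shape
    h, w = H // patch, W // patch
    def pool(x):
        # Sum each signal inside every patch-sized tile.
        return x[:h * patch, :w * patch].reshape(h, patch, w, patch).sum(axis=(1, 3))
    # Equal weighting of motion and residual is an assumption of this sketch.
    return pool(motion_mag) + pool(residual)

def select_sparse_patches(motion_mag, residual, keep_ratio=0.125, patch=16):
    """Return flat indices of the top keep_ratio most salient tiles (P-frame)."""
    s = patch_saliency(motion_mag, residual, patch).ravel()
    k = max(1, int(round(keep_ratio * s.size)))
    return np.sort(np.argsort(s)[::-1][:k])

# Toy example: a 64x64 frame where only the top-left corner moves.
motion = np.zeros((64, 64)); motion[:16, :16] = 1.0
residual = np.zeros((64, 64))
idx = select_sparse_patches(motion, residual, keep_ratio=0.0625, patch=16)
print(idx)  # only tile 0 (the moving corner) survives selection
```

With 16 tiles and a 6.25% budget, only the single moving tile is kept; everything static is skipped, which is the whole point of codec-aligned sparsity.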
They trained OV-Encoder on huge collections of images and videos from the web and action datasets, in two stages:
- Stage 1: Image-only training to learn strong object features.
- Stage 2: Add videos (with the codec-guided patch picking) and OCR (text in images) to learn actions, motion, and reading.
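The 3D positional "labels" mentioned earlier can be sketched as a rotary embedding whose feature dimensions are split across time, height, and width. The dimension split, frequency base, and function name below are illustrative assumptions; the paper's exact parameterization may differ. The useful property is that attention scores between two tokens depend only on their relative (t, y, x) offsets, which is what lets sparse, irregularly placed tiles still be compared coherently.

```python
import numpy as np

def rope_3d(x, coords, base=10000.0):
    """Apply a 3D rotary position embedding to token features.

    x:      (n_tokens, dim) features, with dim divisible by 6
    coords: (n_tokens, 3) integer (t, y, x) positions of each tile
    Each third of the feature dimension is rotated by angles derived
    from one axis, so dot products depend only on relative offsets.
    """
    n, dim = x.shape
    d = dim // 3                                   # feature slice per axis
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    out = np.empty_like(x, dtype=float)
    for axis in range(3):
        sl = x[:, axis * d:(axis + 1) * d]
        ang = coords[:, axis:axis + 1] * inv_freq  # (n, half) rotation angles
        cos, sin = np.cos(ang), np.sin(ang)
        a, b = sl[:, :half], sl[:, half:]
        out[:, axis * d:axis * d + half] = a * cos - b * sin
        out[:, axis * d + half:(axis + 1) * d] = a * sin + b * cos
    return out

# Two tiles at the same (y, x) location, one frame apart in time.
feats = np.ones((2, 12))
coords = np.array([[0, 2, 3], [1, 2, 3]])
rotated = rope_3d(feats, coords)
```

Shifting both tokens by the same (t, y, x) offset leaves their dot product unchanged, which is the relative-position property that makes this encoding work on irregular token layouts.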
What did they find, and why does it matter?
The main finding is that focusing on the right patches (the ones with motion or change) makes the model both more accurate and more efficient.
Highlights:
- Stronger video understanding: OV-Encoder beats well-known models like SigLIP2, DINOv3, and Qwen3-ViT on many video benchmarks. For example, on the Diving-48 dataset (a tough action recognition task), OV-Encoder is up to 17% more accurate than SigLIP2 and 8% more accurate than DINOv3 under the same tile (patch) budget.
- Better multimodal performance: When plugged into an LLM (Qwen3-4B), OV-Encoder improves scores on 16 benchmarks covering videos, images, documents, and OCR (reading text in images). It does this even while using fewer visual tokens (tiles) and less caption data during pretraining.
- Efficient at different budgets: When you give the model more or fewer tiles to use (like 512, 1024, 2048, or 4096 patches), the codec-guided selection consistently outperforms dense, frame-by-frame processing. It’s smart spending: the same or less compute gets more accurate results.
- Causal tests: When they purposely mess with the selected motion patches (replace half with non-motion, swap motion from a different video, or shuffle positions), accuracy drops a lot. This shows those chosen patches really do carry the important information.
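The token savings behind these budgets can be checked with quick arithmetic. The 512-tokens-for-2-frames mapping implies 256 patches per frame when processing densely, so the paper's default clip setting (64 frames, token budget 2048) keeps only 12.5% of patches, an 87.5% reduction:

```python
# Token-budget arithmetic for codec-guided sparsity (values from the paper;
# 256 patches per frame is implied by the 512-tokens-for-2-frames mapping).
patches_per_frame = 512 // 2           # 256
frames, budget = 64, 2048              # default clip setting
dense_tokens = frames * patches_per_frame
kept = budget / dense_tokens
print(f"kept {kept:.1%}, reduced {1 - kept:.1%}")  # kept 12.5%, reduced 87.5%
```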
Why this matters:
- It proves that “efficiency vs. accuracy” is not a trade-off here. By aligning the model with how videos actually store information (codecs), you get both speed and smarts.
- It suggests a path toward scalable, general visual intelligence: learn from changes, not just raw pixels.
What is the potential impact?
If future vision systems adopt this codec-aligned strategy, we can build:
- Faster, cheaper models that still achieve top performance on video understanding, document reading, and multimodal tasks (vision + language).
- Systems that handle long videos better by keeping full temporal coverage but only focusing on the important parts.
- More general-purpose “visual brains” for AI that see, track, and reason over time, helping in areas like sports analysis, security, medical video understanding, robotics, and assistive technologies.
In short, OV-Encoder shows that paying attention to “what changes” is a powerful foundation for smarter, more efficient AI that understands the visual world.
Knowledge Gaps, Limitations, and Open Questions
The following list captures concrete gaps and unresolved questions that could guide future research and targeted experiments:
- Dependence on HEVC signals: How does codec-guided patch selection generalize when videos are not HEVC-encoded (e.g., raw camera streams, AV1, VP9, H.264)? Compare against optical-flow or learned motion estimators as alternatives to codec motion/residual signals.
- Real-time feasibility: What is the end-to-end latency and throughput impact of running a codec to extract motion/residuals at inference time? Report wall-clock speed, memory footprint, and energy versus dense baselines on common hardware (A100/A800, consumer GPUs).
- Robustness to camera motion and dynamic backgrounds: Does codec-aligned sparsity over-focus on irrelevant motion (e.g., hand-held shake, water, foliage) while missing semantically critical but static regions? Evaluate on datasets with controlled camera motion (e.g., Ego4D) and dynamic backgrounds.
- Static semantics under low motion: How often does the method miss important static content (e.g., signage, small objects, text overlays)? Quantify failure cases on fine-grained, text-heavy, and small-object tasks; explore hybrid selection that guarantees a baseline of static patches.
- Fixed sparsity ratio r: The sparsity proportion (3.1%–25%) is fixed. Can content-adaptive or task-adaptive sparsity (dynamic r) improve accuracy–efficiency trade-offs? Test learned gating/policy for patch budgets.
- Token budget allocation: The clip-level budget assigns 512 tokens to I-frames and 1,536 to P-frames. What are optimal allocations across scenes and tasks? Ablate different I/P allocations and per-GOP versus per-clip budgeting.
- Patch-size and codec CU mismatch: ViT patches (e.g., p=14) aggregate CU motion/residuals with CU sizes of 4×4–64×64. Does multi-scale patchification or patch-size adaptation improve selection fidelity? Ablate patch sizes and multi-scale schemes.
- 3D-RoPE choices: How does 3D-RoPE compare to ALiBi, learned relative encodings, or spatiotemporal sinusoidal schemes on irregular layouts? Provide systematic ablations across positional encoding families.
- Attentive pooling head: Is the observed performance specific to the chosen pooling? Compare to CLS-token, mean pooling, Perceiver-style cross-attention, or hierarchical pooling to isolate head effects.
- Offline clustering dependency: Pseudo-labels come from a frozen metaclip-h14 encoder. How sensitive is performance to this encoder choice, clustering granularity, and feature normalization? Evaluate multiple encoders and centroid counts.
- Cluster quality and update schedule: Single-step clustering may become stale as the model improves. Does iterative or periodic re-clustering (with modest overhead) yield better representations? Measure label noise vs. gains.
- Multi-label and negative sampling hyperparameters: Top-10 positives and r=0.1 negative center sampling are fixed. Conduct controlled ablations on the number of positives/negatives and sampling strategies to assess stability and performance.
- Long-horizon temporal modeling: Pretraining uses clips up to 64 frames; attentive probing uses up to 16 frames. How does codec-aligned sparsity scale to minute-long contexts or streaming inputs? Evaluate on long-video benchmarks with explicit memory mechanisms.
- Failure-mode analysis: Provide qualitative/quantitative audits of cases where codec patchification harms performance (e.g., TOMATO, specific OCR/document tasks). Identify patterns to guide hybrid or fallback selection policies.
- Cross-codec generality: Is “HEVC-style” alignment essential, or do similar gains arise with other codec families (H.264, AV1)? Evaluate cross-codec motion/residual signals and standardized interfaces.
- End-to-end LMM generalization: Results rely on Qwen3-4B; do gains transfer across LLMs (e.g., Llama, Mistral, Qwen2)? Replicate LMM-probing with matched training to test encoder–LM coupling effects.
- Efficiency reporting gap: Beyond token reduction, provide standardized metrics (images/sec, frames/sec, GPU hours, carbon footprint) for pretraining and inference to substantiate “efficiency correlates with accuracy.”
- Domain shift and fairness: LAION/COYO-heavy image data may encode societal biases. Conduct demographic bias, fairness, and robustness audits across groups, languages, document types, and geographies; report mitigations.
- Document/OCR specialization: OV-Encoder lags on some OCR/document tasks. What specialized tokenization (e.g., text-aware patchification), heads, or auxiliary losses improve dense text understanding without sacrificing video gains?
- Chunk-wise patchification granularity: Chunk-level temporal offsets may blur intra-chunk timing. Test alternatives with fine-grained temporal encodings or multiple frames per chunk to preserve local order.
- Instruction-tuning coverage: The 1.5M LLaVA-Next(+Videos) corpus may under-represent certain video reasoning skills. Assess sensitivity to other instruction sets and richer temporal supervision (e.g., narration alignment).
- Streaming and on-device use: Can the approach operate without full video access (online patch selection, low-power devices)? Prototype lightweight motion estimation to approximate codec signals on mobile/embedded hardware.
- Interpretability of selected patches: Provide visualizations and human studies linking codec-selected patches to semantic evidence; quantify how selection aligns with human judgments across tasks.
- Security/adversarial robustness: Are motion/residual signals vulnerable to adversarial edits (e.g., subtle periodic patterns that hijack selection)? Test patch selection stability under corruption, compression artifacts, and adversarial perturbations.
- Data deduplication and leakage: The Union-Find dedup step is described but not quantified across sources. Report duplication rates, cross-source leakage, and any overlap with evaluation sets to ensure clean splits.
- Probe architecture disclosure: “Attentive probe with frozen backbones” is under-specified. Release probe design and training details; evaluate sensitivity to probe capacity and regularization to ensure fair backbone comparisons.
Practical Applications
Immediate Applications
The following applications can be deployed with current models and tooling described in the paper and associated open-source releases. Each item includes sectors, potential tools/workflows, and key assumptions or dependencies.
- Healthcare: real-time analysis of endoscopy, ultrasound, and laparoscopic videos to flag motion-based anomalies or procedural events
- Potential tools/workflows: integrate OneVision-Encoder (OV-Encoder) into existing PACS or imaging pipelines; use codec-derived motion vectors from HEVC streams to prioritize dynamic regions; deploy as an inference microservice behind hospital video storage
- Assumptions/dependencies: access to codec signals (motion vectors, residuals) during decoding; domain adaptation for medical imagery; regulatory validation (FDA/CE) before clinical use
- Manufacturing and Industrial IoT: anomaly/event detection on production-line cameras with reduced compute
- Potential tools/workflows: GStreamer/FFmpeg plugins that expose codec-guided patchification; edge inference on NVIDIA Jetson-class devices using OV-Encoder; alerting pipelines in SCADA/MES systems
- Assumptions/dependencies: consistent video codecs and stable network/storage; ability to extract motion/residual for VP9/AV1 if HEVC not used (adapter needed)
- Security and Smart Cities: lightweight multi-camera analytics for intrusion, loitering, or crowd motion patterns
- Potential tools/workflows: deploy OV-Encoder with 87.5% token reduction to process longer clips per camera; integrate with VMS platforms for content moderation and event triage
- Assumptions/dependencies: privacy and data governance compliance; real-time decode access to codec metadata; robustness across diverse lighting/weather conditions
- Robotics (mobile and warehouse): on-device perception that focuses on dynamic regions for navigation and manipulation
- Potential tools/workflows: ROS node using OV-Encoder with 3D-RoPE for irregular token layouts; patch-budget scheduling for battery-constrained robots; action recognition for grasping or obstacle avoidance
- Assumptions/dependencies: reliable extraction of temporal signals on embedded hardware; latency constraints met via sparse patch selection; safety validation
- Education and EdTech: AI tutors for diagrams and charts, plus lecture video Q&A and summarization
- Potential tools/workflows: pair OV-Encoder with Qwen3-4B or LLaVA-Next-Videos to power AI2D/ChartQA/DocVQA tasks; automatic slide/video Q&A generation focusing on speaker or board motion
- Assumptions/dependencies: access to instruction-tuning corpora; data privacy for recorded lectures; handling of domain-specific diagrams
- Finance and Enterprise Operations: document understanding at scale (invoices, statements, contracts) and chart analytics
- Potential tools/workflows: batch OCR + DocVQA with OV-Encoder-stage2; ERP/CRM connectors that export documents for processing; dashboards with extracted fields and chart insights
- Assumptions/dependencies: OCR quality and domain tuning; compliance with data retention policies; multilingual document handling
- Media and Sports Analytics: highlight detection, player/action recognition, and timeline summarization from long-form broadcasts
- Potential tools/workflows: codec-aligned token scheduler for match-long feeds; event extraction pipelines; ad-insertion/thumbnailing tools using motion-rich patches
- Assumptions/dependencies: integration with broadcast codecs and asset management systems; rights and licensing for content; robustness to camera cuts and overlays
- Streaming Platforms and Content Moderation: scalable server-side video understanding with lower GPU hours per hour of content
- Potential tools/workflows: inference services that process I/P-frame sequences and sparse P-frame patches; policy compliance checks (violence, self-harm) with higher throughput
- Assumptions/dependencies: cost models tied to GPU utilization; maintaining accuracy with sparse tokens; adjustments for non-HEVC codecs
- Software/ML Engineering: drop-in replacement backbone for multimodal LLMs to cut inference cost while improving accuracy
- Potential tools/workflows: Hugging Face models and GitHub training code; adapters for Qwen3-VL/LLaVA-Next; MLOps recipes for patch-budget scaling and attentive pooling
- Assumptions/dependencies: compatibility with existing inference stacks; retraining or alignment tuning for target tasks; monitoring for domain drift
- Energy and Sustainability: lower carbon footprint per processed video via codec-aligned sparsity
- Potential tools/workflows: “green AI” dashboards that report token reductions and kWh savings; capacity planning that leverages sparse patch processing for long clips
- Assumptions/dependencies: accurate metering of compute and energy; codec metadata availability; organizational willingness to prioritize efficiency
- Daily Life: smarter home cameras and mobile apps (document scanning, whiteboard capture, AR overlays) that run longer on battery
- Potential tools/workflows: mobile SDK using OV-Encoder for motion-aware capture; home NVR firmware update to enable sparse processing; app features for chart/table comprehension with OCR
- Assumptions/dependencies: mobile hardware support for codec metadata; per-app privacy controls; UI to surface confidence and errors
- Policy and Governance: procurement and benchmarking frameworks that reward codec-aligned efficiency and transparency
- Potential tools/workflows: evaluation checklists referencing patch-budget metrics; reporting templates for reproducible multimodal research (as released by the authors)
- Assumptions/dependencies: adoption by standards bodies; clarity on HEVC patent/licensing implications; datasets audited for bias and consent
Long-Term Applications
These applications require further research, scaling, domain adaptation, or ecosystem development before widespread deployment.
- Autonomous Driving and ADAS: real-time multimodal perception using codec-aligned sparsity across multi-sensor video
- Potential tools/workflows: fusion pipelines that prioritize motion-residual evidence across cameras; hardware decode blocks exposing motion vectors to ML accelerators
- Assumptions/dependencies: safety certification; AV1/VP9/automotive codecs support for motion/residual extraction; rigorous long-tail validation
- AR Glasses and Wearables: on-head multimodal assistants that understand scenes, documents, and activities continuously
- Potential tools/workflows: edge model compression with 3D-RoPE; chunk-wise patchification for low-latency streaming; context-aware summarization across hours of video
- Assumptions/dependencies: ultra-low-power hardware; privacy-preserving continual learning; ergonomic UX for live assistance
- Video Search Engines and Knowledge Bases: indexing the world’s videos via predictive compression signals for fine-grained retrieval
- Potential tools/workflows: “video GPT” pipelines that cluster object/motion semantics at web scale; semantic timelines for rapid navigation
- Assumptions/dependencies: robust cross-domain generalization; scalable offline clustering beyond 1M concepts; copyright and consent frameworks
- Sign Language and Human Motion Understanding: robust recognition and translation focused on motion-centric patches
- Potential tools/workflows: specialized motion-residual adaptation for hands/face; multi-view fusion; downstream language generation with LMMs
- Assumptions/dependencies: high-quality labeled datasets; cultural and linguistic nuance modeling; deployment in accessibility platforms
- Event-Based and Neuromorphic Vision Synergy: unify codec-aligned sparsity with event-camera streams for ultra-efficient perception
- Potential tools/workflows: hybrid tokenizers that map events and codec signals into joint sparse token layouts; custom 3D positional encoding for asynchronous inputs
- Assumptions/dependencies: hardware availability and integration; new training objectives; benchmarks and evaluation methodology
- Hardware-Software Co-Design: accelerators and drivers that expose codec motion/residual signals natively to ML stacks
- Potential tools/workflows: FFmpeg/GStreamer extensions; GPU/ISP firmware enabling “patchification-first” pipelines; memory layouts optimized for sparse tokens
- Assumptions/dependencies: vendor support; standards for motion/residual APIs; performance-portability across devices
- Clinical Decision Support from Long-Form Procedures: continuous reasoning over hours-long surgical videos to surface rare events
- Potential tools/workflows: chunk-wise temporal modeling with shared 3D-RoPE; multimodal integration with sensor logs; audit trails for review
- Assumptions/dependencies: longitudinal validation; medico-legal considerations; secure storage and compute
- Organizational Policy and Standards: formal efficiency metrics, bias audits, and reproducibility mandates for multimodal systems
- Potential tools/workflows: codec-aligned efficiency benchmarks; model cards that report patch budgets and token reductions; governance templates
- Assumptions/dependencies: cross-industry consensus; updates to procurement standards; open datasets with clear licensing
- Enterprise Knowledge Assistants: unified assistants that reason over video, documents, and workflows with sparse token budgets
- Potential tools/workflows: integrated OV-Encoder backbones in enterprise LLMs; pipelines that join DocVQA, ChartQA, and VideoQA for incident analysis
- Assumptions/dependencies: secure data lakes; robust RAG across multimodal content; alignment tuning on proprietary domains
- Green Data Centers: capacity planning and scheduling optimized for sparse multimodal workloads to reduce energy and cooling
- Potential tools/workflows: cluster schedulers aware of patch budgets; SLA definitions based on sparse throughput; carbon accounting integrated with ML jobs
- Assumptions/dependencies: observability across codecs and ML layers; organizational incentives for sustainability; interoperability with cloud providers
Notes on cross-cutting assumptions:
- Codec compatibility: the approach relies on motion vectors and residuals; while HEVC is central, similar signals exist in AV1/VP9 but require engineering to extract and align.
- Licensing/IP: HEVC may involve patent licensing; organizations should assess legal implications for production deployments.
- Data quality and bias: web-scale pretraining introduces biases; domain-specific fine-tuning and audits are recommended for sensitive sectors (healthcare, public safety).
- Integration with LMMs: performance gains depend on careful alignment tuning (e.g., LLaVA-Next-Videos) and appropriate patch budgets; retraining may be needed for target domains.
- Hardware constraints: on-edge deployments must ensure decode access to motion/residual signals and maintain low-latency tokenization and inference.
Glossary
- 3D-RoPE: A three-dimensional Rotary Position Embedding that encodes relative temporal and spatial offsets to support attention over sparse, irregular video tokens. "3D-RoPE for Codec Patchification."
- Attentive probe: An evaluation setup where a lightweight attention head probes fixed (frozen) backbone features to assess representation quality. "We report top-1 accuracy (%) using an attentive probe with frozen backbones,"
- Attentive Pooling Head: A multi-head attention module used to aggregate spatiotemporal tokens into compact class embeddings. "Attentive Pooling Head."
- Bi-directional attention-based vision encoder: A vision encoder that uses attention in both directions across tokens to model dependencies for images and videos. "OV-Encoder provides a bi-directional attention-based vision encoder that effectively supports image and video understanding."
- Bitstream: The compressed sequence of coded data (from a video codec) that represents frames via motion vectors and residuals. "which is represented in the bitstream by motion vectors and a residual signal."
- Chunk-wise Patchification: A codec-aligned temporal sampling strategy that divides videos into chunks and patchifies one sampled frame per chunk with chunk-level positional encoding. "Chunk-wise Patchification: a codec-inspired temporal patchification scheme that partitions video streams into fixed-length chunks and constructs patch-level representations with chunk-level positional encoding."
- Cluster discrimination objective: A self-supervised learning objective that contrasts samples against semantic cluster centroids to enforce structured representation learning. "we adopt a self-supervised cluster discrimination objective"
- Codec Patchification: A codec-inspired input formulation that selects and organizes informative patches using codec-derived temporal signals. "We introduce Codec Patchification, a codec-inspired input formulation that leverages codec-derived temporal signals to selectively encode informative visual patches (3.1%-25%) from dense video, while unifying video, chunk-wise sampling, and single-image inputs with 3D-RoPE."
- Coding units (CUs): Variable-sized blocks in HEVC used for motion estimation/compensation, where all pixels in a block share the same motion vector. "P-frames are partitioned into coding units (CUs) with variable sizes ranging from 4×4 to 64×64, and all pixels within a CU share the same motion vector."
- Concept bank: A large global set of clustered semantic centers used as anchors/targets for discrimination during training. "a global concept bank of clustered centers"
- Contrastive learning: A paradigm that learns by bringing semantically similar pairs together and pushing dissimilar ones apart, often using instance-level supervision. "contrastive learning paradigms (e.g., CLIP, SigLIP) focus on instance-level discrimination"
- Counterfactual Motion Replacement: An intervention that replaces motion patches with counterfactual ones to test causal reliance on motion signals. "Counterfactual Motion Replacement (50\%)"
- Dense Video-Codec Patchification: A HEVC-inspired formulation that preserves dense temporal coverage while selecting only salient patches from predicted frames. "Dense Video-Codec Patchification: a codec-inspired video encoding formulation that leverages motion-centric temporal signals exposed by P-frames to patchify selected visual regions (3.1%-25%) in dense video inputs, while preserving dense temporal coverage."
- Group of Pictures (GOP): A codec structure that segments video into groups containing one intra-coded frame and multiple predicted frames. "each video is divided into Groups of Pictures (GOP),"
- HEVC (High Efficiency Video Coding): A modern video compression standard (H.265) that uses inter-frame prediction via motion vectors and residuals. "H.265/HEVC (High Efficiency Video Coding)"
- I-frame: An intra-coded frame that encodes a full image to establish global spatial context within a GOP. "intra-coded frames (I-frames) that establish global context"
- Luma residual: The luminance component of the codec residual decoded into the pixel domain as a measure of unpredictable appearance change. "we decode the luma residual into the pixel domain"
- Modality-agnostic: Designed to work uniformly across different data modalities (e.g., images and videos) without modality-specific changes. "enabling structured and modality-agnostic visual representation learning."
- Motion compensation: A codec mechanism that predicts current frames from reference frames using estimated motion, with remaining errors stored as residuals. "encode inter-frame variations via motion compensation and residuals"
- Motion vectors: Displacement vectors representing block-level motion between frames used for inter-frame prediction. "motion is represented by motion vectors "
- Native-resolution processing: A strategy that processes inputs at their original resolution to preserve fine details. "together with a native-resolution processing strategy."
- Non-motion Patch Replacement: An intervention that replaces non-motion patches to evaluate the importance of selected motion patches. "Non-motion Patch Replacement (50\%)"
- Object permanence: The property of objects persisting over time, used here as a target for temporally coherent representations. "jointly capturing object permanence and motion dynamics."
- Patch budget: A constraint on the number of patches/tokens allowed, controlling compute while scaling frames. "Patch budgets of 512/1024/2048/4096 correspond to 2/4/8/16 video frames, respectively."
- Patchification: The process of dividing images/frames into fixed-size patches to form token sequences for transformer encoders. "let denote patchification with patch size "
- P-frame: A predicted frame encoded via motion-compensated differences relative to reference frames. "predicted frames (P-frames) that encode inter-frame variations via motion compensation and residuals"
- Residual signal: The part of a frame not explained by motion compensation, capturing appearance changes. "a residual signal that captures appearance changes not explained by motion compensation"
- RoPE (Rotary Position Embedding): A positional encoding method that represents relative positions through rotations in the embedding space. "3D Rotary Position Embedding (RoPE)"
- Saliency score: A measure computed per patch (from motion magnitude and residual energy) to select informative regions. "we compute a patch level saliency score by aggregating the codec exposed motion magnitude and residual energy"
- Signal entropy: A measure of unpredictability/information content in a region, guiding sparse computation. "regions rich in signal entropy."
- Sparse Patch Selection: The process of selecting only a fixed proportion of salient patches based on codec-derived cues. "Sparse Patch Selection."
- Token budget: A cap on the total number of tokens processed, often set per clip to control efficiency. "Under our default setting (64 frames, GOP size 32, token budget 2048, )"
- Top-1 accuracy: The percentage of samples where the top predicted label matches the ground truth. "We report top-1 accuracy (%)"
- Video–language alignment: The alignment of video representations with language, enabling multimodal understanding. "video--language alignment."
- Vision backbone: The core feature extractor (e.g., ViT) within a larger multimodal model. "vision backbones such as Qwen3-ViT and SigLIP2"
- Visual tokens: Patch-level tokens representing visual content for transformer processing. "despite using substantially fewer visual tokens"