MolmoAct2-BimanualYAM: Universal Codec Framework

Updated 3 July 2026

The paper introduces a universal visual codec that unifies compression for both human viewing and machine tasks by aligning neural representations with codec signal sparsity.
It employs codec-aligned tokenization, integration of multi-modal side information, and temporal conditioning to optimize rate–distortion–perception trade-offs.
Quantitative evaluations show improvements in PSNR, mAP, and perceptual metrics over traditional codecs at practical latency, demonstrating enhanced perceptual quality and scalability.

A universal visual codec is an integrated computational framework designed to process, compress, and transmit visual data—images and/or video—across diverse use cases, codecs, modalities, and consumption scenarios for both human users and machine agents. The universal visual codec paradigm aspires to unify the goals of efficient compression, rich semantic preservation, interoperability, and adaptability, thereby subsuming and generalizing traditional single-purpose codecs. Modern instantiations combine information-theoretic principles, deep neural architectures, task-driven representations, and cross-modality priors to deliver optimized rate–distortion–perception (RDP) trade-offs and seamless compatibility with a heterogeneous codec ecosystem (Chen et al., 2021, Gao et al., 2024, Tang et al., 9 Feb 2026, Zhang et al., 5 Mar 2026).

1. Foundational Principles and Motivations

Universal visual codecs are grounded in the equivalence between visual understanding and predictive compression. This is formalized by identifying “surprise”—the unpredictable residual in visual signals, encoded by standard codecs’ motion vectors and residuals—as the true carrier of semantic and discriminative information. The universal codec focuses computation and representation on these high-information regions. The overarching goal is to achieve maximum compression by aligning neural representations with codec-exposed signal sparsity, while also supporting downstream visual analysis directly at the bitstream or intermediate representation stage (Tang et al., 9 Feb 2026).

A key motivation arises from the proliferation of use scenarios, such as human-in-the-loop viewing, automated surveillance, autonomous navigation, and machine perception at scale. Existing codecs are typically tuned for either peak signal-to-noise ratio (PSNR), perceptual quality, or machine task fidelity, but not all simultaneously. The universal visual codec seeks to balance and adapt these requirements, achieving a scalable, modular, and extensible compression framework that supports both human and machine-centric goals (Chen et al., 2021, Gao et al., 2024).

2. Core Methodologies and Architectures

Several methodological pillars characterize universal visual codec design.

2.1 Codec-Aligned Tokenization and Predictive Sparsity

As introduced in OneVision-Encoder (Tang et al., 9 Feb 2026), visual inputs (images or videos) are partitioned into local patches, and a codec-aligned saliency score is computed for each patch using motion vectors and residual energy from compressed bitstreams (e.g., HEVC/H.264). Only patches with top residual energy—typically 3.1%–25% of the total—are selected, resulting in an explicit sparsity ratio:

$s = \frac{|\textrm{active patches}|}{|\textrm{total patches}|}\,, \qquad s \in [0.031, 0.25]$

Spatial–temporal reasoning is unified via 3D rotary positional encoding (3D-RoPE) applied to the indices $(t,x,y)$ of each selected patch, allowing the transformer backbone to process irregular layouts and preserve context.
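The positional scheme can be illustrated with a minimal NumPy sketch of one common way to build a 3D rotary encoding. It assumes the feature dimension is split into three equal chunks, one per axis of $(t,x,y)$, with a standard 1D rotary rotation applied to each chunk. The chunking scheme, the frequency base of 10000, and the names `rope_1d` and `rope_3d` are illustrative assumptions, not details confirmed for OneVision-Encoder.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (n, d) by angles pos * freq."""
    n, d = x.shape
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)     # (d/2,) rotation frequencies
    ang = pos[:, None] * freqs[None, :]           # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, txy):
    """Apply 1D RoPE per axis on three equal chunks of the feature dimension.

    x:   (n, d) float features of the selected patches, d divisible by 6
    txy: (n, 3) integer indices (t, x, y) of each selected patch
    """
    n, d = x.shape
    assert d % 6 == 0, "need three even-sized chunks"
    c = d // 3
    parts = [
        rope_1d(x[:, i * c:(i + 1) * c], txy[:, i].astype(float))
        for i in range(3)
    ]
    return np.concatenate(parts, axis=1)
```

Because each chunk is rotated by an angle proportional to its index, the dot product between two encoded patches depends only on their offset in $(t,x,y)$, which is why an irregular, sparse set of patches can still be processed without a dense grid.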

2.2 Codec Repository and Multi-Grained Textual Coding

The UniMIC framework (Gao et al., 2024) proposes a visual codec repository: a modular “plug-in” architecture aggregating traditional (JPEG, VVC, etc.) and neural codecs as base models. Workflows flexibly accommodate any base codec and bitrate, with no retraining. To bridge semantic gaps and support perceptual refinement, multi-grained textual coding injects automatically generated content prompts and compression prompts, both compressed and transmitted as side information.

2.3 Unified Intra/Inter Coding and Temporal Conditioning

Uni-LVC (Zhang et al., 5 Mar 2026) extends universal codec properties to both image (intra) and video (inter) compression within a single architecture. Built atop a powerful intra-codec, the model handles inter coding as intra coding conditioned on temporal embeddings extracted from reference frames. The integration utilizes a cross-attention adaptation module, reliability-aware gating, and a multistage curriculum for seamless support of low-delay, random-access, and variable-rate scenarios.

2.4 Semantic Profiling and Task-Driven Bitstream Decomposition

Other paradigms, as in (Chen et al., 2021), explicitly profile high-level scene semantics (instance segmentation) and low-level features, encoding each as separate lossless or lossy streams. This enables tiered transmission: minimal bitstreams for machine understanding (e.g., object detection, segmentation) and full bitstreams for high-fidelity human viewing, with intermediate levels accommodating both task classes.

3. Mathematical Formulations of Universal Visual Codec Pipelines

Universal codec encoders and decoders follow structured mathematical workflows.

3.1 Predictive Codec Patchification (OneVision-Encoder)

Given video frames, patches are scored:

$\textrm{score}_k = \sum_{(x,y)\in\textrm{patch}_k} \Vert d(x,y)\Vert_2 + \lambda \sum_{(x,y)\in\textrm{patch}_k} |R(x,y)|^2$

where $d(x,y)$ denotes motion vectors and $R(x,y)$ denotes residuals. Patches with the highest scores are selected, forming the active patch set $\Omega$. The chosen patches are embedded, assigned a 3D-RoPE positional encoding, and processed by a ViT-style backbone.
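The scoring and selection step can be sketched in NumPy as follows. This is a minimal illustration for a single frame, assuming the motion field and residual have already been decoded to per-pixel arrays (real bitstreams store motion vectors per block). The patch size of 16, the default $\lambda = 1$, and the function names `patch_scores` and `select_active` are assumptions made for the example.

```python
import numpy as np

def patch_scores(mv, residual, patch=16, lam=1.0):
    """Score each patch as sum ||d(x,y)||_2 + lam * sum |R(x,y)|^2.

    mv:       (H, W, 2) per-pixel motion vectors d(x, y)
    residual: (H, W) residual signal R(x, y)
    Returns a flat array with one score per patch, in row-major patch order.
    """
    H, W = residual.shape
    gh, gw = H // patch, W // patch
    mv_mag = np.linalg.norm(mv, axis=-1)           # ||d(x, y)||_2
    energy = residual.astype(float) ** 2           # |R(x, y)|^2
    per_pixel = mv_mag + lam * energy
    blocks = per_pixel[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return blocks.sum(axis=(1, 3)).ravel()

def select_active(scores, s=0.25):
    """Return sorted indices of the top-scoring fraction s of patches."""
    k = max(1, int(round(s * scores.size)))
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)
```

The returned indices play the role of the active set $\Omega$, and `len(select_active(scores, s)) / scores.size` gives the achieved sparsity ratio.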

3.2 Unified Rate–Distortion–Perception Objective (UniMIC)

Training objective for perceptual-conditioned refinement:

$\mathcal{L}(\theta) = R_b(x_v) + R_t(\textrm{ConP}, \textrm{ComP}) + \lambda D(x, x_v) + \mu \mathbb{E}_{t,\epsilon}\|\epsilon-\epsilon_\theta(z_t, t, C)\|^2$

where $R_b$ is base codec bitrate, $R_t$ is textual side-information cost, $D(x, x_v)$ is pixel-wise distortion, and the last term is a diffusion-based perceptual loss under conditioning from content and compression prompts.
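To make the four terms concrete, the objective can be evaluated numerically as in the sketch below. It assumes mean squared error for $D(x, x_v)$ and a Monte Carlo average over sampled noise vectors for the diffusion term. The function name `rdp_loss`, the argument layout, and the use of precomputed rates and noise predictions are illustrative assumptions, not the UniMIC training code.

```python
import numpy as np

def rdp_loss(rate_base, rate_text, x, x_v, eps, eps_pred, lam=1.0, mu=1.0):
    """Evaluate R_b + R_t + lam * D + mu * E||eps - eps_theta||^2.

    rate_base: bitrate of the base codec, R_b(x_v)
    rate_text: bitrate of the content and compression prompts, R_t
    x, x_v:    original and base-decoded images (same shape)
    eps:       (n_samples, d) sampled noise vectors
    eps_pred:  (n_samples, d) noise predicted by the conditioned denoiser
    """
    distortion = np.mean((x - x_v) ** 2)                           # D as MSE (assumption)
    perceptual = np.mean(np.sum((eps - eps_pred) ** 2, axis=-1))   # sample mean of squared norm
    return rate_base + rate_text + lam * distortion + mu * perceptual
```

In this form the trade-off is explicit: $\lambda$ weights pixel fidelity against rate, while $\mu$ weights the diffusion-based perceptual term.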

3.3 Intra/Inter Conditioning, Cross-Attention, and Adaptive Gating (Uni-LVC)

Temporal conditioning introduces reliability-aware gating, in which a gate learned by a classifier scores how trustworthy each reference is. The encoder conditions on the temporal embedding when the reference is judged reliable and falls back to intra mode for unreliable references. Cross-attention operates in both local deformable and global polarity-aware branches to combine spatial and temporal cues adaptively.
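The fallback behaviour can be illustrated with a deliberately simple sketch. Everything specific here is an assumption made for illustration: the sigmoid gate, the additive blend of temporal context into the intra features, the 0.5 threshold, and the name `gated_condition`. It is not the gating equation used by Uni-LVC, only an example of a gate that reduces to pure intra coding when the reference is unreliable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_condition(intra_feat, temporal_ctx, reliability_logit, threshold=0.5):
    """Blend temporal context into intra features under a reliability gate.

    reliability_logit: hypothetical scalar output of a reliability classifier
    Returns (features, gate). A gate of 0.0 means the intra fallback was used.
    """
    g = sigmoid(reliability_logit)
    if g < threshold:
        return intra_feat, 0.0                     # unreliable reference: intra only
    return intra_feat + g * temporal_ctx, g        # reliable: add gated temporal context
```

With a strongly negative logit the features pass through unchanged, which mirrors the idea that inter coding is intra coding plus optional temporal conditioning.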

3.4 Semantic Indexing, Profiling, and Multi-Stream Transmission

In semantic-profiling-based codecs, the segmentation/instance mask is converted to an index map in which each entry combines a class label and an instance ID; the index map is then compactly encoded using a lossless codec (e.g., FLIF).
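One simple way to combine a class label and an instance ID into a single index is shown below. The packing rule (class times a fixed instance capacity, plus instance ID), the capacity of 256, and the names `pack_index` and `unpack_index` are assumptions chosen for illustration and are not taken from the cited paper. The lossless coding stage itself (e.g., FLIF) is not shown.

```python
import numpy as np

def pack_index(class_map, instance_map, max_instances=256):
    """Pack per-pixel class and instance ID into a single integer index map."""
    assert instance_map.max() < max_instances, "instance ID exceeds capacity"
    return class_map.astype(np.int64) * max_instances + instance_map

def unpack_index(index_map, max_instances=256):
    """Recover (class_map, instance_map) from a packed index map."""
    return index_map // max_instances, index_map % max_instances
```

Because the packing is invertible, a decoder that receives only the losslessly coded index map can recover both the semantic classes and the individual instances.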

4. Implementation Details and Network Designs

A range of architectural choices realize the universal visual codec paradigm:

OneVision-Encoder employs a ViT-Large-style transformer (24 layers, 1024-dim, 16 heads), embedding only codec-selected sparse patches via linear projections. Unified 3D-RoPE encodes spatial-temporal relationships.
UniMIC leverages a frozen Stable Diffusion 2.1 UNet and VAE as the “universal perception compensator”. Adaptation is achieved by text-conditioning (CLIP) and a tiny MLP adapter modulating intermediate features, with the VAE decoder fine-tuned and skip connections for improved detail preservation.
Uni-LVC architecture consists of pixel-unshuffle and depthwise-conv blocks for the encoder, hierarchical progressive context models (HPCM) for entropy coding, and lattice vector quantization (LVQ). Temporal reference integration is achieved by deformable and polarity-aware cross-attention modules at multiple network locations. A reliability classifier adaptively controls temporal conditioning, and the loss function combines rate, distortion, classification, and regularization terms.
Semantic Profiling Codecs utilize Mask R-CNN for semantic extraction, a small convolutional network for low-level features, and a cascade of lossless and lossy coders (FLIF and VVC) for multistream encoding.

5. Quantitative Comparisons and Performance Characteristics

Universal visual codecs yield state-of-the-art results in both image and video benchmarks, often outperforming specialized codecs.

5.1 Image and Video Rate-Distortion

On COCO17-Val and Kodak, semantic-profiling codecs achieve higher PSNR than JPEG2000 and BPG at low bitrates (in the reported low-bitrate example: JPEG2000 27.9 dB, BPG 28.4 dB, Ours 29.2 dB, i.e. gains of about 1.3 dB and 0.8 dB respectively) (Chen et al., 2021).
For instance segmentation and detection, universal codecs deliver higher mAP under minimal-rate semantic streams: mAP 42.8% vs. 33.1% (BPG) for detection and 36.4% vs. 29.2% (BPG) for segmentation at around 0.02 bpp.
UniMIC reduces FID by up to 96.9% when applied on top of VTM, and delivers consistently large LPIPS gains across eight base codecs and datasets with minimal (3–5%) rate overhead for text (Gao et al., 2024).
OneVision-Encoder reports an average improvement over strong ViT and SigLIP2 baselines on video understanding tasks under identical token budgets, and further gains on hard benchmarks while using fewer tokens (Tang et al., 9 Feb 2026).
Uni-LVC outperforms prior state-of-the-art LVCs in both intra- and inter-mode, with BD-Rate reported relative to VTM for all-intra (AI), low-delay, and random-access coding, the random-access result also being better than DCVC-B (Zhang et al., 5 Mar 2026).
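For readers unfamiliar with the BD-Rate figures quoted for learned video codecs, the sketch below shows the classic Bjøntegaard delta-rate computation: fit log-rate as a cubic polynomial of PSNR for each codec, integrate both fits over the shared PSNR range, and convert the mean log-rate difference into a percentage. This is the textbook cubic-fit variant; the cited papers may use a different interpolation (for example piecewise cubic), and the function name `bd_rate` is an assumption.

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average bitrate change (%) of the test codec vs. the reference at equal PSNR.

    Negative values mean the test codec needs fewer bits for the same quality.
    Each curve needs at least four rate/PSNR points for the cubic fit.
    """
    log_ref = np.log(np.asarray(rate_ref, dtype=float))
    log_test = np.log(np.asarray(rate_test, dtype=float))
    fit_ref = np.polyfit(psnr_ref, log_ref, 3)      # log-rate as a cubic in PSNR
    fit_test = np.polyfit(psnr_test, log_test, 3)

    lo = max(min(psnr_ref), min(psnr_test))         # overlapping quality range
    hi = min(max(psnr_ref), max(psnr_test))

    int_ref = np.polyint(fit_ref)
    int_test = np.polyint(fit_test)
    area_ref = np.polyval(int_ref, hi) - np.polyval(int_ref, lo)
    area_test = np.polyval(int_test, hi) - np.polyval(int_test, lo)

    avg_log_diff = (area_test - area_ref) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```

As a sanity check on the definition, a test codec that needs exactly half the bitrate of the reference at every quality level yields a BD-Rate of −50%.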

5.2 Efficiency and Scalability

Model sizes of universal visual codecs are compact: the full Uni-LVC model (AI plus temporal modules) is reported to be smaller than older heavy models.
Latency remains practical: Uni-LVC achieves intra-encode and decode times of 0.071 s and 0.062 s per 1080p frame, compared to 0.022 s of DCVC-RT.
Peeling and compositional workflows offer scalability: tiered bitstream transmission in semantic profiling codecs enables trade-offs between bit cost and consumability (machine vs. human) (Chen et al., 2021).

Codec Framework	Target Modalities	Main Strengths
OneVision-Encoder	Video/Image/Doc	Predictive sparsity, multimodal adaptation
UniMIC	Image (all codecs)	Plug-in flexibility, perceptual refinement
Uni-LVC	Video (AI/LD/RA)	Unified intra/inter, rate-distortion gains
Semantic Profiling	Image	Task-driven tiered bitstreams

6. Applications, Interoperability, and Extensibility

Universal visual codecs demonstrate broad interoperability and extensibility:

Task-driven scalability: Tiered or “peeled” bitstreams enable minimal transmission for machine understanding and progressive refinement for human consumption, supporting scenarios from automated analytics to high-fidelity display.
Codec-agnostic integration: UniMIC accommodates unseen codecs and novel quality levels, enabling drop-in compatibility with legacy infrastructure and generalization to new coding standards (Gao et al., 2024).
Cross-modal grounding: Textual side information (prompts) and semantic labels can be directly injected, supporting joint retrieval, captioning, and cross-modal downstream vision-language tasks.
Document and video adaptation: Sparse, codec-aligned encoding natively supports irregular visual data (e.g., scanned documents, video sequences with motion).

A plausible implication is that as application heterogeneity increases, universal visual codecs will subsume specialized, per-task codecs, driven by their demonstrated ability to simultaneously optimize rate, distortion, and perception for both humans and machines.

7. Challenges and Prospects

Current universal visual codecs achieve or surpass legacy codecs in quantitative performance, but open challenges remain:

Scalability to ultra-high resolution, HDR, or complex color formats—a direction suggested in the future prospects of Uni-LVC.
Task-specific compression: Emerging needs for semantics-preserving or privacy-preserving coding (e.g., class labels, OCR tags) can be natively integrated (Tang et al., 9 Feb 2026).
Latency vs. generality trade-off: While universal codecs are efficient, highly specialized codecs may exhibit lower raw latency; optimizing for deployment-specific constraints is an ongoing area of improvement.
Unified evaluation protocols: As codecs become more universal, comprehensive benchmarks covering machine and human metrics (e.g., PSNR/MS-SSIM, FID/LPIPS, mAP, latency, bitstream cost) are necessary for objective comparison.

With continuing advances in discriminative and generative modeling, compositional conditioning, and semantic-task-aware design, the universal visual codec is positioned as a foundational component for future multimodal and perceptual AI systems. The research trajectory suggests further convergence across codecs, modalities, and consumption scenarios, with codec-aligned sparsity emerging as a guiding principle for compression and understanding (Tang et al., 9 Feb 2026, Chen et al., 2021, Gao et al., 2024, Zhang et al., 5 Mar 2026).

References

Chen et al. (2021). A New Image Codec Paradigm for Human and Machine Uses.

Gao et al. (2024). UniMIC: Towards Universal Multi-modality Perceptual Image Compression.

Tang et al. (9 Feb 2026). OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence.

Zhang et al. (5 Mar 2026). Uni-LVC: A Unified Method for Intra- and Inter-Mode Learned Video Compression.