Progressive Visual Compression
- Progressive Visual Compression (PVC) incrementally refines decoded visual data by transmitting prioritized bitstream fragments, so that each additional fragment improves image, video, or 3D reconstruction fidelity.
- PVC leverages transform coding, learned autoencoders, and token aggregation to enable scalable, adaptive compression across diverse visual modalities.
- PVC methodologies prioritize rate–distortion optimization and structured encoding, supporting applications from standard imagery to interactive volumetric video streaming.
Progressive Visual Compression (PVC) refers to a class of visual coding methods—spanning images, video, 3D/4D geometric data, and tokenized representations for vision-language models—that enable incremental transmission and decoding of visual signals. In PVC, each additional fragment of the bitstream brings improved fidelity, with the architecture and bitstream designed so that any prefix yields a plausible approximation. PVC subsumes and generalizes classic transform coding, learned autoencoders with scalable bitstreams, trit-plane decompositions, nested quantization, hierarchical token aggregation, and adaptive masking or prioritization schemes.
1. Key Principles and Taxonomy of PVC
PVC is built upon the idea of progressive refinement: the decoder reconstructs intermediate approximations by partially decoding the bitstream up to the current truncation point. This paradigm was first evident in transform coding approaches, such as JPEG (DCT-based) and JPEG2000 (wavelet-based), where low-frequency coefficients are transmitted first (Pandharkar et al., 2011). Modern neural approaches extend PVC to learned latent spaces, enabling fine-grained scalability and prioritization based on rate–distortion criteria (Lee et al., 2021, Lu et al., 2021, Jeon et al., 2023, Hojjat et al., 2023, Presta et al., 15 Nov 2024).
Core variants include:
- Hierarchical (coarse-to-fine) coding: Multi-scale decomposition in latent or spatial domains (MSP model) (Zhang et al., 2022), voxel or Gaussian layers for volumetric video (Zheng et al., 22 Sep 2025), or visual tokens in VLMs (Yang et al., 12 Dec 2024, Sun et al., 26 Nov 2025).
- Plane-wise decomposition: Bit-plane (binary), trit-plane (ternary), or slice-wise latent structuring, often with RD-prioritized ordering (Lee et al., 2021, Jeon et al., 2023).
- Token aggregation: Progressive selection and aggregation of patch, window, or frame tokens for native-resolution ViTs or VLM encoders (Sun et al., 26 Nov 2025, Yang et al., 12 Dec 2024, Li et al., 1 Apr 2025).
- Residual and masking strategies: Structured residual extraction and masking based on entropy or variance, yielding element-wise granularity (Presta et al., 15 Nov 2024).
PVC supports fine-grained scalability, variable-rate operation (via truncation or prioritization), and robust previews under partial reception.
2. Methodologies for PVC: Transform Coding, Neural, and Geometric Approaches
Classic PVC via Transform Coding
Traditional PVC adopts orthonormal basis projections—DCT, wavelets, Fourier—emitting coefficients in ascending significance order (Pandharkar et al., 2011). For an $n$-dimensional visual signal $x$, measurements $y = \Phi_k x$ use the first $k$ rows of an orthonormal basis matrix $\Phi$ (the most significant basis functions), yielding progressive fidelity by coefficient truncation:

$$\hat{x}_k = \Phi_k^{\top} \Phi_k x = \sum_{i=1}^{k} \langle x, \phi_i \rangle \phi_i.$$

The approach is robust for 2D images, but less so for high-dimensional data (multispectral/light-field), where randomized projections or sparsity-adaptive methods are preferred.
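As a concrete illustration of coefficient truncation, the sketch below builds an orthonormal DCT-II basis and reconstructs a smooth 1-D signal from progressively longer coefficient prefixes. This is a generic NumPy illustration of the principle, not the pipeline of any specific codec:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: rows are basis functions, low frequency first."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)  # DC row rescaled so that M @ M.T == I
    return M

rng = np.random.default_rng(0)
n = 32
# Smooth toy "image row": energy concentrated in low frequencies.
x = np.cumsum(rng.standard_normal(n))

Phi = dct_matrix(n)
coeffs = Phi @ x                     # full transform

errors = []
for k in (4, 8, 16, 32):             # decode progressively longer prefixes
    x_hat = Phi[:k].T @ coeffs[:k]   # truncate to the first k basis functions
    errors.append(np.linalg.norm(x - x_hat))

# Fidelity improves monotonically as more coefficients arrive.
assert all(errors[i] >= errors[i + 1] for i in range(len(errors) - 1))
assert errors[-1] < 1e-6             # the full prefix reconstructs exactly
```

Because the truncated reconstructions are projections onto nested subspaces, the error is guaranteed to be non-increasing in the prefix length—exactly the prefix-decodability property PVC requires.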
Neural Autoencoder and Latent Plane PVC
Modern PVC deploys end-to-end learned autoencoders with hierarchical or progressive bitstream organization:
- Trit-plane Coding (DPICT, CTC): Images are encoded into quantized latents $\hat{y}$, decomposed into trit-planes, with more significant trits transmitted first. RD-priority ordering ensures optimal per-bit distortion reduction (Lee et al., 2021, Jeon et al., 2023).
- Nested Quantization (PLONQ): Latent tensors are encoded with multiple quantization grids (nested scaling factors, e.g., $\Delta, \Delta/2, \dots$). Each refinement step decodes the difference between coarser and finer grids. Embedded ordering yields tens of quality levels (Lu et al., 2021).
- Contextual Probability/Distortion Modules: Probability estimation and partial-tensor refinement using convolutional context (CRR/CDR modules) and selective retraining for improved partial decoding (Jeon et al., 2023).
- Residual/Masked Granularity: Split latents into base, top, and residual representations; transmit residuals using variance-aware prioritization and entropy module refinement (Presta et al., 15 Nov 2024).
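A minimal sketch of the trit-plane idea on toy nonnegative integer latents (real codecs additionally code signs and drive each trit with a learned probability model):

```python
import numpy as np

# Toy nonnegative integer latents.
latent = np.array([17, 3, 0, 25, 8, 1])
n_planes = 3  # all values < 3**3 = 27, so three trit-planes suffice

# Encode: peel off trits from the most to the least significant plane.
planes, rem = [], latent.copy()
for p in reversed(range(n_planes)):
    planes.append(rem // 3**p)   # trit at significance level p
    rem = rem % 3**p

# Decode progressively: each received plane refines the reconstruction.
recon = np.zeros_like(latent)
errors = []
for p, plane in zip(reversed(range(n_planes)), planes):
    recon = recon + plane * 3**p
    errors.append(int(np.abs(latent - recon).sum()))

assert errors[0] > errors[1] > errors[2] == 0   # fidelity improves per plane
assert np.array_equal(recon, latent)            # all planes -> exact recovery
```

Truncating the stream after any plane yields a valid coarse reconstruction, which is the property the RD-prioritized orderings in DPICT/CTC then optimize at trit granularity.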
PVC in Tokenized Vision Representations
Vision-language models (VLMs) require progressive condensation and adaptation of visual tokens for efficient multimodal integration:
- Hierarchical Compression in ViT: Windowed compression modules progressively merge patch tokens at defined transformer layers; refined patch embedding adapts patch size without retraining, yielding large-scale native-resolution encoding and up to 64× token reduction (Sun et al., 26 Nov 2025).
- Temporal Compression for Unified Image/Video: Treat images as repeated static video frames; per-frame progressive transformer blocks use causal temporal attention and AdaLN for slice-sensitive adaptation (Yang et al., 12 Dec 2024).
- Selective Token Aggregation (QG-VTC): Question-guided correlation scoring identifies most relevant vision tokens, recycles non-selected ones via self-attention, and progressively prunes across layers (Li et al., 1 Apr 2025).
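The hierarchical merging step can be sketched with mean-pooling as an illustrative stand-in for the learned windowed compression modules (the real modules use attention/MLPs rather than a plain mean; shapes here are arbitrary):

```python
import numpy as np

def merge_window_tokens(tokens, grid_h, grid_w, window=2):
    """Merge patch tokens in non-overlapping windows by averaging.
    Illustrative only: learned compression modules replace the mean."""
    d = tokens.shape[-1]
    t = tokens.reshape(grid_h, grid_w, d)
    t = t.reshape(grid_h // window, window, grid_w // window, window, d)
    return t.mean(axis=(1, 3)).reshape(-1, d)

# 16x16 grid of 64-dim patch tokens.
tokens = np.random.default_rng(0).standard_normal((16 * 16, 64))
stage1 = merge_window_tokens(tokens, 16, 16)   # 256 -> 64 tokens
stage2 = merge_window_tokens(stage1, 8, 8)     # 64 -> 16 tokens
assert stage1.shape == (64, 64) and stage2.shape == (16, 64)
```

Cascading such stages at successive transformer layers is what yields the coarse-to-fine token hierarchy: early layers see all patches, later layers a progressively condensed set.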
Geometric/Volumetric PVC
PVC schemes for mesh and volumetric video adopt layer-wise multi-resolution or hierarchical Gaussian primitives:
- Irregular Multi-Resolution Mesh Analysis: Progressive simplification via lifting schemes, with adaptive quantization per vertex and layered bitstreams (Abderrahim et al., 2013).
- 4D Gaussian Hierarchical Coding: Layered primitive partitioning based on perceptual significance, motion-adaptive grouping, and attribute-specific entropy modeling for time-flexible volumetric streaming (Zheng et al., 22 Sep 2025).
3. Rate–Distortion Prioritization, Ordering, and Bitstream Organization
Progressive transmission in PVC is governed by rigorous prioritization mechanisms:
- Rate–Distortion (RD) Sorting: Quantify the per-element RD slope $\Delta D / \Delta R$ and greedily transmit bits or trits delivering maximal distortion improvement per bit spent (Lee et al., 2021, Lu et al., 2021, Jeon et al., 2023).
- Latent and Token Ordering: Elements are sorted using local entropy metrics (e.g., the self-information $-\log_2 p(\hat{y}_i)$) or explicit rate–distortion evaluations; blocks/channels with highest informativeness are sent first.
- Hierarchical/Layered Embedding: Bitstreams are organized into multi-layered packets—scales, planes, windows, granularity levels—with each layer augmenting fidelity and often corresponding to explicit computational units (scales, slices, tokens) (Zhang et al., 2022, Yang et al., 12 Dec 2024, Sun et al., 26 Nov 2025, Zheng et al., 22 Sep 2025).
- Rate Enhancement and Context Modules: Remedial networks (REMs) and slice-context modules refine entropy parameter estimates at progressive checkpoints, maintaining RD-optimality under adaptive transmission (Presta et al., 15 Nov 2024, Jeon et al., 2023).
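The greedy RD-prioritized ordering reduces to sorting refinements by distortion reduction per bit. The sketch below uses synthetic per-element costs and benefits; in practice both are estimated from the learned entropy model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Synthetic per-element refinement costs (bits) and benefits (distortion drop).
bits = rng.uniform(0.5, 4.0, n)
dist_drop = rng.uniform(0.1, 5.0, n)

# RD-priority: transmit refinements in decreasing distortion-per-bit order.
slopes = dist_drop / bits
order = np.argsort(-slopes)

assert np.all(np.diff(slopes[order]) <= 0)  # steepest RD slope sent first
```

Because the steepest slopes come first, every truncation point of this schedule lies on (the convex hull of) the best achievable rate–distortion trade-off for these refinements.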
4. Quantitative Performance, Scalability, and Computational Complexity
PVC systems are quantitatively benchmarked in terms of RD performance (PSNR, MS-SSIM, BD-rate), scalability (number of supported quality points), and compute/resource efficiency.
- Progressive Range and Fine Granular Scalability (FGS): DPICT supports 164 distinct rates, while PLONQ achieves 20–30 discrete quality points per bitstream (Lee et al., 2021, Lu et al., 2021).
- RD Gains: DPICT offers +1.7 dB PSNR gain over JPEG2000 FGS at 0.75 bpp and +1.1 dB MS-SSIM over RNN-based codecs; CTC achieves −14.84% BD-rate on Kodak (Lee et al., 2021, Jeon et al., 2023).
- Token Compression/MLLMs: QG-VTC maintains >99% accuracy with 1/4 tokens and 94% with 1/8, at 30% computational cost (Li et al., 1 Apr 2025); LLaVA-UHD v3 reduces time-to-first-token (TTFT) by 1.9–2.4× versus prior art (Sun et al., 26 Nov 2025).
- Computational Complexity: MSP reduces decoding complexity to O(1) network executions per image (versus serial autoregressive decoding), yielding a 20× speedup compared to standard CNN or PixelCNN models (Zhang et al., 2022). PVC with windowed token compression gives a 3–4× cost reduction in ViT self-attention (Sun et al., 26 Nov 2025).
- Volumetric Streaming: 4DGCPro achieves +2–7 dB BD-PSNR over benchmarks with real-time rendering (10–43 ms/frame) even on mobile platforms (Zheng et al., 22 Sep 2025).
5. Training Protocols, Losses, and Architectural Adaptations
PVC models are typically trained using standard rate–distortion objectives, sometimes augmented for progressive behavior:
- Single-Rate or Multi-Rate Training: PLONQ and DPICT use the standard RD loss (e.g., $\mathcal{L} = R + \lambda D$); PVC-adapted training can insert drop/block/masking or progressive scheduling (double-tail-drop in ProgDTD) (Lee et al., 2021, Lu et al., 2021, Hojjat et al., 2023).
- Progressive Learning Paradigms: Visual prompt tuning modules (LPM) adapt transformer blocks for variable-rate compression, with only a fraction of data/parameters required for each rate, yielding roughly 80% savings in model storage and 90% in training data over conventional multi-rate methods (Qin et al., 2023).
- Universal and Residual Quantization: UQDM replaces Gaussian with uniform channels in diffusion models, allowing universal quantization and a direct compression cost via the negative ELBO (Yang et al., 14 Dec 2024). Residual masking and slice-wise entropy modules yield element-wise progressive scalability (Presta et al., 15 Nov 2024).
- Context-driven Refinements: Context-based modules read previous partial information at each plane or slice (CRR, CDR, REMs) to sharpen probabilities and reduce distortion in partial reconstructions (Jeon et al., 2023, Presta et al., 15 Nov 2024).
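The standard rate–distortion objective can be made concrete with a toy entropy model. The sketch below uses a unit-bin Laplace distribution as an illustrative stand-in for a learned entropy model, rounding as a stand-in for the analysis transform, and computes $\mathcal{L} = R + \lambda D$:

```python
import numpy as np

def laplace_cdf(y, b=1.0):
    return np.where(y < 0, 0.5 * np.exp(y / b), 1.0 - 0.5 * np.exp(-y / b))

def rate_bits(y_int, b=1.0):
    """R: total self-information -log2 P(y) of integer latents under a
    unit-bin Laplace entropy model (stand-in for a learned model)."""
    p = laplace_cdf(y_int + 0.5, b) - laplace_cdf(y_int - 0.5, b)
    return float(-np.sum(np.log2(p)))

def rd_loss(x, x_hat, y_int, lam=0.01):
    """L = R + lambda * D: rate plus weighted MSE distortion."""
    return rate_bits(y_int) + lam * float(np.mean((x - x_hat) ** 2))

x = np.array([0.2, -1.4, 3.1])
y = np.round(x).astype(int)          # toy "analysis transform": rounding
loss = rd_loss(x, y.astype(float), y)
assert loss > rate_bits(y) > 0.0     # distortion adds a positive penalty
```

In learned codecs the same scalar loss is backpropagated through differentiable proxies for quantization and through the entropy model's parameters; $\lambda$ selects the operating point on the RD curve.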
6. Limitations, Open Directions, and Extensions
PVC introduces additional overheads and challenges:
- Sorting and Coding Complexity: RD-prioritized sorting and dynamic arithmetic coding can be costly for high-dimensional latents (Lee et al., 2021, Jeon et al., 2023, Presta et al., 15 Nov 2024).
- Training Mismatch: Models trained at full rates may exhibit poor behavior at partial rates, necessitating post-processing or multi-rate finetuning (Lee et al., 2021, Jeon et al., 2023).
- Perceptual Fidelity under Progressiveness: Current objectives prioritize MSE or MS-SSIM; GAN/perceptual losses and subjective metrics under FGS remain open directions (Lee et al., 2021).
- Extension to Video and 4D Data: Adapting trit-plane or token-compression schemes to temporal, multispectral, and volumetric data necessitates careful management of inter-frame motion, grouping, and coherence (Yang et al., 12 Dec 2024, Zheng et al., 22 Sep 2025).
- VLM Scalability: PVC for VLMs operates orthogonally to model scale and can be plug-and-play for various architectures; dynamic token budgets and adaptive frame selection for ultra-long sequences are active areas (Sun et al., 26 Nov 2025, Yang et al., 12 Dec 2024).
- Hardware Integration: True progressive previewing in mesh/volumetric codecs is gated by GPU decoding and streaming hardware; all proposed methods are implementable on commodity platforms (Abderrahim et al., 2013, Zheng et al., 22 Sep 2025).
7. Representative PVC Methods: Summary Table
| Core Methodology | Principle | Scalability & Efficiency |
|---|---|---|
| DPICT (Lee et al., 2021) | Trit-plane coding + RD sort | 164 rates, +1.7dB PSNR over JPEG2000, FGS, small postprocessor |
| PLONQ (Lu et al., 2021) | Nested quantization + ordering | 20–30 embedded points, 0.3–0.5dB PSNR gain over SPIHT |
| CTC (Jeon et al., 2023) | Context-based modules | –14.84% BD-rate, marginal time overhead |
| ProgDTD (Hojjat et al., 2023) | Double-tail-drop regularization | O(1) param, MS-SSIM ≈ oracle, highly customizable |
| MSP + LOF (Zhang et al., 2022) | Multi-scale, O(1) decoder | 20× decode speedup, –2.5% BD-rate vs. VVC/H.266 |
| LLaVA-UHD v3 (Sun et al., 26 Nov 2025) | Windowed token compression | 64× token reduction, 1.9–2.4× TTFT cut, patch-size adaptable |
| QG-VTC (Li et al., 1 Apr 2025) | Question-guided token sel. | 1/8 tokens, 94.3% VQA acc., 30% cost |
| PVC-VLM (Yang et al., 12 Dec 2024) | Unified image/video token | 64 tokens/frame, SOTA on MVBench, DocVQA, etc., minimal image loss |
| 4DGCPro (Zheng et al., 22 Sep 2025) | Hierarchical 4D Gaussian | Real-time decode, +2–7dB BD-PSNR, mobile-ready |
| PVC-residual (Presta et al., 15 Nov 2024) | Variance-aware masking | Competitive RD, 2× speedup, no extra param |
References
- (Lee et al., 2021): DPICT: Deep Progressive Image Compression Using Trit-Planes
- (Lu et al., 2021): Progressive Neural Image Compression with Nested Quantization and Latent Ordering
- (Jeon et al., 2023): Context-Based Trit-Plane Coding for Progressive Image Compression
- (Hojjat et al., 2023): ProgDTD: Progressive Learned Image Compression with Double-Tail-Drop Training
- (Zhang et al., 2022): Leveraging Progressive Model and Overfitting for Efficient Learned Image Compression
- (Sun et al., 26 Nov 2025): LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
- (Li et al., 1 Apr 2025): QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
- (Yang et al., 12 Dec 2024): PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
- (Zheng et al., 22 Sep 2025): 4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
- (Presta et al., 15 Nov 2024): Efficient Progressive Image Compression with Variance-aware Masking
- (Pandharkar et al., 2011): Progressive versus Random Projections for Compressive Capture of Images, Lightfields and Higher Dimensional Visual Signals
- (Abderrahim et al., 2013): Progressive Compression of 3D Objects with an Adaptive Quantization
- (Qin et al., 2023): Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression
PVC unifies diverse approaches for incremental fidelity in visual data transmission, supporting contemporary machine learning workloads that demand both scalability and efficiency within shared frameworks.