Progressive Visual Compression
- Progressive Visual Compression (PVC) incrementally refines decoded visual data by transmitting prioritized bitstream fragments, so that each additional fragment improves image, video, or 3D reconstruction fidelity.
- PVC leverages transform coding, learned autoencoders, and token aggregation to enable scalable, adaptive compression across diverse visual modalities.
- PVC methodologies prioritize rate–distortion optimization and structured encoding, supporting applications from standard imagery to interactive volumetric video streaming.
Progressive Visual Compression (PVC) refers to a class of visual coding methods—spanning images, video, 3D/4D geometric data, and tokenized representations for vision-language models—that enable incremental transmission and decoding of visual signals. In PVC, each additional fragment of the bitstream brings improved fidelity, with the architecture and bitstream designed so that any prefix yields a plausible approximation. PVC subsumes and generalizes classic transform coding, learned autoencoders with scalable bitstreams, trit-plane decompositions, nested quantization, hierarchical token aggregation, and adaptive masking or prioritization schemes.
1. Key Principles and Taxonomy of PVC
PVC is built upon the idea of progressive refinement: the decoder reconstructs intermediate approximations by partially decoding the bitstream up to the current truncation point. This paradigm was first evident in transform coding approaches, such as JPEG (DCT-based) and JPEG2000 (wavelet-based), where low-frequency coefficients are transmitted first (Pandharkar et al., 2011). Modern neural approaches extend PVC to learned latent spaces, enabling fine-grained scalability and prioritization based on rate–distortion criteria (Lee et al., 2021, Lu et al., 2021, Jeon et al., 2023, Hojjat et al., 2023, Presta et al., 15 Nov 2024).
Core variants include:
- Hierarchical (coarse-to-fine) coding: Multi-scale decomposition in latent or spatial domains (MSP model) (Zhang et al., 2022), voxel or Gaussian layers for volumetric video (Zheng et al., 22 Sep 2025), or visual tokens in VLMs (Yang et al., 12 Dec 2024, Sun et al., 26 Nov 2025).
- Plane-wise decomposition: Bit-plane (binary), trit-plane (ternary), or slice-wise latent structuring, often with RD-prioritized ordering (Lee et al., 2021, Jeon et al., 2023).
- Token aggregation: Progressive selection and aggregation of patch, window, or frame tokens for native-resolution ViTs or VLM encoders (Sun et al., 26 Nov 2025, Yang et al., 12 Dec 2024, Li et al., 1 Apr 2025).
- Residual and masking strategies: Structured residual extraction and masking based on entropy or variance, yielding element-wise granularity (Presta et al., 15 Nov 2024).
PVC supports fine-grained scalability, variable-rate operation (via truncation or prioritization), and robust previews under partial reception.
2. Methodologies for PVC: Transform Coding, Neural, and Geometric Approaches
Classic PVC via Transform Coding
Traditional PVC adopts orthonormal basis projections—DCT, wavelets, Fourier—emitting coefficients in ascending significance order (Pandharkar et al., 2011). For an $n$-dimensional visual signal $x$, measurements $y = \Phi_k x$ use the first $k$ rows of an orthonormal basis matrix $\Phi$ (the most significant basis functions), yielding progressive fidelity by coefficient truncation:

$$\hat{x}_k = \Phi_k^{\top} \Phi_k x = \sum_{i=1}^{k} \langle x, \phi_i \rangle \phi_i.$$

The approach is robust for 2D images, but less so for high-dimensional data (multispectral/light-field), where randomized projections or sparsity-adaptive methods are preferred.
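As a concrete illustration of coefficient truncation, the sketch below builds an orthonormal DCT-II basis and reconstructs a smooth 1-D signal from progressively longer coefficient prefixes. This is a generic NumPy illustration of the principle, not the pipeline of any specific codec:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: rows are basis functions, low frequency first."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)  # DC row rescaled so that M @ M.T == I
    return M

rng = np.random.default_rng(0)
n = 32
# Smooth toy "image row": energy concentrated in low frequencies.
x = np.cumsum(rng.standard_normal(n))

Phi = dct_matrix(n)
coeffs = Phi @ x                     # full transform

errors = []
for k in (4, 8, 16, 32):             # decode progressively longer prefixes
    x_hat = Phi[:k].T @ coeffs[:k]   # truncate to the first k basis functions
    errors.append(np.linalg.norm(x - x_hat))

# Fidelity improves monotonically as more coefficients arrive.
assert all(errors[i] >= errors[i + 1] for i in range(len(errors) - 1))
assert errors[-1] < 1e-6             # the full prefix reconstructs exactly
```

Because the truncated reconstructions are projections onto nested subspaces, the error is guaranteed to be non-increasing in the prefix length—exactly the prefix-decodability property PVC requires.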
Neural Autoencoder and Latent Plane PVC
Modern PVC deploys end-to-end learned autoencoders with hierarchical or progressive bitstream organization:
- Trit-plane Coding (DPICT, CTC): Images are encoded into quantized latents $\hat{y}$, decomposed into trit-planes, with more significant trits transmitted first. RD-priority ordering ensures optimal per-bit distortion reduction (Lee et al., 2021, Jeon et al., 2023).
- Nested Quantization (PLONQ): Latent tensors are encoded with multiple quantization grids (nested scaling factors, e.g., $\Delta, \Delta/2, \dots$). Each refinement step decodes the difference between coarser and finer grids. Embedded ordering yields tens of quality levels (Lu et al., 2021).
- Contextual Probability/Distortion Modules: Probability estimation and partial-tensor refinement using convolutional context (CRR/CDR modules) and selective retraining for improved partial decoding (Jeon et al., 2023).
- Residual/Masked Granularity: Split latents into base, top, and residual representations; transmit residuals using variance-aware prioritization and entropy module refinement (Presta et al., 15 Nov 2024).
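A minimal sketch of the trit-plane idea on toy nonnegative integer latents (real codecs additionally code signs and drive each trit with a learned probability model):

```python
import numpy as np

# Toy nonnegative integer latents.
latent = np.array([17, 3, 0, 25, 8, 1])
n_planes = 3  # all values < 3**3 = 27, so three trit-planes suffice

# Encode: peel off trits from the most to the least significant plane.
planes, rem = [], latent.copy()
for p in reversed(range(n_planes)):
    planes.append(rem // 3**p)   # trit at significance level p
    rem = rem % 3**p

# Decode progressively: each received plane refines the reconstruction.
recon = np.zeros_like(latent)
errors = []
for p, plane in zip(reversed(range(n_planes)), planes):
    recon = recon + plane * 3**p
    errors.append(int(np.abs(latent - recon).sum()))

assert errors[0] > errors[1] > errors[2] == 0   # fidelity improves per plane
assert np.array_equal(recon, latent)            # all planes -> exact recovery
```

Truncating the stream after any plane yields a valid coarse reconstruction, which is the property the RD-prioritized orderings in DPICT/CTC then optimize at trit granularity.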
PVC in Tokenized Vision Representations
Vision-language models (VLMs) require progressive condensation and adaptation of visual tokens for efficient multimodal integration:
- Hierarchical Compression in ViT: Windowed compression modules progressively merge patch tokens at defined transformer layers; refined patch embedding adapts patch size without retraining, yielding large-scale native-resolution encoding and up to 64× token reduction (Sun et al., 26 Nov 2025).
- Temporal Compression for Unified Image/Video: Treat images as repeated static video frames; per-frame progressive transformer blocks use causal temporal attention and AdaLN for slice-sensitive adaptation (Yang et al., 12 Dec 2024).
- Selective Token Aggregation (QG-VTC): Question-guided correlation scoring identifies most relevant vision tokens, recycles non-selected ones via self-attention, and progressively prunes across layers (Li et al., 1 Apr 2025).
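The hierarchical merging step can be sketched with mean-pooling as an illustrative stand-in for the learned windowed compression modules (the real modules use attention/MLPs rather than a plain mean; shapes here are arbitrary):

```python
import numpy as np

def merge_window_tokens(tokens, grid_h, grid_w, window=2):
    """Merge patch tokens in non-overlapping windows by averaging.
    Illustrative only: learned compression modules replace the mean."""
    d = tokens.shape[-1]
    t = tokens.reshape(grid_h, grid_w, d)
    t = t.reshape(grid_h // window, window, grid_w // window, window, d)
    return t.mean(axis=(1, 3)).reshape(-1, d)

# 16x16 grid of 64-dim patch tokens.
tokens = np.random.default_rng(0).standard_normal((16 * 16, 64))
stage1 = merge_window_tokens(tokens, 16, 16)   # 256 -> 64 tokens
stage2 = merge_window_tokens(stage1, 8, 8)     # 64 -> 16 tokens
assert stage1.shape == (64, 64) and stage2.shape == (16, 64)
```

Cascading such stages at successive transformer layers is what yields the coarse-to-fine token hierarchy: early layers see all patches, later layers a progressively condensed set.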
Geometric/Volumetric PVC
PVC schemes for mesh and volumetric video adopt layer-wise multi-resolution or hierarchical Gaussian primitives:
- Irregular Multi-Resolution Mesh Analysis: Progressive simplification via lifting schemes, with adaptive quantization per vertex and layered bitstreams (Abderrahim et al., 2013).
- 4D Gaussian Hierarchical Coding: Layered primitive partitioning based on perceptual significance, motion-adaptive grouping, and attribute-specific entropy modeling for time-flexible volumetric streaming (Zheng et al., 22 Sep 2025).
3. Rate–Distortion Prioritization, Ordering, and Bitstream Organization
Progressive transmission in PVC is governed by rigorous prioritization mechanisms:
- Rate–Distortion (RD) Sorting: Quantify the per-element RD slope $\Delta D / \Delta R$ and greedily transmit bits or trits delivering maximal distortion improvement per bit spent (Lee et al., 2021, Lu et al., 2021, Jeon et al., 2023).
- Latent and Token Ordering: Elements are sorted using local entropy metrics (e.g., the self-information $-\log_2 p(\hat{y}_i)$) or explicit rate–distortion evaluations; blocks/channels with highest informativeness are sent first.
- Hierarchical/Layered Embedding: Bitstreams are organized into multi-layered packets—scales, planes, windows, granularity levels—with each layer augmenting fidelity and often corresponding to explicit computational units (scales, slices, tokens) (Zhang et al., 2022, Yang et al., 12 Dec 2024, Sun et al., 26 Nov 2025, Zheng et al., 22 Sep 2025).
- Rate Enhancement and Context Modules: Remedial networks (REMs) and slice-context modules refine entropy parameter estimates at progressive checkpoints, maintaining RD-optimality under adaptive transmission (Presta et al., 15 Nov 2024, Jeon et al., 2023).
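The greedy RD-prioritized ordering reduces to sorting refinements by distortion reduction per bit. The sketch below uses synthetic per-element costs and benefits; in practice both are estimated from the learned entropy model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Synthetic per-element refinement costs (bits) and benefits (distortion drop).
bits = rng.uniform(0.5, 4.0, n)
dist_drop = rng.uniform(0.1, 5.0, n)

# RD-priority: transmit refinements in decreasing distortion-per-bit order.
slopes = dist_drop / bits
order = np.argsort(-slopes)

assert np.all(np.diff(slopes[order]) <= 0)  # steepest RD slope sent first
```

Because the steepest slopes come first, every truncation point of this schedule lies on (the convex hull of) the best achievable rate–distortion trade-off for these refinements.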
4. Quantitative Performance, Scalability, and Computational Complexity
PVC systems are quantitatively benchmarked in terms of RD performance (PSNR, MS-SSIM, BD-rate), scalability (number of supported quality points), and compute/resource efficiency.
- Progressive Range and Fine Granular Scalability (FGS): DPICT supports 164 distinct rates, while PLONQ achieves 20–30 discrete quality points per bitstream (Lee et al., 2021, Lu et al., 2021).
- RD Gains: DPICT offers +1.7 dB PSNR gain over JPEG2000 FGS at 0.75 bpp and +1.1 dB MS-SSIM over RNN-based codecs; CTC achieves −14.84% BD-rate on Kodak (Lee et al., 2021, Jeon et al., 2023).
- Token Compression/MLLMs: QG-VTC maintains >99% accuracy with 1/4 tokens and 94% with 1/8, at 30% computational cost (Li et al., 1 Apr 2025); LLaVA-UHD v3 reduces time-to-first-token (TTFT) by 1.9–2.4× versus prior art (Sun et al., 26 Nov 2025).
- Computational Complexity: MSP reduces decoding complexity to O(1) network executions per image (versus serial autoregressive decoding), yielding a 20× speedup compared to standard CNN or PixelCNN models (Zhang et al., 2022). PVC with windowed token compression gives a 3–4× cost reduction in ViT self-attention (Sun et al., 26 Nov 2025).
- Volumetric Streaming: 4DGCPro achieves +2–7 dB BD-PSNR over benchmarks with real-time rendering (10–43 ms/frame) even on mobile platforms (Zheng et al., 22 Sep 2025).
5. Training Protocols, Losses, and Architectural Adaptations
PVC models are typically trained using standard rate–distortion objectives, sometimes augmented for progressive behavior:
- Single-Rate or Multi-Rate Training: PLONQ and DPICT use the standard RD loss (e.g., $\mathcal{L} = R + \lambda D$); PVC-adapted training can insert drop/block/masking or progressive scheduling (double-tail-drop in ProgDTD) (Lee et al., 2021, Lu et al., 2021, Hojjat et al., 2023).
- Progressive Learning Paradigms: Visual prompt tuning modules (LPM) adapt transformer blocks for variable-rate compression, with only a fraction of data/parameters required for each rate, yielding roughly 80% savings in model storage and 90% in training data over conventional multi-rate methods (Qin et al., 2023).
- Universal and Residual Quantization: UQDM replaces Gaussian with uniform channels in diffusion models, allowing universal quantization and a direct compression cost via the negative ELBO (Yang et al., 14 Dec 2024). Residual masking and slice-wise entropy modules yield element-wise progressive scalability (Presta et al., 15 Nov 2024).
- Context-driven Refinements: Context-based modules read previous partial information at each plane or slice (CRR, CDR, REMs) to sharpen probabilities and reduce distortion in partial reconstructions (Jeon et al., 2023, Presta et al., 15 Nov 2024).
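The standard rate–distortion objective can be made concrete with a toy entropy model. The sketch below uses a unit-bin Laplace distribution as an illustrative stand-in for a learned entropy model, rounding as a stand-in for the analysis transform, and computes $\mathcal{L} = R + \lambda D$:

```python
import numpy as np

def laplace_cdf(y, b=1.0):
    return np.where(y < 0, 0.5 * np.exp(y / b), 1.0 - 0.5 * np.exp(-y / b))

def rate_bits(y_int, b=1.0):
    """R: total self-information -log2 P(y) of integer latents under a
    unit-bin Laplace entropy model (stand-in for a learned model)."""
    p = laplace_cdf(y_int + 0.5, b) - laplace_cdf(y_int - 0.5, b)
    return float(-np.sum(np.log2(p)))

def rd_loss(x, x_hat, y_int, lam=0.01):
    """L = R + lambda * D: rate plus weighted MSE distortion."""
    return rate_bits(y_int) + lam * float(np.mean((x - x_hat) ** 2))

x = np.array([0.2, -1.4, 3.1])
y = np.round(x).astype(int)          # toy "analysis transform": rounding
loss = rd_loss(x, y.astype(float), y)
assert loss > rate_bits(y) > 0.0     # distortion adds a positive penalty
```

In learned codecs the same scalar loss is backpropagated through differentiable proxies for quantization and through the entropy model's parameters; $\lambda$ selects the operating point on the RD curve.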
6. Limitations, Open Directions, and Extensions
PVC introduces additional overheads and challenges:
- Sorting and Coding Complexity: RD-prioritized sorting and dynamic arithmetic coding can be costly for high-dimensional latents (Lee et al., 2021, Jeon et al., 2023, Presta et al., 15 Nov 2024).
- Training Mismatch: Models trained at full rates may exhibit poor behavior at partial rates, necessitating post-processing or multi-rate finetuning (Lee et al., 2021, Jeon et al., 2023).
- Perceptual Fidelity under Progressiveness: Current objectives prioritize MSE or MS-SSIM; GAN/perceptual losses and subjective metrics under FGS remain open directions (Lee et al., 2021).
- Extension to Video and 4D Data: Adapting trit-plane or token-compression schemes to temporal, multispectral, and volumetric data necessitates careful management of inter-frame motion, grouping, and coherence (Yang et al., 12 Dec 2024, Zheng et al., 22 Sep 2025).
- VLM Scalability: PVC for VLMs operates orthogonally to model scale and can be plug-and-play for various architectures; dynamic token budgets and adaptive frame selection for ultra-long sequences are active areas (Sun et al., 26 Nov 2025, Yang et al., 12 Dec 2024).
- Hardware Integration: True progressive previewing in mesh/volumetric codecs is gated by GPU decoding and streaming hardware; all proposed methods are implementable on commodity platforms (Abderrahim et al., 2013, Zheng et al., 22 Sep 2025).
7. Representative PVC Methods: Summary Table
| Core Methodology | Principle | Scalability & Efficiency |
|---|---|---|
| DPICT (Lee et al., 2021) | Trit-plane coding + RD sort | 164 rates, +1.7dB PSNR over JPEG2000, FGS, small postprocessor |
| PLONQ (Lu et al., 2021) | Nested quantization + ordering | 20–30 embedded points, 0.3–0.5dB PSNR gain over SPIHT |
| CTC (Jeon et al., 2023) | Context-based modules | –14.84% BD-rate, marginal time overhead |
| ProgDTD (Hojjat et al., 2023) | Double-tail-drop regularization | O(1) param, MS-SSIM ≈ oracle, highly customizable |
| MSP + LOF (Zhang et al., 2022) | Multi-scale, O(1) decoder | 20× decode speedup, –2.5% BD-rate vs. VVC/H.266 |
| LLaVA-UHD v3 (Sun et al., 26 Nov 2025) | Windowed token compression | 64× token reduction, 1.9–2.4× TTFT cut, patch-size adaptable |
| QG-VTC (Li et al., 1 Apr 2025) | Question-guided token sel. | 1/8 tokens, 94.3% VQA acc., 30% cost |
| PVC-VLM (Yang et al., 12 Dec 2024) | Unified image/video token | 64 tokens/frame, SOTA on MVBench, DocVQA, etc., minimal image loss |
| 4DGCPro (Zheng et al., 22 Sep 2025) | Hierarchical 4D Gaussian | Real-time decode, +2–7dB BD-PSNR, mobile-ready |
| PVC-residual (Presta et al., 15 Nov 2024) | Variance-aware masking | Competitive RD, 2× speedup, no extra param |
References
- (Lee et al., 2021): DPICT: Deep Progressive Image Compression Using Trit-Planes
- (Lu et al., 2021): Progressive Neural Image Compression with Nested Quantization and Latent Ordering
- (Jeon et al., 2023): Context-Based Trit-Plane Coding for Progressive Image Compression
- (Hojjat et al., 2023): ProgDTD: Progressive Learned Image Compression with Double-Tail-Drop Training
- (Zhang et al., 2022): Leveraging Progressive Model and Overfitting for Efficient Learned Image Compression
- (Sun et al., 26 Nov 2025): LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
- (Li et al., 1 Apr 2025): QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
- (Yang et al., 12 Dec 2024): PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
- (Zheng et al., 22 Sep 2025): 4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
- (Presta et al., 15 Nov 2024): Efficient Progressive Image Compression with Variance-aware Masking
- (Pandharkar et al., 2011): Progressive versus Random Projections for Compressive Capture of Images, Lightfields and Higher Dimensional Visual Signals
- (Abderrahim et al., 2013): Progressive Compression of 3D Objects with an Adaptive Quantization
- (Qin et al., 2023): Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression
PVC unifies diverse approaches for incremental fidelity in visual data transmission, supporting contemporary machine learning workloads that demand both scalability and efficiency within shared frameworks.