Visual Gaussian Quantization (VGQ)
- VGQ is a visual quantization framework that represents images through Gaussian primitives defined by position, covariance, and appearance.
- It employs structure-guided allocation and adaptive bitwidth quantization to balance compression rates with detail preservation.
- The method extends to 3D representations and visual tokenization, achieving significant gains in rate-distortion performance.
Visual Gaussian Quantization (VGQ) refers to a family of quantization and compression techniques that exploit the structural, geometric, and statistical properties of Gaussian primitives for efficient representation of images and scenes. By leveraging the properties of 2D and 3D Gaussian splatting, VGQ frameworks optimize for both rate-distortion performance and structural preservation, often using adaptive quantization strategies guided by visual and structural cues derived from the data.
1. Principles of Gaussian Quantization for Visual Data
The core of VGQ is representing visual data as a sum of Gaussian “primitives,” each parameterized by position, covariance (for scale and orientation), and appearance coefficients. For 2D images, this is formalized as
$I(x,y)\;\approx\;\sum_{i=1}^{N} w_i\;\exp\!\Bigl(-\tfrac12\,d_i^{T}\,\Sigma_i^{-1}\,d_i\Bigr), \quad d_i=\begin{pmatrix}x\\y\end{pmatrix}-\mu_i,$
where $\mu_i$ is the Gaussian center, $\Sigma_i$ its covariance, and $w_i$ (or $c_i$ for color) its weight. In practice, $\Sigma_i$ is typically stored via the Cholesky factorization $\Sigma_i = L_i L_i^{T}$ for guaranteed positive semidefiniteness (Liang et al., 30 Dec 2025). This framework extends naturally to 3D, where Gaussian splats model radiance fields.
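The sum-of-Gaussians model above can be rendered directly. A minimal NumPy sketch (grayscale, single channel, with each covariance stored as a Cholesky factor so positive semidefiniteness holds by construction):

```python
import numpy as np

def render_gaussians(mu, L_chol, w, H, W):
    """Render an H x W grayscale image as a sum of 2D Gaussian primitives.

    mu:      (N, 2) centers (x, y)
    L_chol:  (N, 2, 2) lower-triangular Cholesky factors, so
             Sigma_i = L_i @ L_i.T is positive semidefinite by construction
    w:       (N,) scalar weights (one weight vector per channel for color)
    """
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(float)   # (H, W, 2) pixel coords
    img = np.zeros((H, W))
    for mu_i, L_i, w_i in zip(mu, L_chol, w):
        d = grid - mu_i                                # offsets d_i per pixel
        Sigma_inv = np.linalg.inv(L_i @ L_i.T)
        # Mahalanobis form d^T Sigma^{-1} d, evaluated at every pixel
        m = np.einsum('hwi,ij,hwj->hw', d, Sigma_inv, d)
        img += w_i * np.exp(-0.5 * m)
    return img

# One isotropic Gaussian (Sigma = 9 * I) centered in a 32 x 32 image
mu = np.array([[16.0, 16.0]])
L_chol = np.array([[[3.0, 0.0], [0.0, 3.0]]])
w = np.array([1.0])
img = render_gaussians(mu, L_chol, w, 32, 32)
```

The per-pixel loop over primitives is the naive formulation; practical renderers splat each Gaussian only over its local support.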
Naïve uniform quantization of these parameters (fixed bitwidths for all attributes and all primitives) induces suboptimal RD trade-offs: low-complexity regions receive excessive precision, while structurally complex regions suffer from under-representation.
2. Structure-Guided Allocation and Initialization
Modern VGQ methodologies employ structure-guided strategies to align representation capacity with image complexity (Liang et al., 30 Dec 2025). The procedure is as follows:
- Compute salient image structure via edge detectors (e.g. Sobel operator) and over-segment the image into superpixels (e.g. SLIC).
- Quantify local complexity within each superpixel $S_k$ as the variance of the gradient magnitude: $v_k = \operatorname{Var}_{(x,y)\in S_k}\bigl(g(x,y)\bigr)$, where $g(x,y)=\lVert\nabla I(x,y)\rVert$.
- Allocate Gaussian primitives adaptively: assign more to high-variance regions and fewer to smooth areas, with allocation ratios smoothly interpolating from highly non-uniform (e.g. 6:2:1 for high:medium:low complexity) towards uniform as total count increases.
- Initialize covariances so that Gaussians are small in complex regions and large in smooth ones.
This structure-guided policy ensures the representation is locally dense where fine detail is present and sparse in uniform areas, directly coupling image structure to encoding granularity.
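The allocation pipeline can be sketched in NumPy. This is an illustrative stand-in, not the paper's implementation: axis-aligned blocks replace SLIC superpixels, the Sobel gradient magnitude's variance scores complexity, and blocks are split into three tiers that receive primitives in the stated 6:2:1 ratio:

```python
import numpy as np

def allocate_primitives(img, n_total, block=8, ratios=(6, 2, 1)):
    """Allocate Gaussian primitives to regions by gradient-variance complexity.

    Sketch only: fixed blocks stand in for SLIC superpixels, and the tier
    split is a simple three-way partition of blocks sorted by complexity.
    """
    # Sobel gradient magnitude
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    H, W = img.shape
    pad = np.pad(img.astype(float), 1, mode='edge')
    gx = sum(kx[i, j] * pad[i:i+H, j:j+W] for i in range(3) for j in range(3))
    gy = sum(ky[i, j] * pad[i:i+H, j:j+W] for i in range(3) for j in range(3))
    g = np.hypot(gx, gy)

    # Per-block complexity = variance of the gradient magnitude
    scores = {}
    for y in range(0, H, block):
        for x in range(0, W, block):
            scores[(y, x)] = g[y:y+block, x:x+block].var()

    # Three complexity tiers (high/medium/low), allocated in the given ratio
    order = sorted(scores, key=scores.get, reverse=True)
    n = len(order)
    tiers = [order[:n // 3], order[n // 3:2 * n // 3], order[2 * n // 3:]]
    weights = np.asarray(ratios, float)
    counts = {}
    for tier, frac in zip(tiers, weights / weights.sum()):
        per_block = max(1, round(n_total * frac / max(len(tier), 1)))
        for key in tier:
            counts[key] = per_block
    return counts  # {(block_y, block_x): n_primitives}
```

With a textured region in one corner of an otherwise flat image, the textured blocks end up with several times more primitives than the flat ones.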
3. Adaptive Bitwidth Quantization and Rate-Distortion Optimization
VGQ implements per-primitive, adaptive quantization, crucial for efficient compression without perceptual degradation. The bitwidth $b_i$ allocated to each covariance parameter (and, by extension, each primitive) is set adaptively according to scale and local complexity, such that small, spatially precise Gaussians in complex regions receive higher-precision quantization, whereas large Gaussians in smooth regions receive fewer bits. Quantization is performed by a uniform quantizer:
$\hat\theta \;=\; \theta_{\min} + \operatorname{round}\!\Bigl(\tfrac{\theta-\theta_{\min}}{\Delta_i}\Bigr)\,\Delta_i, \qquad \Delta_i=\frac{\theta_{\max}-\theta_{\min}}{2^{b_i}-1}.$
The optimization objective during RD-aware fine-tuning is
$\mathcal{L} \;=\; \mathcal{D}(I,\hat I) \;+\; \lambda\,R,$
where $\mathcal{D}$ is typically MSE or MS-SSIM, $R$ is the total bit rate, and $\lambda$ balances fidelity versus rate (Liang et al., 30 Dec 2025).
The training employs straight-through estimators to make the rounding operation differentiable, allowing joint optimization of positions, covariances, appearance, and quantization bits.
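A minimal sketch of the forward pass of such a per-primitive uniform quantizer, plus an assumed (illustrative, not the paper's exact) monotone rule mapping scale and complexity to a bitwidth in the 6–16 bit range:

```python
import numpy as np

def uniform_quantize(theta, bits, lo, hi):
    """Uniform quantizer: map theta in [lo, hi] onto 2**bits levels.

    Step size Delta = (hi - lo) / (2**bits - 1); the dequantized value is
    lo + round((theta - lo) / Delta) * Delta. During training, round()
    would be bypassed on the backward pass (straight-through estimator).
    """
    delta = (hi - lo) / (2 ** bits - 1)
    q = np.round((np.clip(theta, lo, hi) - lo) / delta)
    return lo + q * delta

def adaptive_bits(scale, complexity, b_min=6, b_max=16):
    """Assumed allocation rule: small Gaussians (low normalized scale) in
    complex regions (high normalized complexity) get more bits. The exact
    mapping in the paper may differ; this is an illustrative sketch."""
    t = 0.5 * (complexity + (1.0 - scale))
    return int(round(b_min + t * (b_max - b_min)))

# A large Gaussian in a smooth region gets the coarsest precision,
# a small Gaussian in a complex region the finest.
coarse = adaptive_bits(scale=1.0, complexity=0.0)   # 6 bits
fine = adaptive_bits(scale=0.0, complexity=1.0)     # 16 bits
```

The quantization error of any value is bounded by half the step size, so higher bitwidths shrink the worst-case distortion geometrically.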
Hierarchical VGQ approaches extend to multi-attribute, multi-level quantization. In 3D Gaussian splatting, this includes inter-attribute (per-channel) and intra-attribute (blockwise) mixed-precision assignment, optimally solved via 0–1 integer linear programming and dynamic programming, respectively (Xie et al., 2024).
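The intra-attribute (blockwise) step can be sketched as a small knapsack-style dynamic program: given a per-block table of distortion at each candidate bitwidth and a total bit budget, choose one bitwidth per block minimizing total distortion. The distortion tables here are hypothetical; the papers' exact formulation may differ:

```python
def dp_bit_allocation(dist, budget):
    """Blockwise mixed-precision bit allocation via dynamic programming.

    dist:   list of dicts, dist[b][k] = distortion of block b at bitwidth k
    budget: maximum total bits, sum of chosen bitwidths must not exceed it
    Returns (minimal total distortion, list of chosen bitwidths per block).
    """
    # State: bits used so far -> (best distortion, choices leading to it)
    best = {0: (0.0, [])}
    for table in dist:
        nxt = {}
        for used, (d, choice) in best.items():
            for bits, dk in table.items():
                u = used + bits
                if u > budget:
                    continue  # prune states over budget
                cand = (d + dk, choice + [bits])
                if u not in nxt or cand[0] < nxt[u][0]:
                    nxt[u] = cand
        best = nxt
    return min(best.values(), key=lambda t: t[0])

# Two blocks, each quantizable at 4 or 8 bits, with a 12-bit budget:
# spending the extra bits on block 0 (larger distortion drop) wins.
dist = [{4: 1.0, 8: 0.2}, {4: 0.5, 8: 0.1}]
d_opt, bits_opt = dp_bit_allocation(dist, budget=12)
```

The state space is bounded by the budget, so the DP stays cheap even for many blocks; the inter-attribute assignment is a separate 0–1 ILP in the cited work.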
4. Geometry-Consistent Regularization and Structural Preservation
Beyond rate and representation, VGQ frameworks impose additional geometry-consistent regularization to encourage shape and orientation alignment between Gaussian primitives and underlying local image gradients. This is formalized as an edge alignment penalty
$\mathcal{L}_{\mathrm{align}} \;=\; \sum_{i}\bigl(e_i^{T}\,\hat g_i\bigr)^{2},$
where $e_i$ is the unit principal eigenvector of the covariance $\Sigma_i$ (primary axis), and $\hat g_i$ is the normalized image gradient at the Gaussian center $\mu_i$ (Liang et al., 30 Dec 2025); since the gradient is perpendicular to the local edge, driving this inner product to zero lays each Gaussian's long axis along the edge. The regularization term is weighted within the loss during initial representation fitting and quantization-aware fine-tuning.
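A sketch of such a regularizer, assuming the penalty is the squared cosine between each Gaussian's principal axis and the unit gradient at its center (zero when the primary axis lies along the edge, i.e. perpendicular to the gradient):

```python
import numpy as np

def edge_alignment_penalty(Sigma, grad):
    """Mean squared cosine between each Gaussian's principal eigenvector
    and the (normalized) image gradient at its center.

    Sigma: (N, 2, 2) covariance matrices
    grad:  (N, 2) image gradients at the Gaussian centers
    """
    penalties = []
    for S, g in zip(Sigma, grad):
        evals, evecs = np.linalg.eigh(S)          # ascending eigenvalues
        e = evecs[:, np.argmax(evals)]            # unit principal axis
        gn = g / (np.linalg.norm(g) + 1e-12)      # unit gradient direction
        penalties.append(float(np.dot(e, gn) ** 2))
    return sum(penalties) / len(penalties)

# A Gaussian elongated along x (Sigma = diag(4, 1)) modeling a horizontal
# edge: the gradient points along y, so the penalty is ~0; a gradient
# along x (misaligned primitive) yields the maximal penalty of 1.
Sigma = np.array([[[4.0, 0.0], [0.0, 1.0]]])
aligned = edge_alignment_penalty(Sigma, np.array([[0.0, 1.0]]))
misaligned = edge_alignment_penalty(Sigma, np.array([[1.0, 0.0]]))
```

In a differentiable training loop the eigendecomposition would be replaced by an autodiff-friendly parameterization (e.g. a rotation angle), but the geometric quantity penalized is the same.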
5. Tokenization and Structured Latent Representations
Recent work extends VGQ from compression to visual tokenization by treating each token as a parameterized 2D Gaussian, thereby encoding both structural (position, orientation, scale) and appearance information within the same discrete primitive (Shi et al., 19 Aug 2025). The architecture typically consists of:
- A dual-branch encoder: one branch for appearance codebook quantization, another for geometry codebook quantization.
- Nearest-neighbor assignment to learned codebooks for both geometry and appearance.
- Decoding via “splatted” spatial support of Gaussian tokens into continuous feature maps, followed by fusion (e.g. Hadamard product) and upsampling.
- Supervision via reconstruction, adversarial, and perceptual losses.
Increasing the number of Gaussians per token provides a flexible trade-off between reconstruction fidelity and token efficiency, empirically yielding state-of-the-art reconstruction scores on ImageNet benchmarks (e.g., PSNR 24.93 dB and rFID=0.556 with four Gaussians per token) (Shi et al., 19 Aug 2025).
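The nearest-neighbor codebook assignment at the heart of the dual-branch design can be sketched as follows; the shapes and codebook sizes are illustrative placeholders, not the paper's configuration:

```python
import numpy as np

def assign_codes(z, codebook):
    """Nearest-neighbor quantization: replace each latent vector z_j by its
    closest codebook entry under Euclidean distance.

    z:        (M, D) continuous latents
    codebook: (K, D) learned code vectors
    Returns (indices of shape (M,), quantized latents of shape (M, D)).
    """
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (M, K)
    idx = d.argmin(1)
    return idx, codebook[idx]

# Dual-branch sketch (hypothetical sizes): geometry and appearance latents
# are quantized against separate codebooks; the decoder would then splat
# and fuse the two quantized streams per token.
rng = np.random.default_rng(0)
geo_cb, app_cb = rng.normal(size=(64, 4)), rng.normal(size=(256, 8))
z_geo, z_app = rng.normal(size=(16, 4)), rng.normal(size=(16, 8))
gi, gq = assign_codes(z_geo, geo_cb)
ai, aq = assign_codes(z_app, app_cb)
```

In training, the codebooks and encoder would be updated jointly (e.g. with a commitment loss and straight-through gradients, as in standard VQ pipelines).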
6. Algorithms, Optimization, and Empirical Performance
VGQ algorithms follow a two-stage regime: structure-guided initialization with geometry-regularized fitting, followed by quantization-aware, RD-prioritized fine-tuning.
Key steps include:
- Extraction of local structural priors and allocation of primitives.
- Geometry-aligned fitting to preserve edges and fine structure.
- Adaptive quantization: learning per-primitive bitwidths with STE-based backpropagation.
- Metadata efficiency: encoding bitwidths at minimal overhead (≤0.0004 bpp in (Liang et al., 30 Dec 2025)).
Empirical results demonstrate substantial performance gains:
| Dataset | Method | BD-rate Reduction | PSNR Gain | Decoding Speed (FPS) |
|---|---|---|---|---|
| Kodak (24 images) | VGQ vs GSImage | 43.44% | +1.32 dB | ≥1700 |
| DIV2K ×2 | VGQ vs GSImage | 29.91% | +1.76 dB | ≥1500 |
Adaptive bitwidths concentrate in complex image regions (12–16 bits) and shrink in smooth regions (6–8 bits), amplifying both compression and fidelity (Liang et al., 30 Dec 2025).
7. Extensions, Limitations, and Open Directions
VGQ principles—structure-guided allocation, adaptive quantization, and geometry-consistent regularization—generalize to 3D Gaussian splatting (for video and neural rendering) and to other parametric kernel-based representations (Liang et al., 30 Dec 2025). In 3DGS, hierarchical mixed-precision quantization (inter-/intra-attribute) enables size-constrained, high-fidelity compression with rapid hyperparameter optimization via linear size estimators plus combinatorial solvers (Xie et al., 2024). Vector-quantization frameworks using Gaussian codebooks (e.g., CompGS) achieve compression ratios up to 20× while maintaining real-time rendering (Navaneet et al., 2023).
Limitations include sensitivity to segmentation and gradient estimation quality (which can misallocate primitives), minor metadata and tuning overhead for learned bitwidths, and the potential collapse of important primitives when balancing rate penalties at ultra-low bitrates. Effective regularization and dataset-tuned penalty weights are essential.
A plausible implication is that further integration of learned, structure-adaptive allocation and joint geometric–appearance codebooks will drive future advances in visual quantization, especially for multimodal generative architectures (Shi et al., 19 Aug 2025).