ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

Published 13 May 2026 in cs.CV, cs.AI, and cs.LG | (2605.13517v1)

Abstract: Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent-space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: https://github.com/goals4292/ArcVQ-VAE

Authors (2)

Summary

The paper introduces a spherical angular-margin framework that mitigates codebook collapse through norm regularization and angular margin techniques.
It employs Ball-Bounded Norm Regularization and ArcCosine Additive Margin Loss to enforce geometric constraints on the codebook, with reconstruction and generative performance reported as competitive with or better than baselines across datasets.
Experimental results show improved FID and PSNR together with higher codebook utilization.

Introduction and Motivation

ArcVQ-VAE addresses persistent limitations of Vector Quantized Variational Autoencoders (VQ-VAE) in discrete representation learning for image modeling. Because standard VQ-VAE frameworks tokenize images with a finite codebook, they suffer from restricted expressive capacity and from codebook collapse, in which only a small fraction of codebook entries become active while the rest go unused. This geometric imbalance limits codebook utilization and representational richness, which in turn hurts reconstruction fidelity, generative quality, and the robustness of the learned representations.
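
Codebook collapse as described above can be quantified from the code indices chosen during quantization. The following is a minimal sketch, not the paper's evaluation code: the function name `codebook_usage`, the toy assignment lists, and the use of perplexity as a second diagnostic are assumptions made for illustration.

```python
import math
from collections import Counter


def codebook_usage(assignments, codebook_size):
    """Return (utilization, perplexity) for a list of selected code indices."""
    counts = Counter(assignments)
    total = len(assignments)
    # Utilization: fraction of codebook entries selected at least once.
    utilization = len(counts) / codebook_size
    # Perplexity: exp of the entropy of the empirical code distribution.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return utilization, math.exp(entropy)


# Hypothetical toy data for a codebook of size 8.
collapsed = [0] * 90 + [1] * 10          # almost everything maps to code 0
balanced = [i % 8 for i in range(96)]    # every code is used equally often

collapsed_stats = codebook_usage(collapsed, 8)
balanced_stats = codebook_usage(balanced, 8)
print(collapsed_stats, balanced_stats)
```

In this toy case the collapsed assignments use a quarter of the codebook, while the balanced assignments use all of it and reach a perplexity equal to the codebook size.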

To resolve these issues, ArcVQ-VAE introduces a Spherical Angular-Margin Prior (SAMP) by integrating Ball-Bounded Norm Regularization and ArcCosine Additive Margin Loss. These innovations directly impose geometric constraints and angular separability on codebook vectors and latent representations, enabling more uniform dispersion of codebook entries in the latent space and mitigating collapse mechanisms without additional network modules or nontrivial computational overhead.

Geometric Regularization: Ball-Bounded Norm and ArcCosine Margin

ArcVQ-VAE leverages Ball-Bounded Norm Regularization to constrain codebook vector norms within a dynamically relaxed Euclidean ball defined by a time-dependent upper bound $M(t) = \exp(\alpha t)$. Early in training, stringent norm bounds ensure balanced codebook activation, while gradual relaxation affords adaptive representational flexibility. This step restricts the magnitude of frequently used codebook vectors and prevents over-concentration, thus promoting equitable codebook competition and wider latent-space coverage.
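
To make the bound concrete, here is a small sketch of one way a time-dependent ball constraint could be applied. It is an illustration under stated assumptions rather than the paper's implementation: the text above does not say whether the constraint is a hard projection or a soft penalty, what unit $t$ is measured in, or what value $\alpha$ takes. The sketch uses a hard projection, treats `t` as a training-step counter, and picks `alpha=0.01` arbitrarily; all function names are hypothetical.

```python
import math


def norm_bound(t, alpha):
    """Time-dependent radius M(t) = exp(alpha * t)."""
    return math.exp(alpha * t)


def project_to_ball(vector, radius):
    """Rescale a vector so that its Euclidean norm does not exceed radius."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= radius:
        return list(vector)
    scale = radius / norm
    return [x * scale for x in vector]


def bound_codebook(codebook, t, alpha):
    """Apply the ball constraint (assumed here: hard projection) to every entry."""
    radius = norm_bound(t, alpha)
    return [project_to_ball(v, radius) for v in codebook]


# Hypothetical two-entry codebook with norms 5.0 and 0.5.
codebook = [[3.0, 4.0], [0.3, 0.4]]
early = bound_codebook(codebook, t=0, alpha=0.01)    # radius exp(0) = 1
late = bound_codebook(codebook, t=200, alpha=0.01)   # radius exp(2), about 7.39
print(early, late)
```

At `t=0` the large vector is pulled back onto the unit ball while the small one is untouched; once the radius has grown past 5, neither vector is modified, which mirrors the "strict early, relaxed later" behavior described above.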

Building on hyperspherical learning principles and angular-margin losses as used in metric learning, ArcCosine Additive Margin Loss (ArcLoss) is introduced during quantization. Latent vectors and codebook entries are $\ell_2$-normalized, transforming nearest-neighbor assignment into angular-similarity matching on the hypersphere. The additive angular margin $m$ forces latent vectors to be more tightly aligned with their associated codebook vectors while maximizing angular separation from others. This configuration cultivates discriminative and well-partitioned latent neighborhoods, substantially boosting codebook utilization and empirical coverage.
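
The angular-margin idea can be sketched in the style of additive angular margin losses from metric learning. This is a hedged illustration only: the exact loss used in the paper may differ, the interpretation of $s$ as a logit scale is an assumption (the summary lists $s$ as a hyperparameter without defining it), and the values of `m` and `s`, the clamping of the margined angle to $\pi$, and all function names are hypothetical choices.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arc_margin_loss(latent, codebook, target, m=0.5, s=16.0):
    """Softmax cross-entropy over scaled cosines, with margin m added to the
    angle between the latent and its assigned (target) codebook entry."""
    z = l2_normalize(latent)
    logits = []
    for k, code in enumerate(codebook):
        cos = sum(a * b for a, b in zip(z, l2_normalize(code)))
        cos = max(-1.0, min(1.0, cos))       # guard acos against rounding error
        theta = math.acos(cos)
        if k == target:
            theta = min(theta + m, math.pi)  # additive angular margin (clamped)
        logits.append(s * math.cos(theta))
    # Numerically stable log-sum-exp for the cross-entropy.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


# Hypothetical example: the latent already points exactly at code 0.
z = [1.0, 0.0]
codes = [[2.0, 0.0], [0.0, 3.0]]
plain = arc_margin_loss(z, codes, target=0, m=0.0, s=4.0)
with_margin = arc_margin_loss(z, codes, target=0, m=0.5, s=4.0)
print(plain, with_margin)
```

Even for a perfectly aligned latent, the margin raises the loss, so minimizing it pushes the other codebook entries further away in angle rather than only pulling the assigned entry closer.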

Experimental Design and Strong Results

ArcVQ-VAE is evaluated on MNIST, CIFAR-10, FFHQ, and ImageNet-1K against standard VQ-VAE and VQGAN baselines as well as recently proposed variants (HVQ-VAE, SQ-VAE, CVQ-VAE, VQGAN-LC, MoVQ). The reported results are:

Superior or competitive reconstruction accuracy (PSNR, SSIM), perceptual quality (LPIPS, FID), and codebook utilization rates compared to prior methods.
On ImageNet-1K, with only $K = 1024$ codebook entries, ArcVQ-VAE attains high reconstruction performance and sample fidelity, with FID improving upon or closely matching models using vastly larger codebooks or auxiliary components.
On CIFAR-10, ArcVQ-VAE with a PixelCNN generative prior improves FID from 57.36 (VQ-VAE baseline) to 43.13 (lower is better), supporting the transferability of its tokenizer to other generative priors.
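
For readers unfamiliar with the reconstruction metrics listed above, PSNR is the simplest to state: it is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where higher is better. The sketch below is a generic illustration on flattened pixel lists, not the paper's evaluation pipeline; the function name, the `max_value=1.0` pixel range, and the toy inputs are assumptions.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical two-pixel "images" with a per-pixel error of 0.1.
score = psnr([0.5, 0.5], [0.6, 0.4])
print(score)
```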

Qualitative analyses highlight ArcVQ-VAE's ability to preserve local details and structural contours more faithfully than baseline VQGANs. Visualizations of codebook vector distributions illustrate more uniform dispersion and activation, confirming reduced collapse and geometric redundancy.

Notably, ablation studies confirm that Ball-Bounded Norm Regularization and ArcLoss are each beneficial on their own, but that their combination yields the largest gains. Codebook utilization is robust to moderate variations in hyperparameters (top-$k$, $m$, $s$), and time-dependent norm schedules outperform constant bounds.

Theoretical and Practical Implications

ArcVQ-VAE exemplifies a geometry-aware approach to vector quantization, reconciling discrete tokenization with hyperspherical uniformity and angular margin separation. This framework:

Facilitates more efficient codebook usage under stringent budget constraints, enabling state-of-the-art discrete representation learning without reliance on auxiliary pretrained models, clustering mechanisms, or codebook scaling.
Suggests that geometric regularization—rather than mere codebook utilization frequency or stochastic selection—can directly improve both reconstruction and generative performance, especially in high-diversity datasets.
Demonstrates direct transferability to alternative generative priors (e.g., PixelCNN), indicating that improved latent structure is beneficial beyond diffusion models.

These contributions open avenues to further integrate hyperspherical learning and angular margin techniques into VQ-based model architectures. While the empirical gains are robust, the theoretical connection between geometric regularization and downstream generative or semantic quality merits deeper exploration, as does direct semantic evaluation of codebook entries.

Future Directions

While ArcVQ-VAE achieves consistent empirical improvements, future research should investigate:

Adaptive norm-bound and loss-weight schedules that respond to training dynamics or dataset complexity.
Explicit semantic partitioning in codebook vectors, potentially utilizing external supervision or hierarchical constraints.
Theoretical analysis of the gradient flow through the straight-through estimator (STE) under geometric losses, and of the precise role of margin regularization in mitigating codebook collapse.
Broader transferability to text, audio, and multimodal VQ-based representation learning.

Conclusion

ArcVQ-VAE advances VQ-based representation learning by enforcing a spherical angular-margin prior—combining norm constraints and angular margin loss—to rectify geometric imbalances and codebook collapse. The resulting framework achieves superior codebook utilization, reconstruction, and generative quality, and is robust across datasets and generative priors. This work establishes a principled foundation for geometry-aware vector quantization, with practical benefits for image compression, generation, and efficient representation learning, and invites further exploration of hyperspherical regularization in discrete latent variable modeling.
