- The paper introduces ResGen, which directly predicts cumulative vector embeddings of masked RVQ tokens to decouple sampling complexity from sequence length and depth.
- ResGen employs a probabilistic discrete diffusion model with a hierarchical masked-prediction strategy, modeling the distribution of cumulative embeddings with a mixture of Gaussians to enhance reconstruction fidelity.
- Experimental results on ImageNet and text-to-speech demonstrate competitive FID, WER, and CER scores with faster sampling and scalable performance.
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
Introduction
This paper introduces ResGen, a discrete diffusion model leveraging Residual Vector Quantization (RVQ) for efficient, high-fidelity generative modeling. The approach is motivated by the need to balance data fidelity and computational efficiency in generative models, particularly when scaling to high-resolution images and long audio sequences. RVQ enables compact, hierarchical token representations, but prior generative models suffer from increased sampling complexity as token depth grows. ResGen addresses this by directly predicting cumulative vector embeddings of masked tokens, decoupling sampling complexity from both sequence length and depth, and formulating the process within a principled probabilistic framework.
RVQ Tokenization and Masked Prediction
RVQ extends VQ-VAE by iteratively quantizing residuals, producing token sequences with greater depth but reduced length, thus maintaining high reconstruction fidelity. The hierarchical structure of RVQ tokens presents challenges for generative modeling, especially for autoregressive models whose sampling steps scale with the product of sequence length and depth.
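The RVQ encode/decode loop is simple enough to sketch directly. Below is a minimal NumPy illustration (variable names and shapes are ours, not the paper's implementation): each depth quantizes the residual left over by the previous depths, and dequantization sums the selected code vectors across depths.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual vector quantization, depth by depth (illustrative sketch).

    z:         (n, d) array of continuous latents.
    codebooks: list of (K, d) arrays, one codebook per quantization depth.
    """
    residual = z.copy()
    tokens = []
    for cb in codebooks:
        # Nearest code vector to the current residual at every position.
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        tokens.append(idx)
        residual = residual - cb[idx]  # pass the remainder to the next depth
    return np.stack(tokens, axis=1)    # (n, depth) token indices

def rvq_decode(tokens, codebooks):
    """Dequantization: sum the code vectors selected at each depth."""
    return sum(cb[tokens[:, l]] for l, cb in enumerate(codebooks))
```

Deeper codebook stacks shrink the residual further, which is why depth buys reconstruction fidelity without lengthening the token sequence.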
ResGen introduces a masking strategy tailored for RVQ tokens, progressively masking tokens from the highest quantization layers (fine details) to the lowest (coarse features). During training, the model predicts the sum of masked embeddings at each position, rather than individual tokens, using a mixture of Gaussians to model the latent distribution. This multi-token prediction avoids conditional independence assumptions along depth and aligns with the RVQ dequantization process.
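The regression target follows directly from the dequantization rule above: if the layers from some depth onward are masked at a position, the vector the decoder is missing is exactly the sum of those layers' code embeddings. A hypothetical helper makes this concrete (names are ours):

```python
import numpy as np

def cumulative_target(tokens, codebooks, mask_depth):
    """Cumulative-embedding target for one position (illustrative).

    tokens:     (L,) token indices at a single position, coarse to fine.
    codebooks:  list of L (K, d) codebook arrays.
    mask_depth: layers with index >= mask_depth are masked (finest first).
    """
    return sum(codebooks[l][tokens[l]] for l in range(mask_depth, len(tokens)))
```

Predicting this single vector per position, rather than one token per masked depth, is what keeps the number of model evaluations independent of depth.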
Figure 1: Overview of forward masking and reverse unmasking in ResGen, illustrating coarse-to-fine token filling via cumulative RVQ embedding prediction.
The token masking and prediction process is cast as a discrete diffusion model. The forward process incrementally masks tokens without replacement, modeled as draws from a multivariate hypergeometric distribution. The reverse process reconstructs the original sequence by iteratively predicting cumulative embeddings and quantizing them back to tokens.
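As a sketch of the forward process (under our own naming), suppose `remaining[i]` counts the still-unmasked depths at position `i`. Masking `n_mask` tokens uniformly without replacement from the pool of all unmasked tokens gives per-position masking counts that follow a multivariate hypergeometric law, which NumPy can sample directly:

```python
import numpy as np

def forward_mask_step(remaining, n_mask, rng):
    """One forward-masking step (illustrative sketch).

    remaining: (N,) int array of still-unmasked depths per position.
    n_mask:    total tokens to mask at this step (<= remaining.sum()).
    At each position, tokens are then masked fine-to-coarse.
    """
    counts = rng.multivariate_hypergeometric(remaining, n_mask)
    return remaining - counts

rng = np.random.default_rng(0)
remaining = np.full(16, 8)  # e.g. 16 positions, 8 RVQ depths each
remaining = forward_mask_step(remaining, n_mask=40, rng=rng)
```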
The training objective is derived from the variational lower bound of the data log-likelihood, with losses corresponding to prior, diffusion, and reconstruction terms. The multi-token prediction is formalized via variational inference, focusing on the likelihood of cumulative embeddings conditioned on partially masked sequences.
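Schematically, in generic discrete-diffusion notation (not necessarily the paper's exact symbols), the bound decomposes into the three terms named above:

```latex
\log p_\theta(x_0) \;\ge\;
\underbrace{\mathbb{E}_q\big[\log p_\theta(x_0 \mid x_1)\big]}_{\text{reconstruction}}
\;-\; \sum_{t=2}^{T}
\underbrace{\mathbb{E}_q\big[D_{\mathrm{KL}}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)\big]}_{\text{diffusion}}
\;-\; \underbrace{D_{\mathrm{KL}}\big(q(x_T\mid x_0)\,\|\,p(x_T)\big)}_{\text{prior}}
```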
Implementation Details
ResGen is implemented using a transformer backbone similar to DiT-XL, with modifications for RVQ token embedding and mixture-of-Gaussians output heads. Training involves a masking schedule that distributes masked tokens across positions and depths, and the mixture-of-Gaussians objective is decomposed into classification and regression losses to encourage diverse component usage. Low-rank projection is employed to scale the number of mixture components efficiently.
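A minimal PyTorch sketch of such a head (class and parameter names are hypothetical, not the paper's code): each mixture component needs a mean in the embedding space, but the means are produced through a shared low-rank expansion so the head's cost scales with `rank` rather than with `n_comp * d`.

```python
import torch
import torch.nn as nn

class LowRankMoGHead(nn.Module):
    """Mixture-of-Gaussians output head with low-rank means (sketch)."""

    def __init__(self, hidden, d, n_comp, rank):
        super().__init__()
        self.logits = nn.Linear(hidden, n_comp)       # mixture weights
        self.down = nn.Linear(hidden, n_comp * rank)  # per-component low-rank code
        self.up = nn.Linear(rank, d, bias=False)      # shared expansion to embedding dim
        self.n_comp, self.rank = n_comp, rank

    def forward(self, h):                             # h: (..., hidden)
        logits = self.logits(h)                       # (..., n_comp)
        codes = self.down(h).unflatten(-1, (self.n_comp, self.rank))
        means = self.up(codes)                        # (..., n_comp, d)
        return logits, means
```

Training would then pair a classification loss over `logits` (which component is responsible) with a regression loss from that component's mean to the cumulative-embedding target, mirroring the decomposition described above.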
Sampling proceeds iteratively, with confidence-based unmasking guided by log probabilities derived from the mixture model. Temperature scaling and top-p sampling are used to balance diversity and fidelity.
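One reverse iteration can be sketched as follows (a simplified illustration under our own naming; for brevity the mixture is collapsed to the sampled component's mean rather than sampled from the full Gaussian): apply temperature and top-p to the mixture logits, sample a component per position, score confidence by the chosen component's probability, and commit the most confident masked positions.

```python
import torch

def unmask_step(logits, means, masked, n_unmask, temperature=1.0, top_p=0.9):
    """One confidence-based unmasking step (illustrative sketch).

    logits: (N, C) mixture-component logits at every position.
    means:  (N, C, d) component means, i.e. cumulative-embedding predictions.
    masked: (N,) boolean mask of still-masked positions.
    """
    probs = torch.softmax(logits / temperature, dim=-1)

    # Top-p filtering over mixture components, then sample one per position.
    sorted_p, order = probs.sort(dim=-1, descending=True)
    keep = sorted_p.cumsum(-1) - sorted_p < top_p        # top component always kept
    filtered = torch.where(keep, sorted_p, torch.zeros_like(sorted_p))
    choice = torch.multinomial(filtered / filtered.sum(-1, keepdim=True), 1)
    comp = order.gather(-1, choice).squeeze(-1)          # (N,) chosen components

    # Confidence = log-probability of the chosen component; unmask the most
    # confident still-masked positions first.
    conf = probs.gather(-1, comp[:, None]).squeeze(-1).log()
    conf = conf.masked_fill(~masked, float("-inf"))
    pick = conf.topk(n_unmask).indices
    pred = means[torch.arange(means.size(0)), comp]      # (N, d) predicted embeddings
    return pick, pred[pick]                              # positions to fill + their vectors
```

The committed embeddings would then be quantized back to RVQ tokens before the next iteration.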
Experimental Results
ResGen is evaluated on conditional image generation (ImageNet 256×256) and zero-shot text-to-speech synthesis. In both domains, ResGen matches or surpasses autoregressive and non-autoregressive baselines while offering favorable trade-offs among quality, sampling speed, and memory efficiency.
Figure 3: VAR-d30 (FID = 1.92), included as a reference baseline for comparison with ResGen.
Figure 4: Randomly generated 256×256 samples by ResGen trained on ImageNet, demonstrating high-fidelity synthesis.
Ablation and Sampling Analysis
Ablation studies show that increasing the number of sampling steps improves generation quality, and that temperature and top-p scaling can be tuned to trade off diversity against fidelity. The model reaches high quality within relatively few iterations, which the authors attribute to unmasking being an easier prediction task than the denoising steps of conventional diffusion models.
Implications and Future Directions
ResGen's decoupling of sampling complexity from token depth and sequence length enables scalable, efficient generative modeling for high-resolution images and long audio sequences. The approach is generalizable across modalities and demonstrates strong memory efficiency, making it suitable for deployment in resource-constrained environments.
Potential future directions include:
- Incorporating key-value caching in transformers to further accelerate sampling.
- Extending the framework to support alternative quantization methods such as Finite Scalar Quantization (FSQ).
- Providing a formal theoretical justification for the observed efficiency in low-step inference.
Conclusion
ResGen presents an efficient, principled approach to generative modeling with RVQ-based tokens, achieving high-fidelity synthesis without compromising speed or scalability. The direct prediction of cumulative embeddings, probabilistic formulation, and empirical results establish ResGen as a robust solution for discrete generative modeling in both vision and audio domains. The methodology opens avenues for further research in quantization strategies, transformer optimization, and theoretical analysis of discrete diffusion processes.