- The paper introduces a continuous VAR framework that leverages strictly proper scoring rules and energy-based objectives to generate continuous visual data.
- It replaces discrete token prediction with an energy Transformer whose MLP head samples continuous tokens in a single pass, dramatically reducing inference latency while maintaining fidelity.
- Experimental results demonstrate competitive FID scores and enhanced efficiency, highlighting the model's scalability and practical impact.
Continuous Visual Autoregressive Generation via Score Maximization
This paper introduces a novel approach to visual autoregressive generation by proposing a Continuous VAR framework that bypasses the limitations of vector quantization typically required for visual data. By utilizing strictly proper scoring rules, the framework harnesses statistical tools to optimize generative models on continuous variables, specifically through energy-based objectives.
Limitations of Discrete Autoregressive Models
Traditionally, autoregressive models rely on quantization to transform continuous visual data into discrete tokens, which are manageable within the model's finite vocabulary. However, this quantization process incurs substantial information loss, severely impacting generation quality due to the mismatch between the true continuous nature of the visual data and its discrete representation. This has led to a growing interest in directly modeling continuous modalities without the need for quantization, yet achieving this efficiently remains challenging.
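To make the information loss concrete, here is a minimal sketch (NumPy, with hypothetical token and codebook sizes, not the paper's setup) of nearest-neighbor vector quantization and the reconstruction error it introduces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous "visual tokens": 1024 vectors in R^8 (illustrative sizes).
tokens = rng.normal(size=(1024, 8))

# A small codebook, as in VQ-based models.
codebook = rng.normal(size=(16, 8))

# Nearest-neighbor quantization: each token snaps to its closest code.
dists = np.linalg.norm(tokens[:, None, :] - codebook[None, :, :], axis=-1)
quantized = codebook[dists.argmin(axis=1)]

# The reconstruction error is the information irrecoverably lost to discretization.
mse = np.mean((tokens - quantized) ** 2)
print(f"quantization MSE: {mse:.3f}")
```

However large the codebook, the error is nonzero whenever the data are truly continuous, which is the mismatch the Continuous VAR framework avoids.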
The Continuous VAR Framework
Theoretical Foundation
The paper builds on the theory of strictly proper scoring rules, which measure how well a predicted distribution aligns with the true data distribution and are uniquely maximized in expectation by truthful predictions, so models are penalized for any deviation from the correct forecast. These scoring rules have long been used in domains such as weather forecasting and probabilistic model evaluation. Applying them to autoregressive models allows training objectives to be based on scores such as the energy score, which are likelihood-free and therefore suitable for continuous spaces.
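The defining property can be checked numerically. The toy example below (a discrete forecast scored with the logarithmic scoring rule, chosen for brevity rather than taken from the paper) shows that the truthful forecast uniquely maximizes the expected score:

```python
import numpy as np

# True distribution over 3 outcomes.
p = np.array([0.5, 0.3, 0.2])

def expected_log_score(q, p):
    """Expected log score E_{y~p}[log q(y)] of forecast q under truth p."""
    return float(np.sum(p * np.log(q)))

# Candidate forecasts: only the first equals the truth.
candidates = [
    np.array([0.5, 0.3, 0.2]),   # the true distribution
    np.array([0.4, 0.4, 0.2]),
    np.array([0.6, 0.2, 0.2]),
]
scores = [expected_log_score(q, p) for q in candidates]

# Strict propriety: the truthful forecast uniquely maximizes the expected score.
assert max(range(3), key=lambda i: scores[i]) == 0
```

The energy score extends this guarantee to continuous distributions, where no finite vocabulary exists to normalize over.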
Application of Score Maximization
The proposed mechanism uses the energy score as a training objective, enabling the autoregressive model to produce predictions that closely match the true distribution. This circumvents explicit likelihood estimation, which is often infeasible in continuous spaces due to the absence of a finite vocabulary. Instead, the approach relies on unbiased Monte Carlo estimators of the score, which retain predictive fidelity while enabling purely sampling-based training.
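A minimal sketch of such an objective, assuming PyTorch and the standard two-sample unbiased estimator of the energy score (the exact formulation and hyperparameters in the paper may differ):

```python
import torch

def energy_score_loss(samples_a, samples_b, target, beta=1.0):
    """
    Negated energy score, estimated with two independent model samples.

    samples_a, samples_b: independent draws from the model, shape (B, D)
    target: ground-truth continuous token, shape (B, D)

    The energy score S = 0.5 * E||X - X'||^beta - E||X - y||^beta is
    strictly proper for beta in (0, 2); minimizing this loss maximizes
    the score using samples alone, so no likelihood is ever computed.
    """
    # Diversity term: distance between the two model samples.
    term_diversity = samples_a.sub(samples_b).norm(dim=-1).pow(beta)
    # Fidelity term: average distance from each sample to the target.
    term_fidelity = 0.5 * (
        samples_a.sub(target).norm(dim=-1).pow(beta)
        + samples_b.sub(target).norm(dim=-1).pow(beta)
    )
    return (term_fidelity - 0.5 * term_diversity).mean()
```

The loss is zero when both samples coincide with the target, and the diversity term keeps the model from collapsing to a single point prediction.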
Practical Implementation Details
The energy Transformer employs an architecture analogous to that of discrete Transformers but replaces the softmax layer with an MLP generator that samples predictions. Unlike diffusion-based models, this architecture supports efficient single-pass generation, significantly reducing inference latency. The MLP generator takes random noise as input to approximate the complex distribution of continuous tokens, similar to GAN generators.
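The head described above might look like the following hypothetical PyTorch sketch; the class name, layer sizes, and activation are illustrative, not the paper's exact configuration:

```python
import torch
from torch import nn

class EnergyHead(nn.Module):
    """
    Sketch of the MLP generator head: it maps the Transformer's hidden
    state plus fresh random noise to a continuous token in one forward
    pass, replacing the softmax layer of a discrete model.
    """
    def __init__(self, hidden_dim=768, noise_dim=64, token_dim=16):
        super().__init__()
        self.noise_dim = noise_dim
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + noise_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, h):
        # Fresh noise per call makes the head a sampler, as in a GAN generator:
        # repeated calls on the same hidden state yield different tokens.
        z = torch.randn(h.shape[0], self.noise_dim, device=h.device)
        return self.mlp(torch.cat([h, z], dim=-1))
```

Because one forward pass yields one sample, generation costs a single network evaluation per token, in contrast to the many denoising steps of a diffusion head.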
Several techniques are proposed to further refine model outputs:
- Temperature Scaling: Both training and inference benefit from temperature-adjusted scoring rules, enhancing precision without undermining sample diversity.
- Classifier-Free Guidance: Improves conditional generation quality by blending predictions made with and without the conditioning signal at inference time.
- Masked Autoregressive Generation: This technique allows bidirectional attention, thereby enhancing representation learning and overall fidelity compared to causal models.
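The guidance step above can be sketched with the generic classifier-free guidance blend (the common formulation, which may differ from the paper's exact variant) applied to continuous token predictions:

```python
import torch

def cfg_blend(cond_pred, uncond_pred, guidance_scale=3.0):
    """
    Classifier-free guidance: extrapolate the conditional prediction
    away from the unconditional (condition-masked) one. A scale of 1.0
    recovers the conditional prediction; larger scales trade diversity
    for condition adherence.
    """
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)
```

At inference, the model is evaluated twice per step, once with the condition and once with it masked, and the two outputs are combined by this function.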
Experimental Evaluation
Experimental results reveal that the proposed energy Transformer, referred to as EAR, significantly outperforms baseline models in both quality and efficiency, and its advantages hold up as model size grows. Notably, EAR achieves competitive FID scores with markedly smaller parameter counts and lower inference times than diffusion-based methodologies.
Latency and Efficiency Comparison
In benchmarking against diffusion-based models, EAR demonstrates superior inference efficiency, generating high-quality images in a fraction of the time required by competing models. This efficiency follows from the single-pass sampling head and the expressive capacity of the architecture, which adapts to the inherent complexity of visual data without constraining outputs to a predefined distribution family.
Conclusion
This work establishes a versatile foundation for autoregressive modeling of continuous data by integrating strictly proper scoring rules as loss functions. Through exploring energy-based scoring objectives, the paper introduces a highly expressive generation architecture that efficiently produces high-quality visual outputs. The implications extend to potential future work in optimizing model architectures, exploring alternative scoring criteria, and applying these techniques across various continuous domains, including video and audio.
The Continuous VAR framework offers a promising path forward in visual generation, emphasizing the importance and practicality of model expressiveness and adherence to theoretically grounded training objectives.