- The paper introduces a continuous VAR framework that leverages strictly proper scoring rules and energy-based objectives to generate continuous visual data.
- It replaces discrete token prediction with an energy Transformer whose MLP head samples continuous tokens in a single pass, dramatically reducing inference latency while maintaining fidelity.
- Experimental results demonstrate competitive FID scores and enhanced efficiency, highlighting the model's scalability and practical impact.
Continuous Visual Autoregressive Generation via Score Maximization
This paper introduces a novel approach to visual autoregressive generation by proposing a Continuous VAR framework that bypasses the limitations of vector quantization typically required for visual data. By utilizing strictly proper scoring rules, the framework harnesses statistical tools to optimize generative models on continuous variables, specifically through energy-based objectives.
Limitations of Discrete Autoregressive Models
Traditionally, autoregressive models rely on quantization to transform continuous visual data into discrete tokens, which are manageable within the model's finite vocabulary. However, this quantization process incurs substantial information loss, severely impacting generation quality due to the mismatch between the true continuous nature of the visual data and its discrete representation. This has led to a growing interest in directly modeling continuous modalities without the need for quantization, yet achieving this efficiently remains challenging.
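To make the information loss concrete, here is a minimal sketch (NumPy, with hypothetical token and codebook sizes, not the paper's setup) of nearest-neighbor vector quantization and the reconstruction error it introduces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous "visual tokens": 1024 vectors in R^8 (illustrative sizes).
tokens = rng.normal(size=(1024, 8))

# A small codebook, as in VQ-based models.
codebook = rng.normal(size=(16, 8))

# Nearest-neighbor quantization: each token snaps to its closest code.
dists = np.linalg.norm(tokens[:, None, :] - codebook[None, :, :], axis=-1)
quantized = codebook[dists.argmin(axis=1)]

# The reconstruction error is the information irrecoverably lost to discretization.
mse = np.mean((tokens - quantized) ** 2)
print(f"quantization MSE: {mse:.3f}")
```

However large the codebook, the error is nonzero whenever the data are truly continuous, which is the mismatch the Continuous VAR framework avoids.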
The Continuous VAR Framework
Theoretical Foundation
The paper builds on the theory of strictly proper scoring rules, which measure how well a predicted distribution aligns with the true data distribution and are uniquely maximized in expectation by truthful predictions, so models are penalized for any deviation from the correct forecast. These scoring rules have long been used in domains such as weather forecasting and probabilistic model evaluation. Applying them to autoregressive models allows training objectives to be based on scores such as the energy score, which are likelihood-free and therefore suitable for continuous spaces.
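The defining property can be checked numerically. The toy example below (a discrete forecast scored with the logarithmic scoring rule, chosen for brevity rather than taken from the paper) shows that the truthful forecast uniquely maximizes the expected score:

```python
import numpy as np

# True distribution over 3 outcomes.
p = np.array([0.5, 0.3, 0.2])

def expected_log_score(q, p):
    """Expected log score E_{y~p}[log q(y)] of forecast q under truth p."""
    return float(np.sum(p * np.log(q)))

# Candidate forecasts: only the first equals the truth.
candidates = [
    np.array([0.5, 0.3, 0.2]),   # the true distribution
    np.array([0.4, 0.4, 0.2]),
    np.array([0.6, 0.2, 0.2]),
]
scores = [expected_log_score(q, p) for q in candidates]

# Strict propriety: the truthful forecast uniquely maximizes the expected score.
assert max(range(3), key=lambda i: scores[i]) == 0
```

The energy score extends this guarantee to continuous distributions, where no finite vocabulary exists to normalize over.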
Application of Score Maximization
The proposed mechanism uses the energy score as a training objective, enabling the autoregressive model to produce predictions that closely match the true distribution. This circumvents explicit likelihood estimation, which is often infeasible in continuous spaces due to the absence of a finite vocabulary. Instead, the approach relies on unbiased Monte Carlo estimators of the score, which retain predictive fidelity while enabling purely sampling-based training.
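A minimal sketch of such an objective, assuming PyTorch and the standard two-sample unbiased estimator of the energy score (the exact formulation and hyperparameters in the paper may differ):

```python
import torch

def energy_score_loss(samples_a, samples_b, target, beta=1.0):
    """
    Negated energy score, estimated with two independent model samples.

    samples_a, samples_b: independent draws from the model, shape (B, D)
    target: ground-truth continuous token, shape (B, D)

    The energy score S = 0.5 * E||X - X'||^beta - E||X - y||^beta is
    strictly proper for beta in (0, 2); minimizing this loss maximizes
    the score using samples alone, so no likelihood is ever computed.
    """
    # Diversity term: distance between the two model samples.
    term_diversity = samples_a.sub(samples_b).norm(dim=-1).pow(beta)
    # Fidelity term: average distance from each sample to the target.
    term_fidelity = 0.5 * (
        samples_a.sub(target).norm(dim=-1).pow(beta)
        + samples_b.sub(target).norm(dim=-1).pow(beta)
    )
    return (term_fidelity - 0.5 * term_diversity).mean()
```

The loss is zero when both samples coincide with the target, and the diversity term keeps the model from collapsing to a single point prediction.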
Practical Implementation Details
The energy Transformer employs an architecture analogous to that of discrete Transformers but replaces the softmax layer with an MLP generator that samples predictions. Unlike diffusion-based models, this architecture supports efficient single-pass generation, significantly reducing inference latency. The MLP generator takes random noise as input to approximate the complex distribution of continuous tokens, similar to GAN generators.
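The head described above might look like the following hypothetical PyTorch sketch; the class name, layer sizes, and activation are illustrative, not the paper's exact configuration:

```python
import torch
from torch import nn

class EnergyHead(nn.Module):
    """
    Sketch of the MLP generator head: it maps the Transformer's hidden
    state plus fresh random noise to a continuous token in one forward
    pass, replacing the softmax layer of a discrete model.
    """
    def __init__(self, hidden_dim=768, noise_dim=64, token_dim=16):
        super().__init__()
        self.noise_dim = noise_dim
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + noise_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, h):
        # Fresh noise per call makes the head a sampler, as in a GAN generator:
        # repeated calls on the same hidden state yield different tokens.
        z = torch.randn(h.shape[0], self.noise_dim, device=h.device)
        return self.mlp(torch.cat([h, z], dim=-1))
```

Because one forward pass yields one sample, generation costs a single network evaluation per token, in contrast to the many denoising steps of a diffusion head.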
Several techniques are proposed to further refine model outputs:
- Temperature Scaling: Both training and inference benefit from temperature-adjusted scoring rules, enhancing precision without undermining sample diversity.
- Classifier-Free Guidance: Improves conditional generation quality by blending predictions made with and without the conditioning signal at inference time.
- Masked Autoregressive Generation: This technique allows bidirectional attention, thereby enhancing representation learning and overall fidelity compared to causal models.
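The guidance step above can be sketched with the generic classifier-free guidance blend (the common formulation, which may differ from the paper's exact variant) applied to continuous token predictions:

```python
import torch

def cfg_blend(cond_pred, uncond_pred, guidance_scale=3.0):
    """
    Classifier-free guidance: extrapolate the conditional prediction
    away from the unconditional (condition-masked) one. A scale of 1.0
    recovers the conditional prediction; larger scales trade diversity
    for condition adherence.
    """
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)
```

At inference, the model is evaluated twice per step, once with the condition and once with it masked, and the two outputs are combined by this function.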
Experimental Evaluation
Experimental results reveal that the proposed energy Transformer, referred to as EAR, significantly outperforms baseline models in both quality and efficiency, and its advantages hold up as model size grows. Notably, EAR achieves competitive FID scores with markedly smaller parameter counts and lower inference times than diffusion-based methodologies.
Latency and Efficiency Comparison
In benchmarking against diffusion-based models, EAR demonstrates superior inference efficiency, generating high-quality images in a fraction of the time required by competing models. This efficiency follows from the single-pass sampling head and the expressive capacity of the architecture, which adapts to the inherent complexity of visual data without constraining outputs to a predefined distribution family.
Conclusion
This work establishes a versatile foundation for autoregressive modeling of continuous data by integrating strictly proper scoring rules as loss functions. Through exploring energy-based scoring objectives, the paper introduces a highly expressive generation architecture that efficiently produces high-quality visual outputs. The implications extend to potential future work in optimizing model architectures, exploring alternative scoring criteria, and applying these techniques across various continuous domains, including video and audio.
The Continuous VAR framework offers a promising path forward in visual generation, emphasizing the importance and practicality of model expressiveness and adherence to theoretically grounded training objectives.