- The paper introduces a novel RAR approach that integrates randomness annealing to improve bidirectional context in autoregressive visual generation.
- The method employs a permutation probability that linearly decays, transitioning from random token ordering to a deterministic raster order during training.
- RAR achieves state-of-the-art results on ImageNet-256 with an FID score of 1.48, outperforming larger models while using fewer parameters.
Insightful Overview of "Randomized Autoregressive Visual Generation"
The paper presents a novel approach named Randomized AutoRegressive modeling (RAR) for addressing challenges in visual generation with autoregressive models. This method augments the conventional autoregressive framework with a randomness annealing training strategy, effectively enhancing bidirectional context modeling without deviating from the core autoregressive principles. The procedure randomly permutes the token sequence during training, an ordering that gradually anneals to the standard raster order as training progresses. This strategy capitalizes on the strengths of both the autoregressive and bidirectional modeling paradigms.
Methodology and Technical Contributions
RAR remains fully compatible with existing language modeling frameworks and demonstrates significant improvements over prior approaches in autoregressive image generation. The central methodological advancement of this work is the permutation probability r, which dictates the fraction of training instances that use randomly permuted token orders. Initially set to 1, r linearly decays to 0, so that sequences revert to a deterministic raster order by the end of training. This annealing strategy maximizes the model's expected likelihood over all possible factorization orders, refining its capacity to model bidirectional context without breaking the autoregressive framework.
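The annealing schedule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`permutation_prob`, `training_order`) and the schedule boundaries are my own assumptions, and the paper may anneal over a sub-window of training rather than the full run.

```python
import random

def permutation_prob(step, total_steps):
    """Permutation probability r: linearly decays from 1 to 0 over training.

    Assumed schedule: anneal across the whole run; the actual method may
    restrict annealing to a start/end window of training.
    """
    return max(0.0, 1.0 - step / total_steps)

def training_order(num_tokens, step, total_steps, rng=random):
    """Pick the token ordering for one training instance.

    With probability r, use a uniformly random permutation of token
    indices; otherwise fall back to the deterministic raster order.
    """
    r = permutation_prob(step, total_steps)
    order = list(range(num_tokens))
    if rng.random() < r:
        rng.shuffle(order)  # random factorization order
    return order            # raster order once r has decayed to 0
```

Early in training every instance sees a random factorization order (r = 1), exposing the model to bidirectional context; by the end (r = 0) the objective coincides with standard raster-order autoregressive training.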
The paper positions RAR against existing state-of-the-art methods through rigorous experimentation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, outperforming contemporary autoregressive, diffusion, and masked transformer models. Notably, the efficiency of RAR is evident as RAR-B, featuring just 261 million parameters, outstrips methods like LlamaGen-XXL and Open-MAGVIT2-XL in performance metrics, despite its smaller scale. The reported results underscore RAR's potential in autoregressive visual generation, marking a noteworthy advancement through its effective yet uncomplicated training paradigm.
Theoretical and Practical Implications
Theoretically, RAR exemplifies how bidirectional context can be effectively harnessed within autoregressive models without introducing substantial modifications that might compromise compatibility with language modeling. Practically, this approach paves the way for unified multimodal models that maintain the simplicity of autoregressive formulations while embracing the complexity of visual data. The seamless integration and improved performance metrics make RAR a compelling candidate for further applications across multimodal generation and tasks requiring scalable AI systems.
Future Prospects in AI Development
RAR's framework holds promise for future AI developments focused on unifying diverse modalities. By enhancing autoregressive models' proficiency in handling visual data, RAR could influence the design of models that must handle both language and vision tasks. The methodology also opens avenues for deeper exploration into high-efficiency, high-performance models that capitalize on the intrinsic benefits of bidirectional context.
Conclusion
In sum, the Randomized AutoRegressive modeling strategy enriches the domain of autoregressive visual generation by marrying the adaptability of autoregressive frameworks with sophisticated bidirectional context training. This research accentuates the potential for substantial advancements in AI models accommodating large-scale vision and potentially other modality-specific tasks, without undermining the models' foundational principles.