- The paper introduces AiM, an autoregressive model that employs the Mamba state-space framework to enhance image quality and inference speed.
- It integrates architectural enhancements, namely absolute positional encoding and group adaptive layer normalization, to capture the 2D spatial structure of image data within a 1D sequence model.
- Evaluated on ImageNet1K, the AiM model with 1.3B parameters achieves a FID of 2.21, underscoring its scalability and practical applicability.
Scalable Autoregressive Image Generation with Mamba: An Expert Overview
The paper "Scalable Autoregressive Image Generation with Mamba" introduces AiM, an autoregressive (AR) image generation model leveraging Mamba, a novel state-space model (SSM). The authors propose AiM as an alternative to Transformer-based AR models for superior generation quality and enhanced inference speed. This research aims to exploit Mamba's efficient sequence modeling capabilities while addressing specific challenges in visual generative tasks.
Key Contributions
- Novel Application of Mamba in AR Image Generation: AiM utilizes Mamba, which was originally developed for long-sequence modeling and offers linear time complexity. Unlike existing methods that modify Mamba for two-dimensional signals, AiM adheres to the standard next-token prediction paradigm. This allows the model to be deployed without extensive architectural changes, preserving the core strengths of Mamba.
- Adapting Architectural Enhancements:
- Positional Encoding (PE): To tackle the challenge of accurately modeling the spatial properties inherent in image data, the authors introduce absolute positional encoding. This addition helps overcome issues like "mirrored artifacts" often seen in generated images without positional awareness.
- Group Adaptive Layer Normalization (adaLN-Group): The authors propose a hierarchical approach to adaLN by grouping layers, thus optimizing the trade-off between performance and parameter count. This method generalizes both the global adaLN-single and layer-specific adaLN approaches, offering a balanced and efficient alternative for class-conditional generation.
- Evaluation and Performance Metrics: On the ImageNet1K 256×256 benchmark, AiM achieves a Fréchet Inception Distance (FID) of 2.21 with the 1.3B parameter model, surpassing other AR models of similar scales while demonstrating competitive performance against state-of-the-art diffusion models. Notably, AiM provides significant inference speed advantages, further establishing its practical applicability.
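The positional-encoding contribution above can be illustrated with a small sketch. The paper does not specify the exact encoding form here, so this example assumes the standard sinusoidal absolute encoding; the grid size, dimension, and variable names are illustrative. The point is that each position in the raster-flattened token sequence receives a distinct embedding, letting the model tell an end-of-row token from a start-of-next-row token.

```python
import math

def sinusoidal_pe(num_positions, dim):
    """Standard sinusoidal absolute positional encodings (illustrative choice)."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# Flatten a toy 4x4 grid of image-token embeddings in raster order and add PE,
# so positions that are spatially far apart (e.g. row boundaries) get distinct
# representations -- the kind of awareness that mitigates mirrored artifacts.
h, w, d = 4, 4, 8
tokens = [[0.0] * d for _ in range(h * w)]   # placeholder token embeddings
pe = sinusoidal_pe(h * w, d)
inputs = [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pe)]
```

Because the token embeddings here are zeros, `inputs` is just the encoding table itself; in the real model the encodings are summed with learned (quantized) image-token embeddings before the Mamba layers.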
Methodological Insights
Mamba Framework for Sequence Modeling
Mamba is a state-space model designed for efficient long-sequence modeling with linear complexity. It processes sequences by discretizing the continuous state-space parameters and evaluating the resulting linear recurrence step by step. This methodology aligns well with autoregressive language models, where sequential token prediction is crucial. By extending Mamba to image generation, the authors leverage its strengths in sequence tasks while addressing the unique spatial requirements of image data.
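The discretize-then-recur step can be sketched in a minimal scalar form. This is not AiM's (or Mamba's) actual selective, input-dependent parameterization; it only shows the zero-order-hold discretization and the linear-time recurrent scan that underlie SSMs, with symbols A, B, C, and delta following the generic SSM formulation.

```python
import math

def ssm_scan(xs, A=-1.0, B=1.0, C=1.0, delta=0.1):
    """Scalar state-space recurrence with zero-order-hold discretization.

    Continuous system:  h'(t) = A h(t) + B x(t),   y(t) = C h(t)
    Discretization:     A_bar = exp(delta * A)
                        B_bar = ((A_bar - 1) / A) * B
    """
    A_bar = math.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    h, ys = 0.0, []
    for x in xs:                       # one step per token: linear in length
        h = A_bar * h + B_bar * x      # state update
        ys.append(C * h)               # readout
    return ys

# Impulse response: a single unit input whose effect decays geometrically.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
```

In Mamba proper, A, B, C, and delta are learned, delta and B, C depend on the input (the "selective" mechanism), and the scan runs over high-dimensional hidden states, but the recurrence has this same shape, which is why generation cost stays linear in sequence length.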
Adapting to Visual Data
The research highlights two major architectural enhancements:
- Positional Encoding (PE): To handle images as sequences effectively, absolute positional encodings are added so the model can track spatial positions within the flattened image sequence. This addition mitigates issues like mirrored artifacts, which arise when the model misinterprets spatial transitions such as row boundaries.
- Group Adaptive Layer Normalization (adaLN-Group): By partitioning layers into groups, each with shared parameters regressed from conditional information, adaLN-Group balances parameter efficiency and performance. This hierarchical approach generalizes previous methods, ensuring adaptability across different model scales.
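The grouping idea above can be made concrete with a simplified sketch. The class name, the layer-to-group mapping, and the scalar (rather than per-channel) scale and shift are all illustrative assumptions, not the paper's implementation; what the sketch shows is that layers within a group share one conditioning regressor, so `num_groups = 1` recovers adaLN-single and `num_groups = n_layers` recovers per-layer adaLN.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Plain layer normalization over a 1D feature vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

class AdaLNGroup:  # hypothetical name for illustration
    """Layers share a (scale, shift) regressor within each group."""

    def __init__(self, n_layers, num_groups, dim, seed=0):
        rng = random.Random(seed)
        self.n_layers, self.num_groups = n_layers, num_groups
        # One linear regressor per group, mapping condition -> (scale, shift).
        self.regressors = [
            ([rng.gauss(0, 0.02) for _ in range(dim)],   # scale weights
             [rng.gauss(0, 0.02) for _ in range(dim)])   # shift weights
            for _ in range(num_groups)
        ]

    def __call__(self, x, cond, layer):
        # Consecutive layers map to the same group and share parameters.
        group = layer * self.num_groups // self.n_layers
        w_scale, w_shift = self.regressors[group]
        scale = 1.0 + sum(w * c for w, c in zip(w_scale, cond))
        shift = sum(w * c for w, c in zip(w_shift, cond))
        return [scale * v + shift for v in layer_norm(x)]

ada = AdaLNGroup(n_layers=8, num_groups=4, dim=4)
out = ada([1.0, 2.0, 3.0, 4.0], cond=[0.5, -0.5, 0.1, 0.2], layer=3)
```

The parameter saving is the point: with `g` groups the conditioning regressors cost `g` parameter sets instead of one per layer, which is how adaLN-Group trades a small amount of per-layer flexibility for a smaller model.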
Experimental Evaluation and Results
AiM models, ranging from 148M to 1.3B parameters, are rigorously evaluated on the ImageNet1K 256×256 dataset. The results indicate that larger models and longer training epochs significantly enhance image generation quality, with smaller models like AiM-B achieving a FID of 3.5. The scalability of AiM is evident from its performance improvements with increased model size and training compute, suggesting that further scaling can yield even more substantial gains.
Implications and Future Directions
The implications of this research are substantial for the field of autoregressive image generation. AiM sets a precedent for leveraging state-space models like Mamba in visual tasks, demonstrating that architectural innovations can significantly advance both the efficiency and quality of image generation models. This work emphasizes the potential for integrating Mamba's efficient sequence modeling with advanced AR techniques, paving the way for broader applications in visual generative tasks.
Future research could explore:
- Text-to-Image Generation: Extending AiM to handle more complex conditional inputs, such as textual descriptions, could more effectively unify visual generation with large language models.
- Further Efficiency Enhancements: Investigating additional methods to streamline autoregressive models, potentially integrating more scalable and efficient architectures.
The paper provides a compelling case for the practical and theoretical advancements in AR image generation, suggesting that the integration of state-space models presents a viable path forward for highly efficient and scalable generative models.