- The paper introduces M-VAR, decoupling intra-scale and inter-scale modeling to balance computational efficiency and high-fidelity image generation.
- It combines bidirectional self-attention for local interactions with a linear-complexity Mamba mechanism for capturing long-range dependencies.
- The model achieves an ImageNet FID score of 1.78 at 256×256 resolution, outperforming previous autoregressive and diffusion-based approaches with fewer parameters.
Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation
The paper “M-VAR: Decoupled Scale-Wise Autoregressive Modeling for High-Quality Image Generation” presents an advancement in autoregressive models, specifically tailored for image generation. The authors introduce M-VAR, a new framework that decouples the computation of intra-scale and inter-scale modeling components in autoregressive image models to enhance efficiency and performance.
The crux of the authors' approach is to retain the traditional strengths of autoregressive models, such as the ability to capture intricate spatial dependencies within images, while addressing the computational inefficiencies of modeling across multiple scales. Existing autoregressive frameworks like VAR treat generation as a sequence of next-scale predictions, which requires both intra-scale modeling (capturing local spatial dependencies within a token map) and inter-scale modeling (capturing global relationships across scales). The authors observe an imbalance between these two components: intra-scale interactions account for the bulk of the attention distribution yet are comparatively cheap to compute, whereas inter-scale modeling, though crucial for global coherence, contributes far less to the attention distribution while dominating the computational cost.
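To make this imbalance concrete, the sketch below compares the cost of full self-attention over a concatenated multi-scale token sequence against a decoupled scheme: quadratic attention restricted to each scale plus one linear-complexity pass over the whole sequence. The scale side-lengths are illustrative only, not the actual schedule used by VAR or M-VAR.

```python
# Illustrative cost comparison for scale-wise autoregressive modeling.
# Scale side-lengths are hypothetical coarse-to-fine token-map sizes.
scales = [1, 2, 4, 8, 16]            # token-map side lengths (assumed)
tokens = [s * s for s in scales]     # tokens per scale
total = sum(tokens)                  # length of the concatenated sequence

# Full self-attention over the concatenated sequence: O(total^2).
full_attn_cost = total * total

# Decoupled: quadratic attention only within each scale, plus a
# linear-complexity pass (e.g. a state-space model) over all tokens.
intra_cost = sum(t * t for t in tokens)
inter_cost = total                   # linear in sequence length
decoupled_cost = intra_cost + inter_cost

print(full_attn_cost, decoupled_cost)  # → 116281 70246
```

Even in this toy setting, restricting quadratic attention to individual scales cuts the dominant cost term substantially, and the gap widens as finer scales are added.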
By introducing a decoupled computation structure, M-VAR retains traditional bidirectional self-attention for intra-scale interactions and employs a linear-complexity mechanism, Mamba, for inter-scale dependencies. This dual mechanism addresses the inefficiencies identified in the VAR model and strikes a balance between computational cost and modeling fidelity.
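As a rough illustration of this dual mechanism, the sketch below applies unmasked (bidirectional) self-attention within each scale's token map, then a simple linear-time recurrence across the concatenated coarse-to-fine sequence. The recurrence is only a linear-complexity stand-in for Mamba's selective state-space layer, and all names, shapes, and the decay parameter are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_scale_attention(x):
    # Bidirectional self-attention within a single scale (no causal mask).
    # Projection weights are omitted; q = k = v = x in this toy sketch.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def inter_scale_linear_scan(x, decay=0.9):
    # Linear-time stand-in for a Mamba-style state-space layer:
    # a simple exponential-decay recurrence over the token sequence.
    # This is NOT the actual Mamba mechanism, only a placeholder
    # with the same O(n) complexity profile.
    h = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for i, tok in enumerate(x):
        h = decay * h + (1 - decay) * tok
        out[i] = h
    return out

def decoupled_block(scale_token_maps):
    # 1) quadratic attention restricted to each scale's token map
    local_out = [intra_scale_attention(s) for s in scale_token_maps]
    # 2) one linear pass over the concatenated coarse-to-fine sequence
    seq = np.concatenate(local_out, axis=0)
    return inter_scale_linear_scan(seq)

rng = np.random.default_rng(0)
# Toy token maps for scales 1x1, 2x2, 4x4 with 8-dim embeddings.
maps = [rng.normal(size=(s * s, 8)) for s in (1, 2, 4)]
out = decoupled_block(maps)
print(out.shape)  # → (21, 8)
```

The key design point is that the quadratic cost of attention is confined to each scale's own tokens, while cross-scale information flows through a mechanism whose cost grows only linearly with the full sequence length.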
Experimental Results and Implications
The authors provide substantial empirical evidence for the efficacy of M-VAR. The model achieves a Fréchet Inception Distance (FID) of 1.78 on ImageNet at 256×256 resolution, surpassing previous state-of-the-art autoregressive models, including VAR and LlamaGen, as well as well-known diffusion models such as LDM and DiT. Importantly, M-VAR achieves these results with improved computational efficiency, offering faster inference with fewer parameters than rival models.
The implications of this research are significant for both theoretical development and practical application. The refinement of autoregressive modeling methods to achieve efficient high-fidelity image generation may influence future architectures in computer vision, particularly in domains that prioritize both efficiency and quality. Additionally, the integration of the Mamba mechanism for inter-scale modeling showcases potential advancements in handling long-range dependencies in sequential data, with potential applications beyond image generation.
Future Directions
While M-VAR represents a considerable advancement, further research could explore adapting the model for various generative tasks beyond image synthesis, such as video or 3D content generation. Analyzing the applicability of decoupled scale-wise modeling in these complex domains might unlock new possibilities in generative models’ scalability and effectiveness.
In sum, this paper not only proposes a more efficient framework for autoregressive image generation but also contributes to the ongoing discourse on optimizing neural architectures for improved performance across scales. The results highlight the potential of combining advanced sequence modeling techniques like Mamba with traditional attention-based architectures to achieve state-of-the-art outcomes in high-quality image generation.