ZigMa: A DiT-style Zigzag Mamba Diffusion Model (2403.13802v3)

Published 20 Mar 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$ and UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be released at https://taohu.me/zigma/

Summary

The paper proposes a novel zigzag scanning mechanism to enhance spatial continuity in state-space models, enabling efficient 2D image and 3D video synthesis.
It employs stochastic interpolants and spatial-temporal factorization, and it outperforms Mamba-based baselines on benchmark image and video datasets.
The design builds in strong inductive biases about visual data, which supports scalability and high-fidelity synthesis in generative modeling applications.

Zigzag Mamba Diffusion Model: Advancing the Scalability of State-Space Models in Generative Systems

Introduction to the Zigzag Mamba (ZigMa) Approach

The Zigzag Mamba (ZigMa) diffusion model adapts State-Space Models (SSMs) to generative image and video synthesis. Its central observation is that extending Mamba, a type of SSM, to visual data requires preserving spatial continuity in the order in which tokens are scanned. The paper proposes zigzag scan schemes for 2D images and an extension to 3D video, exploiting the inherent structure of visual data to improve generation quality and efficiency. It also studies ZigMa's scalability on higher-resolution images and on video within the Stochastic Interpolant framework.

Background and Related Work

Mamba Overview: Mamba extends SSMs by making their parameters input-dependent, and therefore time-varying, while keeping processing efficient through a parallel scan. It excels at 1D sequence modeling, but extending it to 2D and 3D data is difficult because a 1D scan must still capture spatial and temporal continuity. Earlier attempts, such as directly flattening 2D tokens or running scans in multiple directions, either overlook spatial continuity or incur additional computational cost.
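To make the "1D scan" concrete, here is a minimal sketch of a time-varying, diagonal state-space recurrence in NumPy. It is a simplification for illustration only: the function name `selective_scan_reference` and the argument layout are assumptions, the per-step parameters are passed in directly rather than computed from the input as Mamba does, and Mamba's discretisation and hardware-aware kernel are not reproduced.

```python
import numpy as np

def selective_scan_reference(x, a, b, c):
    """Sequential reference for h_t = a_t * h_{t-1} + b_t * x_t, y_t = <c_t, h_t>.

    x: (L,) input sequence.
    a, b, c: (L, N) per-step parameters for a diagonal state of size N
             (hypothetical layout; in Mamba these depend on the input).
    """
    L, N = a.shape
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):               # one pass over the sequence: cost linear in L
        h = a[t] * h + b[t] * x[t]   # state update uses only the previous state
        y[t] = c[t] @ h              # read-out
    return y
```

Because each state depends only on the previous one, the order in which 2D or 3D tokens are fed into the scan decides which tokens are "recent" to each other, which is why the scan scheme matters for visual data.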

Transformers in Diffusion Models: The adaptability and modality-agnostic nature of transformer-based structures inspire ZigMa's DiT-style design. Transformers, however, are limited by the quadratic complexity of attention in sequence length. ZigMa keeps the transformer-inspired, scalable design but incorporates Mamba's linear-complexity SSM to address this cost.

The Zigzag Mamba Mechanism

Zigzag Scanning for Spatial Continuity: ZigMa introduces a zigzag scanning scheme to preserve spatial continuity in 2D image processing. Input features are rearranged along a zigzag path before they enter the Mamba block, so that features adjacent in the 1D sequence are also adjacent in the image. This lets the model capture spatial relations more effectively, and, per the abstract, the scheme is plug-and-play and adds no parameters.
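The idea can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code: the helper names (`raster_order`, `zigzag_order`, `max_jump`, `apply_scan`, `undo_scan`) are hypothetical, and a simple row-wise snake is assumed as the zigzag path.

```python
import numpy as np

def raster_order(h, w):
    """Row-major scan: every row is read left to right."""
    return np.arange(h * w)

def zigzag_order(h, w):
    """Snake scan: even rows left to right, odd rows right to left."""
    idx = np.arange(h * w).reshape(h, w)
    idx[1::2] = idx[1::2, ::-1].copy()   # reverse every other row
    return idx.reshape(-1)

def max_jump(order, w):
    """Largest Manhattan distance between consecutive tokens of a scan."""
    rows, cols = np.divmod(order, w)
    steps = np.abs(np.diff(rows)) + np.abs(np.diff(cols))
    return int(steps.max())

def apply_scan(tokens, order):
    """Reorder tokens of shape (batch, length, dim) along the scan path."""
    return tokens[:, order]

def undo_scan(tokens, order):
    """Restore the original token layout after the sequence model has run."""
    return tokens[:, np.argsort(order)]

print(zigzag_order(2, 3).tolist())          # [0, 1, 2, 5, 4, 3]
print(max_jump(raster_order(4, 4), w=4))    # 4: the scan jumps back at each row end
print(max_jump(zigzag_order(4, 4), w=4))    # 1: every step moves to a neighbor
```

Since the scan is only an index permutation applied before the block and inverted after it, it introduces no learnable parameters.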

Extension to 3D Modeling and Temporal Factorization: For video data, ZigMa factorizes the sequence into spatial and temporal components, so that space and time are modeled separately and each scan stays short. The paper uses this to show that the approach adapts to 3D sequences.
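A rough sketch of what spatial-temporal factorization means at the tensor level follows. The function names are hypothetical and `causal_mean` is only a stand-in for a real sequence model such as a Mamba block; the order of the two passes is also an assumption made for illustration.

```python
import numpy as np

def causal_mean(seq):
    """Stand-in sequence mixer over axis 1 of (batch, length, dim): running mean."""
    steps = np.arange(1, seq.shape[1] + 1).reshape(1, -1, 1)
    return np.cumsum(seq, axis=1) / steps

def factorized_block(video, mixer=causal_mean):
    """video: (B, T, S, D), with S = H * W spatial tokens per frame."""
    B, T, S, D = video.shape
    # Spatial pass: each frame is an independent sequence of S tokens.
    x = mixer(video.reshape(B * T, S, D)).reshape(B, T, S, D)
    # Temporal pass: each spatial location is an independent sequence of T tokens.
    x = x.transpose(0, 2, 1, 3).reshape(B * S, T, D)
    x = mixer(x).reshape(B, S, T, D).transpose(0, 2, 1, 3)
    return x
```

With this layout the model processes sequences of length S and T rather than one sequence of length T * S, and a 2D scan scheme can be applied inside the spatial pass unchanged.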

Diffusion Framework Utilization: ZigMa is developed within the Stochastic Interpolant framework, which unifies flow-based and diffusion-based generative models. This theoretically grounded framework lets ZigMa be applied across a range of settings, from high-resolution images to video sequences.
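As a hedged illustration of the framework, the sketch below builds one training example with a linear interpolant between noise and data. The linear path and the velocity target are one common choice, assumed here for illustration; this summary does not state which interpolant coefficients the paper uses, and `interpolant_pair` is a hypothetical helper name.

```python
import numpy as np

def interpolant_pair(x_data, rng):
    """Sample (x_t, t, target) for a linear interpolant (assumed form).

    x_t    = (1 - t) * noise + t * x_data
    target = d x_t / d t = x_data - noise   (velocity a network would regress)
    """
    noise = rng.standard_normal(x_data.shape)
    # One time value per sample, broadcast over the remaining axes.
    t = rng.uniform(size=(x_data.shape[0],) + (1,) * (x_data.ndim - 1))
    x_t = (1.0 - t) * noise + t * x_data
    velocity = x_data - noise
    return x_t, t, velocity
```

A backbone such as ZigMa would take `x_t` and `t` as input and be trained to predict the target, for example with a mean-squared error; samples are then generated by integrating the learned velocity from noise toward data.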

Empirical Results and Analysis

Efficiency in Spatial Continuity: ZigMa exploits spatial continuity through several zigzag scan patterns. According to the paper, this design choice improves generation quality over Mamba-based baselines and improves speed and memory utilization relative to transformer-based baselines.

Robust Performance Across Modalities: Evaluated on image datasets such as FacesHQ, MultiModal-CelebA-HQ, and MS COCO, and on UCF101 for video, ZigMa outperforms the Mamba-based baselines it is compared with. Varying the scan pattern from layer to layer adds to its robustness on complex, high-resolution datasets.
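One way to picture scan patterns that change across layers is sketched below. The construction is an assumption for illustration: it derives eight spatially continuous scans from a row-wise snake by transposing and flipping the grid, and cycles through them by layer index. The names `zigzag_variants`, `scheme_for_layer`, and `is_continuous`, as well as the cycling rule, are hypothetical.

```python
import numpy as np

def zigzag_variants(h, w):
    """Eight snake scans: row- or column-major, starting from each grid corner."""
    grid = np.arange(h * w).reshape(h, w)
    variants = []
    for transpose in (False, True):
        for flip_rows in (False, True):
            for flip_cols in (False, True):
                g = grid.T if transpose else grid
                g = g[::-1] if flip_rows else g
                g = g[:, ::-1] if flip_cols else g
                g = g.copy()
                g[1::2] = g[1::2, ::-1].copy()   # snake: reverse every other line
                variants.append(g.reshape(-1))
    return variants

def scheme_for_layer(layer_index, variants):
    """Assumed rule: layer i uses variant i modulo the number of variants."""
    return variants[layer_index % len(variants)]

def is_continuous(order, w):
    """True if consecutive tokens are always grid neighbors."""
    rows, cols = np.divmod(order, w)
    steps = np.abs(np.diff(rows)) + np.abs(np.diff(cols))
    return bool(np.all(steps == 1))
```

Each variant stays spatially continuous while starting from a different corner or direction, so stacking layers exposes the model to several traversal orders without extra parameters.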

Inductive Bias Incorporation: ZigMa's design around spatial and temporal continuity illustrates the value of building strong inductive biases about visual data into the model. These biases help it capture the fine detail needed for high-fidelity image and video synthesis.

Conclusions and Future Directions

The Zigzag Mamba Diffusion Model advances the scalability and applicability of State-Space Models for generative tasks. By addressing spatial continuity and extending the method to 3D data, ZigMa lays groundwork for further work on detailed image and video generation. Its results across several datasets and resolutions motivate further research into SSMs within generative models, with potential gains in efficiency, scalability, and fidelity of synthetic visual content.

Related Papers

A Survey on Visual Mamba (2024)
Visual Mamba: A Survey and New Outlooks (2024)
Vision Mamba: A Comprehensive Survey and Taxonomy (2024)
Soft Masked Mamba Diffusion Model for CT to MRI Conversion (2024)
MambaVision: A Hybrid Mamba-Transformer Vision Backbone (2024)
