ARFlow: Autoregressive Flow with Hybrid Linear Attention (2501.16085v2)

Published 27 Jan 2025 in cs.CV

Abstract: Flow models are effective at progressively generating realistic images, but they generally struggle to capture long-range dependencies during the generation process because they compress all the information from previous time steps into a single corrupted image. To address this limitation, we propose integrating autoregressive modeling, known for its strength in modeling complex, high-dimensional joint probability distributions, into flow models. During training, at each step we construct causally-ordered sequences by sampling multiple images from the same semantic category and applying different levels of noise, where images with higher noise levels serve as causal predecessors to those with lower noise levels. This design enables the model to learn broader category-level variations while maintaining proper causal relationships in the flow process. During generation, the model autoregressively conditions on the previously generated images from earlier denoising steps, forming a contextual and coherent generation trajectory. Additionally, we design a customized hybrid linear attention mechanism tailored to our modeling approach to enhance computational efficiency. Our approach, termed ARFlow, achieves an FID of 6.63 on ImageNet at 256×256 without classifier-free guidance and reaches an FID of 1.96 with a classifier-free guidance scale of 1.5, outperforming the 2.06 FID of the previous flow-based model SiT. Extensive ablation studies demonstrate the effectiveness of our modeling strategy and chunk-wise attention design.
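
The abstract describes the training-time sequence construction only at a high level, so the sketch below is a minimal, hypothetical illustration of that idea in PyTorch: several same-category images are corrupted with different noise levels and ordered from most to least noised, so that noisier images act as causal predecessors. The function name, the linear-interpolation corruption (borrowed from rectified-flow-style models such as SiT), and all shapes are assumptions rather than the authors' implementation.

```python
import torch

def build_causal_sequence(images: torch.Tensor, seq_len: int):
    """Build one causally-ordered training sequence (hypothetical sketch).

    `images` holds several samples from the same semantic category,
    shape (N, C, H, W) with N >= seq_len. Each selected image receives its
    own noise level, and the sequence is ordered from most-noised to
    least-noised, so noisier images serve as causal predecessors.
    """
    x = images[:seq_len]  # a few images from the same category

    # One noise level per image, sorted high -> low (t = 1 is pure noise).
    t = torch.rand(seq_len).sort(descending=True).values
    eps = torch.randn_like(x)

    # Assumed corruption: linear interpolation as in rectified-flow models;
    # the paper may use a different noise schedule.
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * x + t_ * eps

    return x_t, t

# Hypothetical usage: four same-category 3x64x64 images -> a length-3 sequence.
batch = torch.randn(4, 3, 64, 64)
seq, levels = build_causal_sequence(batch, seq_len=3)
print(seq.shape, levels)  # torch.Size([3, 3, 64, 64]), descending noise levels
```

Per the abstract, generation would then condition each denoising step autoregressively on the earlier (noisier) elements of such a sequence; the hybrid linear attention that makes this efficient is not shown here.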
