Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models (2404.04478v1)

Published 6 Apr 2024 in cs.CV

Abstract: Transformers have catalyzed advancements in computer vision and NLP. However, their substantial computational complexity limits their application in long-context tasks, such as high-resolution image generation. This paper introduces a series of architectures adapted from the RWKV model used in NLP, with the requisite modifications for diffusion models applied to image generation tasks, referred to as Diffusion-RWKV. Like diffusion models built on Transformers, our model is designed to efficiently handle patchified inputs in a sequence with extra conditions, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage lies in its reduced spatial aggregation complexity, rendering it exceptionally adept at processing high-resolution images and eliminating the need for windowing or group cached operations. Experimental results on both conditional and unconditional image generation tasks demonstrate that Diffusion-RWKV achieves performance on par with or surpassing existing CNN- or Transformer-based diffusion models in FID and IS metrics, while significantly reducing total FLOP usage.

An Overview of Diffusion-RWKV: Enhancing Efficiency in Diffusion Models for Image Generation

This paper presents a novel approach to improving the efficiency and scalability of diffusion models used for high-resolution image generation by introducing RWKV-like architectures. The authors propose Diffusion-RWKV, a model that retains the fundamental characteristics of the RWKV architecture but adapts it specifically for image generation tasks within diffusion frameworks. This work addresses the computational complexity that limits traditional transformer-based models in handling long-context tasks, offering an alternative that scales effectively with large datasets and extensive parameters.

Architectural Innovations and Efficiency Improvements

The authors highlight the challenges faced by existing transformer models due to their quadratic computational complexity, which poses significant barriers in long-sequence processing, particularly in high-resolution image generation. They propose Diffusion-RWKV as a solution, leveraging the RWKV architecture's capacity to manage dependencies in sequences with linear computational complexity.
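
For intuition about where the linear cost comes from, consider the WKV recurrence at the heart of RWKV-style token mixing: each token updates a fixed-size running state instead of attending to every previous token. The following numpy sketch shows the naive (unstabilized) causal form of that recurrence; variable names are illustrative, and real implementations add numerical safeguards and, for the bidirectional Bi-RWKV used here, a second backward scan.

```python
import numpy as np

def wkv_recurrence(k, v, w, u):
    """Naive causal WKV recurrence: O(T) in sequence length T,
    versus the O(T^2) pairwise cost of softmax self-attention.

    k, v : (T, C) key and value sequences
    w    : (C,) positive per-channel decay rate
    u    : (C,) per-channel "bonus" weight for the current token
    """
    T, C = k.shape
    a = np.zeros(C)              # decayed weighted sum of past values
    b = np.zeros(C)              # decayed sum of past weights
    out = np.empty((T, C))
    for t in range(T):
        e_cur = np.exp(u + k[t])                   # weight of the current token
        out[t] = (a + e_cur * v[t]) / (b + e_cur)  # normalized mix of past + current
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]   # fold the token into the state
        b = np.exp(-w) * b + np.exp(k[t])
    return out
```

Because the state (a, b) has a fixed size, the cost grows linearly with the number of image patches, which is what allows long high-resolution token sequences to be processed without windowing.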

Diffusion-RWKV operates on patch embeddings through stacked Bi-RWKV layers and incorporates skip connections for efficient information flow, as sketched below. The model processes high-resolution images without relying on windowing or group cached operations, a common necessity for transformers in diffusion models. This design significantly reduces overall FLOP usage while matching or exceeding the performance benchmarks set by CNN- and Transformer-based diffusion models, as evidenced by FID and IS metrics.
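
The overall wiring can be sketched as follows. This is a hypothetical PyTorch skeleton, assuming a U-ViT-style long-skip arrangement and substituting a plain linear layer for the actual Bi-RWKV scan; class names, sizes, and the conditioning scheme are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class BiRWKVBlockStub(nn.Module):
    """Stand-in for a Bi-RWKV layer: in the paper this mixes tokens with
    forward and backward linear-cost scans; a Linear keeps the sketch runnable."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.Linear(dim, dim)
    def forward(self, x):
        return x + self.mix(self.norm(x))          # pre-norm residual block

class DiffusionRWKVSketch(nn.Module):
    """Hypothetical backbone: embed patches, run Bi-RWKV blocks, and join
    early/late blocks with long skip connections before predicting noise."""
    def __init__(self, patch=2, dim=256, depth=8):
        super().__init__()
        assert depth % 2 == 0
        self.embed = nn.Linear(3 * patch * patch, dim)
        self.blocks = nn.ModuleList([BiRWKVBlockStub(dim) for _ in range(depth)])
        self.fuse = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(depth // 2)])
        self.head = nn.Linear(dim, 3 * patch * patch)

    def forward(self, x_tokens, t_emb):
        # x_tokens: (B, L, 3*patch*patch) patchified noisy image
        # t_emb:    (B, dim) diffusion-timestep embedding (conditioning)
        h = self.embed(x_tokens) + t_emb[:, None, :]
        skips, half = [], len(self.blocks) // 2
        for i, blk in enumerate(self.blocks):
            if i < half:
                h = blk(h)
                skips.append(h)                    # save for the long skip
            else:
                h = self.fuse[i - half](torch.cat([h, skips.pop()], dim=-1))
                h = blk(h)
        return self.head(h)                        # per-patch noise prediction

# Usage: 64x64 image, patch size 2 -> L = 1024 tokens
model = DiffusionRWKVSketch()
x = torch.randn(4, (64 // 2) ** 2, 3 * 2 * 2)
t = torch.randn(4, 256)
print(model(x, t).shape)  # torch.Size([4, 1024, 12])
```

The long skips mirror the U-ViT design the paper benchmarks against; the essential difference is that token mixing inside each block is a linear-cost bidirectional scan rather than quadratic self-attention.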

Empirical Evaluation and Benchmarking

The paper provides a comprehensive empirical evaluation of Diffusion-RWKV models across multiple datasets, including CIFAR-10, CelebA 64×64, and ImageNet. The results show that Diffusion-RWKV achieves performance comparable to state-of-the-art diffusion models like DiT and U-ViT, with noted advantages in computational efficiency. Specifically, the model demonstrates superior FID scores with fewer parameters, exemplifying the practical benefits of the RWKV-based approach in diffusion models.

Promising Results and Implications

Diffusion-RWKV exhibits considerable promise as a viable alternative to current transformer architectures for high-resolution tasks. The model strikes a strong balance between computational efficiency and image generation quality, making it particularly relevant where traditional transformers are limited by their processing requirements.

Theoretical implications of this work suggest further exploration into non-transformer approaches for long-sequence modeling in vision tasks. Practically, the reduced computational demands of Diffusion-RWKV provide an attractive option for applications requiring real-time or resource-constrained image generation.
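
To make the scaling argument concrete, consider a hypothetical 512×512 input patchified with patch size 2; the numbers below are illustrative back-of-the-envelope figures, not measurements from the paper:

```latex
L = \left(\frac{512}{2}\right)^{2} = 65{,}536 \ \text{tokens}
\quad\Rightarrow\quad
\underbrace{\mathcal{O}(L^{2})}_{\text{self-attention}} \approx 4.3\times10^{9}\ \text{token pairs},
\qquad
\underbrace{\mathcal{O}(L)}_{\text{RWKV scan}} = 65{,}536\ \text{state updates}.
```

The gap between the two grows by a factor of L, which is precisely why linear-cost mixers become attractive in the high-resolution regime the paper targets.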

Future Perspectives

The paper presents Diffusion-RWKV as a robust framework that could lead to further developments in image synthesis. Potential directions for future research include hybrid architectures that combine the strengths of transformers and RWKV models, and extending RWKV to tasks beyond image generation. Additionally, integrating advanced training strategies or more careful hyperparameter tuning could further improve the model's capability and efficiency.

In conclusion, Diffusion-RWKV stands as a compelling contribution to the field, effectively addressing inefficiencies in diffusion models and setting a foundation for scalable and resource-efficient image generation frameworks.

Authors
  1. Zhengcong Fei
  2. Mingyuan Fan
  3. Changqian Yu
  4. Debang Li
  5. Junshi Huang