FiT: Flexible Vision Transformer for Diffusion Model (2402.12376v4)

Published 19 Feb 2024 in cs.CV

Abstract: Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at https://github.com/whlzy/FiT.

Summary

  • The paper introduces a flexible training pipeline that preserves image aspect ratios by dynamically resizing images, eliminating cropping distortions.
  • The paper employs 2D Rotary Positional Embedding and SwiGLU within its novel transformer architecture to adeptly handle variable image sizes and maintain efficiency.
  • The paper demonstrates superior resolution extrapolation by generating high-quality images beyond the training distribution, setting new benchmarks on ImageNet.

Flexible Vision Transformer for Unrestricted Resolution Image Generation

Introduction

In the evolving landscape of image generation, models that generalize across arbitrary resolutions are in high demand. The recently introduced Flexible Vision Transformer (FiT) is a significant advance in this direction, fundamentally changing how images are represented and generated. By conceptualizing images as sequences of dynamically sized tokens, FiT removes the fixed-resolution constraint of earlier diffusion transformers and enables resolution-independent image synthesis.

Core Contributions

FiT introduces several innovative design elements, each contributing to its exceptional performance:

  • Flexible Training Pipeline: This approach preserves each image's original aspect ratio by dynamically resizing it to fit within a predefined token limit, eliminating the need for cropping or disproportionate scaling (a minimal sketch of this resize-and-patchify step follows this list).
  • Novel Transformer Architecture: At its core, FiT incorporates 2D Rotary Positional Embedding (RoPE) and Swish-Gated Linear Unit (SwiGLU), enabling the model to adeptly handle variable image sizes and maintain efficiency across varying resolutions.
  • Resolution Extrapolation Method: Leveraging techniques from LLMs, FiT introduces a training-free extrapolation method, allowing for the generation of images at resolutions beyond those encountered during training.
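
To make the pipeline concrete, here is a minimal sketch of aspect-preserving resizing under a token budget, followed by patchification into a variable-length token sequence. The patch size, token budget, and function names are illustrative assumptions, not taken from the FiT repository.

```python
import math
import torch
import torch.nn.functional as F

def resize_to_token_budget(img: torch.Tensor, patch: int = 2, max_tokens: int = 256) -> torch.Tensor:
    """Downscale an image only if needed so that its patch grid fits the
    token budget, preserving the original aspect ratio (no cropping)."""
    _, h, w = img.shape  # (C, H, W)
    n_tokens = math.ceil(h / patch) * math.ceil(w / patch)
    if n_tokens > max_tokens:
        # Uniform scale factor bringing the patch count under the budget,
        # then floor each side to a multiple of the patch size.
        s = math.sqrt(max_tokens * patch * patch / (h * w))
        h, w = int(h * s) // patch * patch, int(w * s) // patch * patch
        img = F.interpolate(img[None], size=(max(h, patch), max(w, patch)),
                            mode="bicubic", align_corners=False)[0]
    return img

def patchify(img: torch.Tensor, patch: int = 2) -> torch.Tensor:
    """Flatten a (C, H, W) image into a variable-length (N, C*patch*patch)
    token sequence; N depends on the image's resolution and aspect ratio."""
    c, h, w = img.shape
    tiles = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
```

Because patchify returns a sequence whose length depends on the input resolution, batching requires padding shorter sequences and masking the padding out of attention, which is where the masked attention discussed under Architectural Innovations comes in.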

Experimental Insights

FiT exhibits remarkable versatility and performance across a broad spectrum of resolutions, as evidenced by rigorous experimental evaluation. Its advantages are most pronounced at resolutions and aspect ratios substantially different from the training distribution, where it outperforms state-of-the-art models by significant margins. For instance, in class-conditional image generation on ImageNet, FiT achieved leading FID scores at various resolutions, setting new benchmarks for image synthesis quality.

Architectural Innovations

A key aspect of FiT's success is its architectural improvements over its predecessors. Replacing standard multi-head self-attention (MHSA) with masked MHSA lets the model batch variable-length token sequences while excluding padding tokens from attention; the transition from the standard MLP to SwiGLU and the adoption of 2D RoPE further improve flexibility and efficiency. Together, these choices enable FiT to manage variable-length sequences and generate high-quality images across a diverse range of resolutions and aspect ratios.
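
For intuition, here is a compact sketch of the two drop-in modules: a SwiGLU feed-forward block (following Shazeer, 2020) and a 2D rotary embedding that rotates one half of each query/key vector by the token's row index and the other half by its column index. Splitting channels between axes is one common way to extend RoPE to two dimensions; treat the shapes and names below as assumptions rather than FiT's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W1) * (x W3), projected back by W2.
    Replaces the plain MLP in each transformer block."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

def rope_2d(q: torch.Tensor, coords: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to a (N, D) tensor of queries or keys,
    rotating the first D/2 channels by each token's row index and the last
    D/2 by its column index. coords: (N, 2) integer grid positions; D must
    be divisible by 4."""
    n, d = q.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, 2, dtype=torch.float32) / half)
    out = []
    for axis in (0, 1):  # 0 = row (y), 1 = column (x)
        ang = coords[:, axis:axis + 1].float() * freqs          # (N, half/2)
        qa = q[:, axis * half:(axis + 1) * half].view(n, -1, 2)  # channel pairs
        rot = torch.stack([qa[..., 0] * ang.cos() - qa[..., 1] * ang.sin(),
                           qa[..., 0] * ang.sin() + qa[..., 1] * ang.cos()], -1)
        out.append(rot.view(n, half))
    return torch.cat(out, dim=-1)
```

Because the rotation depends only on each token's (row, column) coordinate rather than on a fixed grid size, the same weights apply unchanged to any resolution or aspect ratio.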

Extrapolation Capabilities

FiT's resolution extrapolation process facilitates image generation beyond the confines of the training distribution. Using training-free interpolation methods adapted from the LLM literature, namely VisionNTK and VisionYaRN (two-dimensional extensions of NTK-aware scaled RoPE and YaRN), FiT synthesizes images at resolutions well beyond those seen during training. This allows FiT to adapt to and excel at producing images with arbitrary resolutions and aspect ratios, a feat not readily achievable by previous models.
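
As a rough illustration of the mechanism, the 1D NTK-aware recipe that VisionNTK builds on simply enlarges the RoPE frequency base by a factor tied to the extrapolation ratio, so that positions beyond the training range stay within the rotation angles the model saw during training. The sketch below shows that 1D rule applied once per spatial axis so non-square targets get their own factors; the paper's exact VisionNTK and VisionYaRN formulations may differ in detail, and the numbers here are illustrative.

```python
def ntk_scaled_base(base: float, rot_dim: int, scale: float) -> float:
    """NTK-aware scaled RoPE: enlarge the frequency base so positions up to
    `scale` times the training extent can be represented without fine-tuning.
    `rot_dim` is the number of rotary channels along one axis (even)."""
    return base * scale ** (rot_dim / (rot_dim - 2))

# E.g., extrapolating from 256x256 training to 512x768 inference: the height
# axis uses scale 512/256 = 2.0 and the width axis 768/256 = 3.0, each with
# its own rescaled base (rotary dimension of 32 per axis assumed).
base_h = ntk_scaled_base(10000.0, 32, 2.0)
base_w = ntk_scaled_base(10000.0, 32, 3.0)
```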

Future Directions

The introduction of FiT represents a significant step forward in the domain of image generation, particularly in the context of resolution and aspect ratio flexibility. Looking ahead, FiT's versatile architecture and innovative methodologies offer a promising foundation for further research and development. Potential future directions include the exploration of FiT's applicability to other domains beyond image generation, refinement of its extrapolation methods for even greater efficiency, and adaptation to leverage emerging computational paradigms.

In summary, the FiT model substantiates the feasibility of generating high-quality images across a vast spectrum of resolutions and aspect ratios, effectively addressing a longstanding challenge in the field. Its comprehensive design, coupled with exceptional performance, positions FiT as a pivotal model for future explorations in generative image synthesis.