FiT: Flexible Vision Transformer for Diffusion Model (2402.12376v4)
Abstract: Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at https://github.com/whlzy/FiT.
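The core idea above, treating an image as a variable-length sequence of tokens rather than a fixed-resolution grid, can be illustrated with a short sketch. This is not the official FiT implementation: the function name, the patch size of 2, and the `max_len` packing budget of 256 are illustrative assumptions. The sketch patchifies a latent of arbitrary height and width into one token per patch, pads the dynamic-length sequence to a fixed maximum length, and returns a boolean mask plus per-token 2-D (row, column) positions of the kind a 2-D positional scheme such as rotary embeddings could consume.

```python
# Minimal sketch (not the official FiT code): an image latent of arbitrary
# H x W becomes a variable-length token sequence, padded to a fixed budget
# with an attention mask. patch_size=2 and max_len=256 are assumptions here.
import torch


def patchify_to_sequence(latent: torch.Tensor, patch_size: int = 2, max_len: int = 256):
    """latent: (C, H, W) with H and W divisible by patch_size."""
    c, h, w = latent.shape
    nh, nw = h // patch_size, w // patch_size

    # (C, nh, p, nw, p) -> (nh*nw, p*p*C): one token per patch, row-major order
    tokens = (
        latent.reshape(c, nh, patch_size, nw, patch_size)
        .permute(1, 3, 2, 4, 0)
        .reshape(nh * nw, patch_size * patch_size * c)
    )

    n = tokens.shape[0]
    assert n <= max_len, "image yields more tokens than the packing budget"

    # Pad the dynamic-length sequence to max_len; mask marks the real tokens
    padded = torch.zeros(max_len, tokens.shape[1], dtype=tokens.dtype)
    padded[:n] = tokens
    mask = torch.zeros(max_len, dtype=torch.bool)
    mask[:n] = True

    # 2-D (row, col) index of each token, usable by a 2-D positional embedding
    pos = torch.stack(
        torch.meshgrid(torch.arange(nh), torch.arange(nw), indexing="ij"), dim=-1
    ).reshape(-1, 2)
    return padded, mask, pos


# Usage: a 4-channel 24x40 latent (a 3:5 aspect ratio) yields 12*20 = 240 tokens,
# which fit inside the 256-token budget without any cropping or resizing.
latent = torch.randn(4, 24, 40)
tokens, mask, pos = patchify_to_sequence(latent)
print(tokens.shape, int(mask.sum()), pos.shape)
# torch.Size([256, 16]) 240 torch.Size([240, 2])
```

Because every image contributes only as many tokens as its own resolution requires, the same padded-sequence interface covers arbitrary aspect ratios at both training and inference time, which is the flexibility the abstract describes.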