VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks (2403.00522v2)
Abstract: Large language models are built on top of transformer-based architectures to process textual inputs; LLaMA, for example, stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modelling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms across a broad range of downstream tasks in image perception and, especially, image generation. In many cases, VisionLLaMA exhibits substantial gains over previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding. Our code is released at https://github.com/Meituan-AutoML/VisionLLaMA.
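To make "LLaMA-like vision transformer" concrete, below is a minimal PyTorch sketch of a plain (non-pyramid) block in this style, assuming the standard LLaMA components are reused for image patch tokens: pre-normalization with RMSNorm, a SwiGLU feed-forward layer, rotary position embeddings on the attention queries and keys, and no causal mask since image patches attend bidirectionally. Names, dimensions, and the RoPE interface are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of a LLaMA-style transformer block on image patch tokens.
# Component choices (RMSNorm, SwiGLU, RoPE) follow the standard LLaMA block;
# everything here is illustrative, not the released VisionLLaMA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Root-mean-square normalization: no mean subtraction, learned scale only.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


def apply_rope(x, cos, sin):
    # Rotary embedding: rotate feature pairs by position-dependent angles.
    # cos/sin have shape (num_patches, head_dim // 2) and broadcast over batch/heads.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_value = nn.Linear(dim, hidden, bias=False)
        self.w_out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))


class LLaMAStyleBlock(nn.Module):
    """Pre-norm block: RMSNorm -> self-attention with RoPE -> RMSNorm -> SwiGLU."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.norm1 = RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.norm2 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=int(dim * 8 / 3))

    def forward(self, x, cos, sin):
        # x: (batch, num_patches, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        attn = F.scaled_dot_product_attention(q, k, v)  # no causal mask for images
        x = x + self.proj(attn.transpose(1, 2).reshape(B, N, D))
        x = x + self.mlp(self.norm2(x))
        return x
```

In this sketch the input would be patch-embedded image tokens from a standard ViT stem; how the rotary angles are generated over the 2D patch grid, and how the pyramid variant downsamples between stages, are details of the paper not reproduced here.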