VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks (2403.00522v2)

Published 1 Mar 2024 in cs.CV

Abstract: LLMs are built on top of a transformer-based architecture to process textual inputs; LLaMA, for example, stands out among the many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modelling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms across a wide range of downstream tasks in image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding. Our code is released at https://github.com/Meituan-AutoML/VisionLLaMA.

Summary

  • The paper presents VisionLLaMA, a unified LLaMA-style backbone for vision tasks that employs AS2DRoPE to extend 1D rotary positional encoding to 2D, allowing it to handle varied image resolutions.
  • It integrates both plain and pyramid transformer architectures with supervised and self-supervised learning to flexibly adapt a text-centric model to image processing.
  • Experimental results demonstrate that VisionLLaMA outperforms conventional vision transformers in key benchmarks, including image generation, classification, segmentation, and detection.

VisionLLaMA: Bridging LLaMA to Vision Through a Versatile Transformer Architecture

Introduction

The advent of LLMs such as LLaMA has led to significant advances in natural language processing. VisionLLaMA brings these advances to the vision domain by adapting the LLaMA architecture to a wide range of vision tasks. The architecture comes in both plain and pyramid forms, allowing it to tackle image understanding and image generation efficiently. The paper demonstrates VisionLLaMA's superior performance over conventional vision transformers across several benchmarks, with particular strengths in image generation, classification, semantic segmentation, and object detection.

Methodology

VisionLLaMA adapts LLaMA's architecture to the vision domain through innovations such as auto-scaled 2D Rotary Position Embedding (AS2DRoPE), which extends LLaMA's rotary positional encoding from 1D to 2D. This adaptation accounts for the two-dimensional structure of images and supports variable resolutions, a critical requirement for vision tasks. The paper evaluates VisionLLaMA under two architectural schemes, plain and pyramid transformers, and across supervised and self-supervised training paradigms, demonstrating its flexibility and compatibility with existing transformer paradigms for vision tasks.
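
To make the adaptation concrete, here is a minimal sketch (not the authors' released code) of a LLaMA-style transformer block applied to image patch tokens: pre-RMSNorm, multi-head self-attention, and a SwiGLU feed-forward layer, the LLaMA components that VisionLLaMA carries over. All class names and dimensions are illustrative assumptions; rotary position encoding, sketched below, would normally be applied to the queries and keys inside the attention.

```python
# Minimal, illustrative LLaMA-style block for image patch tokens (PyTorch).
# A sketch of the general design, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer normalization, as used in LLaMA."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the inverse RMS of the features, then apply a learned gain.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


class SwiGLU(nn.Module):
    """Gated feed-forward layer (SiLU gate), as in LLaMA's MLP."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class LLaMAStyleVisionBlock(nn.Module):
    """Pre-norm self-attention plus SwiGLU MLP over a sequence of patch embeddings."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=int(dim * 8 / 3))

    def forward(self, tokens):
        # tokens: (batch, num_patches, dim) image patch embeddings.
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        return tokens + self.mlp(self.norm2(tokens))
```

The pyramid variant described in the paper would stack such blocks in stages at decreasing spatial resolution, in the spirit of hierarchical vision transformers; that staging is not shown here.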

These implementation details are central to understanding how VisionLLaMA addresses the inherent challenges of adapting a text-centric model architecture to image-related tasks. In particular, AS2DRoPE is a notable contribution that allows the model to handle images of arbitrary resolution effectively.
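
As a rough illustration of the auto-scaling idea, the sketch below applies a 2D rotary embedding in which row and column coordinates each drive half of the channel pairs and are rescaled to the grid size seen during training. This is a hedged approximation of AS2DRoPE rather than the paper's exact formulation; the frequency layout, the scaling rule, and names such as base_grid are assumptions made for illustration.

```python
# Illustrative 2D rotary position embedding with auto-scaled coordinates.
# An approximation of the AS2DRoPE idea, not the paper's exact formulation.
import torch


def rope_2d(q: torch.Tensor, grid_h: int, grid_w: int,
            base_grid: int = 14, theta: float = 10000.0) -> torch.Tensor:
    """Rotate query or key features by their 2D patch positions.

    q: (batch, grid_h * grid_w, dim), with dim divisible by 4 so that half of
    the channel pairs encode the row and half encode the column.
    """
    b, n, dim = q.shape
    assert n == grid_h * grid_w and dim % 4 == 0
    quarter = dim // 4

    # Auto-scale coordinates so a larger inference-time grid maps back onto
    # the coordinate range used during training (resolution extrapolation).
    ys = torch.arange(grid_h, dtype=torch.float32) * (base_grid / grid_h)
    xs = torch.arange(grid_w, dtype=torch.float32) * (base_grid / grid_w)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")            # (H, W) each
    pos = torch.stack([yy.flatten(), xx.flatten()], dim=-1)   # (N, 2)

    freqs = theta ** (-torch.arange(quarter, dtype=torch.float32) / quarter)
    # Row positions drive the first half of channel pairs, column positions
    # drive the second half.
    ang = torch.cat([pos[:, :1] * freqs, pos[:, 1:] * freqs], dim=-1)  # (N, dim/2)
    cos, sin = ang.cos(), ang.sin()

    q_even, q_odd = q[..., 0::2], q[..., 1::2]                # (B, N, dim/2)
    out = torch.empty_like(q)
    out[..., 0::2] = q_even * cos - q_odd * sin
    out[..., 1::2] = q_even * sin + q_odd * cos
    return out
```

For instance, if training used a 14x14 patch grid and inference uses 28x28, the rescaled coordinates still span roughly the same range, so the rotation angles stay close to those seen during training.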

Experimental Results

VisionLLaMA's effectiveness is rigorously evaluated across a variety of representative vision tasks, where it consistently outperforms existing state-of-the-art vision transformers. Notably, VisionLLaMA demonstrates substantial gains in image generation tasks, showcasing its robust generative capabilities. Furthermore, its performance in image classification, segmentation, and detection tasks underlines its versatility and potential as a new baseline model for future research and applications in the vision domain.

Practical and Theoretical Implications

The introduction of VisionLLaMA has both practical and theoretical implications. Practically, its superior performance and flexibility make it a promising candidate for a wide range of applications, from enhancing existing vision systems to powering new, innovative tools. Theoretically, its success further validates the potential of adapting LLM architectures to non-language tasks, potentially opening avenues for similar cross-domain adaptations. Additionally, architectural innovations such as AS2DRoPE provide a framework for extending transformer models to more complex, multidimensional data across various domains.

Future Directions

VisionLLaMA's achievements pave the way for exciting future developments. One prospective avenue is the exploration of enhanced positional encoding schemes that could offer even greater efficiency and flexibility. Integrating VisionLLaMA into multimodal models that process both textual and visual inputs is another intriguing prospect for building more capable and versatile AI systems. Further refinements to the architecture and training paradigms, as well as the incorporation of feedback mechanisms, could also enhance its performance and applicability to a broader range of tasks.

In conclusion, VisionLLaMA represents a significant stride toward unified model architectures for processing diverse data types. Its success not only underscores the versatility of the LLaMA architecture but also sets a solid foundation for future interdisciplinary research in AI, potentially heralding a new era of cross-modal AI systems driven by versatile, efficient, and powerful unified models.
