VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
Abstract: Large language models are built on top of transformer-based architectures to process textual inputs. For example, LLaMA stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modelling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms across a wide range of downstream tasks in image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over the previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding. Our code is released at https://github.com/Meituan-AutoML/VisionLLaMA.
- Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
- Video generation models as world simulators. 2024.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
- Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- Deconstructing denoising diffusion models for self-supervised learning. arXiv preprint arXiv:2401.14404, 2024.
- Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
- Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
- Twins: Revisiting the design of spatial attention in vision transformers. In Adv. Neural Inform. Process. Syst., 2021.
- Conditional positional encodings for vision transformers. In The Eleventh International Conference on Learning Representations, 2023.
- MMSegmentation Contributors. Mmsegmentation: Openmmlab semantic segmentation toolbox and benchmark, 2020.
- MMPreTrain Contributors. Openmmlab’s pre-training toolbox and benchmark. https://github.com/open-mmlab/mmpretrain, 2023.
- Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
- Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 886–893. IEEE, 2005.
- Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
- Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
- Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
- Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
- Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- Stability AI. Stable Code 3B: Coding on the edge. https://stability.ai/, 2024.
- Deep networks with stochastic depth. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 646–661. Springer, 2016.
- Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 2005.
- How much position information do convolutional neural networks encode? In International Conference on Learning Representations, 2020.
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.
- Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
- Semmae: Semantic-guided masking for learning masked autoencoders. Advances in Neural Information Processing Systems, 35:14290–14302, 2022.
- Norm tweaking: High-performance low-bit quantization of large language models. In Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024.
- Exploring plain vision transformer backbones for object detection. In European Conference on Computer Vision, pages 280–296. Springer, 2022.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- Visual instruction tuning. NeurIPS, 2023.
- Improving pixel-based mim by reducing wasted modeling capability. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5361–5372, 2023.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Fit: Flexible vision transformer for diffusion model. arXiv preprint arXiv:2402.12376, 2024.
- Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740, 2024.
- Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021.
- OpenAI. GPT-4 technical report. Technical report, 2023.
- On aliased resizing and surprising subtleties in gan evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11410–11420, 2022.
- Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv, 2022.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
- Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
- Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
- Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
- Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
- Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063, 2023.
- Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
- Training data-efficient image transformers and distillation through attention. In International Conference on Machine Learning, volume 139, pages 10347–10357, July 2021.
- Deit iii: Revenge of the vit. In European Conference on Computer Vision, pages 516–533. Springer, 2022.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Convnet vs transformer, supervised vs clip: Beyond imagenet accuracy. arXiv preprint arXiv:2311.09215, 2023.
- Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021.
- Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14668–14678, 2022.
- Lenna: Language enhanced reasoning detection assistant. arXiv preprint arXiv:2312.02433, 2023.
- Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
- Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018.
- Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9653–9663, 2022.
- Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.
- Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
- Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
- Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
- Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
- mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
- Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024.
Explain it Like I'm 14
Overview
This paper introduces VisionLLaMA, a way to use the same kind of “brain” that powers popular LLMs (like LLaMA) to understand and create images. The big idea is to build one transformer-based model that works well across many vision tasks, from recognizing objects in photos to generating new pictures.
What questions did the researchers ask?
- Can the LLaMA-style transformer (built for text) be adapted to handle images, which are 2D and often come in different sizes?
- Can this single model design work across many vision jobs (like classification, segmentation, detection, and image generation)?
- Can it match or beat existing top vision models while keeping the architecture simple and easy to deploy?
How did they approach it?
Think of a transformer as a very smart librarian that looks at all parts of the input and figures out which parts are important to each other. For images, the researchers did the following:
Turning images into “tokens”
- Images are split into small squares (patches), like cutting a picture into a grid of puzzle pieces.
- These patches are then fed into the transformer, which decides which pieces matter to each other.
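To make this concrete, here is a minimal PyTorch sketch of patch embedding (a hypothetical `PatchEmbed` class; the actual VisionLLaMA code may differ in details such as patch size and embedding width):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution whose kernel and stride equal the patch size cuts and
        # projects each patch in a single step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) sequence of patch tokens

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```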
Two model shapes: plain and pyramid
- Plain transformer: Looks at the image at one scale (one zoom level). Simple and similar to the original LLaMA.
- Pyramid transformer: Looks at the image at multiple scales (zoomed-in details and zoomed-out context), which often helps for vision tasks. It mixes:
- Local self-attention (LSA): focuses on nearby patches (like looking closely at small areas).
- Global sub-sampled attention (GSA): looks at a summary of the whole image (like stepping back to see the big picture).
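To illustrate the second flavour (a rough sketch under simplifying assumptions, not the paper's exact block: the learned query/key/value projections are omitted and keys are tied to values), global sub-sampled attention lets every patch token attend to a small pooled summary grid instead of to all other tokens:

```python
import torch

def global_subsampled_attention(x, h, w, heads=8, pool=7):
    """Sketch of GSA: every token attends to a pooled summary grid, not to all tokens."""
    B, N, D = x.shape                                        # N == h * w patch tokens
    q = x.reshape(B, N, heads, D // heads).transpose(1, 2)   # (B, heads, N, d)

    grid = x.transpose(1, 2).reshape(B, D, h, w)             # put tokens back on a 2D grid
    summary = torch.nn.functional.avg_pool2d(grid, pool)     # coarse "big picture" tokens
    kv = summary.flatten(2).transpose(1, 2)                  # (B, M, D) with M << N
    k = kv.reshape(B, -1, heads, D // heads).transpose(1, 2)
    v = k                                                    # keys tied to values for brevity

    attn = (q @ k.transpose(-2, -1)) / (D // heads) ** 0.5   # (B, heads, N, M)
    out = attn.softmax(dim=-1) @ v                           # (B, heads, N, d)
    return out.transpose(1, 2).reshape(B, N, D)

x = torch.randn(2, 14 * 14, 768)
print(global_subsampled_attention(x, 14, 14).shape)          # torch.Size([2, 196, 768])
```

Local self-attention works the other way around: the grid is chopped into small windows and attention is computed only inside each window.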
Teaching the model where things are (positional encoding)
- Transformers need to know the position of each patch, or they’ll treat the image like a shuffled deck.
- The team extended “RoPE” (Rotary Positional Embedding), which worked in 1D for text, to 2D for images. Think of RoPE as tiny “direction arrows” attached to each patch, telling the model where it sits in the image.
- They introduced AS2DRoPE (Auto-Scaled 2D RoPE), which automatically adjusts these position signals when images are larger or smaller than the training size. This helps the model handle different image resolutions without retraining.
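A minimal sketch of this idea, assuming AS2DRoPE follows the usual LLaMA-style rotary embedding with the channel dimension split between the row and column axes and with positions rescaled toward an anchor (training) grid size; the exact formulation in the paper may differ:

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding along one axis: pairs channel i with channel i + d/2."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))   # (d/2,)
    ang = torch.einsum("n,f->nf", pos, inv_freq)                     # (N, d/2)
    ang = torch.cat([ang, ang], dim=-1)                              # (N, d)
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat([-x2, x1], dim=-1)
    return x * ang.cos() + rotated * ang.sin()

def as2d_rope(x, grid_h, grid_w, anchor=14):
    """Sketch: rotate half the channels by the row index, half by the column index.
    Auto-scaling shrinks test-time coordinates back into the range seen at the
    anchor (training) grid size, so larger images reuse familiar positions."""
    B, N, D = x.shape                                   # N == grid_h * grid_w
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    ys = ys.flatten().float() * (anchor / grid_h)       # scaled row positions
    xs = xs.flatten().float() * (anchor / grid_w)       # scaled column positions
    x_rows, x_cols = x.chunk(2, dim=-1)                 # split channels between the two axes
    return torch.cat([rope_1d(x_rows, ys), rope_1d(x_cols, xs)], dim=-1)

# Applied to queries and keys inside attention; here a 28x28 grid anchored at 14x14.
q = torch.randn(2, 28 * 28, 64)
print(as2d_rope(q, 28, 28).shape)                       # torch.Size([2, 784, 64])
```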
Training styles and tasks
- Supervised training: The model learns from labeled data (e.g., this is a cat).
- Self-supervised training (Masked Autoencoding, MAE): The model learns by hiding random parts of an image and trying to reconstruct them, like solving a jigsaw puzzle with missing pieces (a short sketch follows this list).
- Tested on many tasks:
- Image generation (making new pictures) using diffusion models (DiT and SiT frameworks).
- Diffusion models are like teaching the model to start from noise and “clean” it step by step into a realistic image.
- Image classification (what’s in the picture).
- Semantic segmentation (coloring each pixel to show what object it belongs to).
- Object detection (drawing boxes around things and naming them).
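The masked-autoencoding step mentioned above can be made concrete with a small sketch (a hypothetical `random_masking` helper following the usual MAE recipe, not the paper's code): a random subset of patch tokens is kept visible, the rest are hidden, and the model is trained to reconstruct the pixels of the hidden patches.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """MAE-style masking sketch: keep a random subset of patch tokens, hide the rest."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                            # one random score per patch
    ids_shuffle = noise.argsort(dim=1)                  # random permutation of patches
    ids_keep = ids_shuffle[:, :num_keep]                # indices of visible patches
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)                     # 1 = hidden, 0 = visible
    return visible, mask, ids_keep

tokens = torch.randn(4, 196, 768)                       # 14x14 grid of patch tokens
visible, mask, ids_keep = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))                   # (4, 49, 768); 147 hidden per image
# The encoder only processes `visible`; a light decoder fills in mask tokens and the
# training loss is the reconstruction error on the hidden patches only.
```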
What did they find?
VisionLLaMA performed strongly across the board, often better than existing models:
- Image generation:
- Replacing the generator’s backbone with VisionLLaMA improved image quality scores (lower FID is better). In several setups, VisionLLaMA had clearly better FID and other metrics than DiT and SiT.
- It also reached good results faster (fewer training steps needed for strong performance).
- Classification (ImageNet):
- VisionLLaMA matched or slightly beat strong baselines like DeiT3 and Twins.
- Importantly, it handled larger image resolutions better without retraining, thanks to AS2DRoPE. That’s useful for tasks that need bigger images.
- Semantic segmentation (ADE20K):
- With the pyramid setup, it improved mIoU (a standard accuracy measure) by around 1–2 percentage points over popular backbones like Swin and Twins.
- Object detection (COCO):
- It improved both box mAP (how well it finds objects) and mask mAP (how well it outlines them) over Swin and Twins.
- In a self-supervised setup, it achieved better results with much shorter training (about one-third of the training budget used by a baseline).
- Self-supervised pretraining (MAE):
- After pretraining with MAE, VisionLLaMA scored higher on both full fine-tuning and linear probing (a fair way to test learned representations). Gains were noticeable and consistent.
Overall, VisionLLaMA often trained faster and reached higher accuracy than the best previous vision transformers, while staying close to the simple LLaMA-like design.
Why does this matter?
- One architecture, many jobs: Using a unified transformer design for both text and vision can simplify machine learning systems. It makes models easier to build, optimize, and deploy across tasks.
- Better at handling different image sizes: AS2DRoPE lets the model work on bigger or smaller images without retraining, which is practical for real-world applications.
- Stronger and more efficient: Faster training and better results mean less time and compute power to get high-quality models.
- A solid foundation: VisionLLaMA can be a new baseline for future work in image understanding and generation, and potentially help multimodal models (that read text and see images) work more smoothly together.
The authors plan to release code, which can help other researchers and developers build on these results and create better vision systems for everything from phones to robots to creative tools.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a focused list of what remains missing, uncertain, or unexplored in the paper, stated concretely to guide future research:
- Theoretical guarantees for AS2DRoPE (2D RoPE with auto-scaling) are absent: no analysis of stability, invariance to scaling, error bounds when interpolating/extrapolating, or its behavior under varying frequency bases and dimensional allocations.
- Aspect-ratio generalization is untested and under-specified: AS2DRoPE is derived assuming square images and evaluated mostly at fixed aspect ratios; how it behaves for non-square inputs and diverse aspect ratios (e.g., panoramic, portrait) is not demonstrated.
- The AS2DRoPE formula is incomplete in the text (the central equation is truncated), which impedes reproducibility; a complete specification (including how queries/keys are rotated and how scale factors are computed from anchor resolution) is needed.
- Position calibration for GSA (global sub-sampled attention) is illustrated only with a toy example; a general formula for arbitrary kernel sizes, strides, dilations, padding, and nonuniform sampling is missing (one possible formulation is sketched after this list).
- The choice to share the same 2D RoPE across all attention heads is only lightly ablated; the trade-off versus head-specific positional encodings (including per-head frequency bases and axis decoupling) across tasks and scales remains unexplored.
- Inter-axis frequency coupling in 2D RoPE (sharing the same frequency for both axes) is assumed without rigorous comparison to independent frequency schedules; impact on anisotropic patterns, elongated objects, and perspective distortion is unknown.
- Generalization across resolutions is evaluated mainly for classification; downstream tasks with truly variable resolutions (e.g., detector training with multi-scale sampling and test-time varying short-side/long-side policies) are not systematically studied.
- Dynamic-resolution training is not explored: AS2DRoPE is proposed to handle arbitrary resolutions mostly at inference; training with variable input sizes, curriculum schedules, or multi-scale policies could change outcomes.
- Plain VisionLLaMA scalability to high-resolution inputs is not evaluated (beyond 768 in classification and 256 in generation); its practicality, given the quadratic attention cost at substantially higher resolutions, is unknown.
- Video understanding/generation is not studied despite being a prime use-case for RoPE-based long-context extensions; no temporal positional encoding design, latency constraints, or temporal resolution scaling strategy is provided.
- Multimodal integration is not demonstrated: although motivated by LLaMA architecture unification, there are no experiments on vision–language pretraining, alignment (e.g., CLIP-style), or end-to-end VLM tasks (e.g., captioning, VQA).
- Generation experiments are limited to 256×256; claims of AS2DRoPE enabling arbitrary resolutions are not validated for higher resolutions (e.g., 512, 1024), and the impact on FID/sFID, speed, and memory at those scales is unknown.
- The image generation datasets and training data specifics are unclear; a detailed accounting (dataset composition, data licenses, preprocessing) is needed for reproducibility and to assess distributional generalization and ethical considerations.
- Sensitivity to classifier-free guidance (CFG) is not thoroughly studied; only a few CFG settings are shown—no systematic sweeps or analysis of how CFG interacts with VisionLLaMA architecture and sampling schedules.
- Sampler choice analysis in SiT/DiT (ODE vs SDE) is limited; comprehensive trade-offs (quality vs compute vs stability) across samplers, step counts, and noise schedules for VisionLLaMA are missing.
- Architectural gains vs algorithmic accelerations are conflated: flash attention and mixed precision are used, but their isolated contributions are not ablated, making it unclear how much gain stems from the architecture itself.
- Hyperparameter fairness is not fully ensured: many baselines are run “as released” and VisionLLaMA is integrated without retuning; rigorous, matched hyperparameter searches for all models are required to make definitive claims.
- Training stability and variance across seeds are not characterized; the reported low variance for ViT-L in ablations is anecdotal—systematic seed runs and confidence intervals for each benchmark are missing.
- Scaling laws are not studied: how performance scales with parameters, data, and compute for VisionLLaMA (plain and pyramid) compared to ViT/Swin/Twins across tasks is unexplored.
- Memory footprint and latency benchmarks are incomplete; throughput is reported sparsely and not consistently for VisionLLaMA variants, and there are no end-to-end latency measurements (e.g., on A100 vs consumer GPUs or mobile/edge devices).
- Applicability of LLaMA-optimized inference techniques (e.g., GPTQ quantization, SmoothQuant, speculative decoding analogs, kernel fusion) to VisionLLaMA is claimed but not empirically validated on vision workloads.
- Patch size sensitivity and its interaction with AS2DRoPE are not studied (e.g., P=8/16/32 for plain ViT vs latent patch choices in DiT/SiT); downstream impacts on generation fidelity and dense prediction are unknown.
- Pyramid VisionLLaMA design choices (e.g., removing conditional PE, kernel sizes/strides for GSA) are minimally ablated; the interplay between local windowing, global attention sampling density, and 2D RoPE is not deeply analyzed.
- Downstream evaluations miss task-specific diagnostics: COCO AP is reported without size-wise breakdown (APs/APm/APl), and ADE20K segmentation lacks per-class, boundary, or region-wise analyses to understand where gains come from.
- Multi-scale inference policies (single vs multi-scale, test-time augmentation) are limited; VisionLLaMA’s robustness and gains under stronger inference protocols are unknown.
- SSL breadth is narrow: only MAE-style pretraining is used; comparisons with diverse SSL regimes (e.g., DINO/iBOT/MaskDistillation/MoCo v3) and their synergy with AS2DRoPE are missing.
- Cross-dataset generalization is limited: pretraining is constrained to ImageNet-1K; performance under domain shifts (e.g., ImageNet-21K, COCO-Stuff, LVIS, Cityscapes, OpenImages) or low-data regimes is not assessed.
- Unified weights across tasks are not demonstrated: the paper trains separate models per task; whether one VisionLLaMA backbone can be pre-trained once and fine-tuned effectively across generation and perception tasks remains open.
- Robustness and security aspects (adversarial resilience, noise robustness, occlusions, corruptions) are not examined; how 2D RoPE affects vulnerability or robustness is unknown.
- Ethical considerations for generative results are not addressed (content safety, bias, misuse potential), nor are dataset filtering and prompt safety measures discussed.
- Implementation details for AS2DRoPE anchoring are under-specified: criteria for choosing the “anchor resolution,” handling of mixed-resolution batches, and the effect of anchor mismatch during fine-tuning are not investigated.
- The observed failure of 1D RoPE at larger image resolutions (“severely degrades to zero” at 448×448) is not analyzed; understanding the failure mode and boundary conditions could inform positional design.
- Combining VisionLLaMA with complementary positional modules (e.g., PEG/CPE) shows slight gains in one ablation, but a systematic study of hybrid positional encodings across tasks and scales is missing.
- Resource accounting (wall-clock time, GPU-hours per benchmark) is not provided; cost–performance trade-offs, energy considerations, and carbon footprint are essential for large-scale training claims.
- Extension to additional vision tasks (e.g., depth estimation, keypoint detection, tracking, instance segmentation beyond Mask R-CNN, panoptic segmentation) is absent; the “unified” claim would benefit from broader coverage.
- Code and pretrained models are not yet available at the stated URL; until release, reproducibility and external verification are blocked.
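For the GSA position-calibration point above, one plausible general rule (an illustrative sketch, not the paper's formula) is to assign each sub-sampled token the average coordinate of the patch positions its pooling window covers; average pooling applied to a coordinate grid implements this for arbitrary kernel sizes, strides, and padding.

```python
import torch
import torch.nn.functional as F

def calibrated_positions(h, w, kernel=2, stride=2, padding=0):
    """Give each pooled (sub-sampled) token the mean (y, x) coordinate of the patch
    positions inside its pooling window, mirroring the pooling used for features.
    Padded cells are excluded from the average via count_include_pad=False."""
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    coords = torch.stack([ys, xs]).unsqueeze(0)                   # (1, 2, h, w)
    pooled = F.avg_pool2d(coords, kernel, stride, padding, count_include_pad=False)
    return pooled.squeeze(0).permute(1, 2, 0)                     # (h', w', 2) calibrated (y, x)

print(calibrated_positions(4, 4, kernel=2, stride=2))
# Each 2x2 window of a 4x4 grid maps to its centre, e.g. (0.5, 0.5), (0.5, 2.5), ...
```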