PaLI-3 Vision Language Models: Smaller, Faster, Stronger (2310.09199v2)

Published 13 Oct 2023 in cs.CV

Abstract: This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.

Overview of PaLI-3: Advanced Vision-Language Models

The PaLI-3 model represents a significant step forward for vision-language models (VLMs), combining reduced size, increased speed, and improved performance. Unlike many contemporary models that scale into tens of billions of parameters, PaLI-3 delivers comparable, and in many cases superior, performance with only 5 billion parameters. This makes it an attractive option for resource-efficient deployment and offers insights into the efficacy of advanced pretraining techniques.
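
To make the overall design concrete, below is a minimal, illustrative sketch of the general PaLI-style wiring: a contrastively pretrained ViT encodes the image into visual tokens, which are linearly projected and prepended to the text tokens of an encoder-decoder language model that generates the answer as free-form text. All class names, dimensions, and module choices here are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ToyPaLIStyleVLM(nn.Module):
    """Toy stand-in for the PaLI-3 recipe: visual tokens + text prompt -> generated text."""

    def __init__(self, vit_dim=1152, lm_dim=1024, vocab=32000):
        super().__init__()
        # Stand-ins for the real components (SigLIP ViT image encoder, UL2 encoder-decoder).
        self.proj = nn.Linear(vit_dim, lm_dim)       # map visual tokens into the LM's space
        self.text_emb = nn.Embedding(vocab, lm_dim)
        self.lm = nn.Transformer(d_model=lm_dim, batch_first=True)  # toy encoder-decoder
        self.head = nn.Linear(lm_dim, vocab)

    def forward(self, image_feats, prompt_ids, target_ids):
        # image_feats: (B, N_patches, vit_dim) patch features from the ViT image encoder
        vis = self.proj(image_feats)                 # (B, N, lm_dim) visual "soft tokens"
        txt = self.text_emb(prompt_ids)              # (B, T_prompt, lm_dim)
        enc_in = torch.cat([vis, txt], dim=1)        # visual tokens prepended to the prompt
        dec_in = self.text_emb(target_ids)           # teacher-forced decoder inputs
        out = self.lm(enc_in, dec_in)                # (B, T_target, lm_dim)
        return self.head(out)                        # next-token logits over the vocabulary

# Quick shape check with random placeholder inputs.
model = ToyPaLIStyleVLM()
feats = torch.randn(2, 16, 1152)                    # pretend ViT patch features
logits = model(feats,
               torch.zeros(2, 4, dtype=torch.long),  # prompt token ids
               torch.zeros(2, 3, dtype=torch.long))  # target token ids
print(logits.shape)                                  # torch.Size([2, 3, 32000])
```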

Key Innovations

The notable innovations of PaLI-3 center on three main improvements:

  1. Pretraining Approach: The model uses a contrastive pretraining strategy (SigLIP) for its image encoder, diverging from traditional classification-based pretraining. This approach exploits web-scale image-text data and yields superior performance across diverse multimodal tasks, particularly those requiring visually-situated text understanding and object localization (a sketch of the sigmoid contrastive loss follows this list).
  2. Dataset and Training Enhancements: PaLI-3 refines its multimodal training through an improved mixture of datasets that better supports the variety of downstream tasks, such as cross-modal retrieval and visually-situated text understanding. It also uses higher-resolution inputs, which contribute significantly to model accuracy.
  3. Scalability and Efficiency: The model's scalability is demonstrated by its impressive performance on benchmarks despite being an order of magnitude smaller than competing models. This highlights the potential of contrastive pretraining to extract more meaningful representations in a compact parameter space.
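
To ground item 1 above, here is a minimal sketch of a sigmoid-style pairwise contrastive loss in the spirit of SigLIP: every image-text pair in the batch is treated as an independent binary classification (matched vs. unmatched), scaled by a learnable temperature and bias. The initial values of t and b and the exact reduction below are assumptions for illustration rather than the published training recipe.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings; t, b: learnable scalars."""
    logits = img_emb @ txt_emb.T * t + b                            # (B, B) pairwise scores
    labels = 2 * torch.eye(len(logits), device=logits.device) - 1   # +1 on diagonal, -1 off
    # Each image-text pair is an independent binary decision: matched or not matched.
    return -F.logsigmoid(labels * logits).sum() / len(logits)

# Usage with random placeholder embeddings (batch of 8, dimension 512).
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
loss = sigmoid_contrastive_loss(img, txt, t=torch.tensor(10.0), b=torch.tensor(-10.0))
```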

Performance and Benchmarking

PaLI-3 sets new state-of-the-art results across several task families:

  • Multimodal Tasks: The model achieves leading results in multilingual cross-modal retrieval, with robust improvements over previous state-of-the-art models, particularly for low-resource languages (see the retrieval-scoring sketch after this list).
  • Scene Text and Localization Tasks: Notably, PaLI-3 excels at tasks like TextVQA and Referring Expression Segmentation, demonstrating the advantages of SigLIP pretraining for tasks that require fine-grained understanding of spatial layout and text embedded in images.
  • General Vision Tasks: Even without video-specific pretraining data, PaLI-3 performs admirably on video QA benchmarks, illustrating its generalization capabilities.
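
As a rough illustration of how cross-modal retrieval is typically scored with a contrastively trained encoder pair, the sketch below ranks candidate captions for each image by cosine similarity; recall@k then counts how often the ground-truth caption appears among the top k. The function, shapes, and setup are illustrative assumptions, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

def retrieve_topk(image_embs, caption_embs, k=5):
    """image_embs: (N_img, D), caption_embs: (N_cap, D) from the two encoders.
    Returns the indices of the k best-matching captions per image."""
    img = F.normalize(image_embs, dim=-1)
    cap = F.normalize(caption_embs, dim=-1)
    sims = img @ cap.T                       # (N_img, N_cap) cosine similarities
    return sims.topk(k, dim=-1).indices      # top-k caption indices for each image

# Recall@5 when caption i is the ground truth for image i (illustrative setup).
topk = retrieve_topk(torch.randn(100, 512), torch.randn(100, 512), k=5)
gt = torch.arange(100).unsqueeze(-1)         # ground-truth caption index per image
recall_at_5 = (topk == gt).any(dim=-1).float().mean()
```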

Theoretical and Practical Implications

PaLI-3's development offers new research pathways in VLM architecture design, particularly regarding the application of contrastive pretraining techniques in smaller, more efficient models. The research indicates that pretraining strategies that move beyond the conventional classification tasks can substantially enhance model performance in complex task domains. This pivot towards utilizing noisy, yet large-scale web data aligns with broader trends in AI research that aim to leverage abundant, less curated data as a source of robust learning signals.

Future Directions

The research team highlights several avenues for future work, notably in refining the pretraining processes further and extending the scope of tasks that VLMs can address effectively. Continued investigation into how vision and language representations can be jointly learned will likely yield additional improvements in model interoperability and versatility.

In summary, PaLI-3 represents a significant stride towards efficient, high-performance VLMs that do not necessitate exorbitant computational resources, fostering advancements in both applied and theoretical domains of artificial intelligence research. By leveraging contrastive image-text pretraining paradigms, PaLI-3 lays the groundwork for future explorations into the rich potential of smaller, context-aware models in AI.

Authors (19)
  1. Xi Chen
  2. Xiao Wang
  3. Lucas Beyer
  4. Alexander Kolesnikov
  5. Jialin Wu
  6. Paul Voigtlaender
  7. Basil Mustafa
  8. Sebastian Goodman
  9. Ibrahim Alabdulmohsin
  10. Piotr Padlewski
  11. Daniel Salz
  12. Xi Xiong
  13. Daniel Vlasic
  14. Filip Pavetic
  15. Keran Rong
  16. Tianli Yu
  17. Daniel Keysers
  18. Xiaohua Zhai
  19. Radu Soricut