
PaLI: A Jointly-Scaled Multilingual Language-Image Model (2209.06794v4)

Published 14 Sep 2022 in cs.CV and cs.CL

Abstract: Effective scaling and a flexible task interface enable LLMs to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder LLMs and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

Overview of PaLI: A Jointly-Scaled Multilingual Language-Image Model

The paper "PaLI: A Jointly-Scaled Multilingual Language-Image Model" introduces a model designed to integrate the capabilities of language and vision models into a unified framework. This approach emphasizes scalability in both vision and language components, leveraging large pre-trained Transformers to enhance performance across various multimodal tasks.

Model Architecture and Key Components

PaLI, or the Pathways Language and Image model, is built on an encoder-decoder Transformer: a Vision Transformer (ViT) encodes the image into a sequence of visual tokens, and an mT5-based encoder-decoder consumes these tokens together with the text input and generates the output text. Three main configurations are explored: PaLI-3B, PaLI-15B, and PaLI-17B, differing in how parameters are allocated between the vision and language components (a minimal sketch of this interface follows the component list below).

  • Vision Component: The paper introduces ViT-e, a 4-billion-parameter Vision Transformer, which yields significant gains on vision-language tasks beyond what the previously largest model, ViT-G, achieves.
  • Language Component: The language capacity is provided by mT5; the largest configuration pairs the 13B-parameter mT5-XXL with ViT-e. Reusing a strong pre-trained language model is crucial for retaining language understanding and generation capabilities when the model is extended to multimodal tasks.
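
At a high level, this interface can be pictured with a minimal sketch: the ViT turns the image into a sequence of visual tokens, which are projected into the language model's embedding space, concatenated with the embedded text prompt, and consumed by an encoder-decoder Transformer that generates the output text. The toy modules and hyperparameters below are illustrative stand-ins for the pre-trained ViT and mT5 components, not the paper's implementation:

    import torch
    import torch.nn as nn

    class PaLIStyleModel(nn.Module):
        def __init__(self, vocab_size=32000, d_model=512, d_vit=768, patch_dim=16 * 16 * 3):
            super().__init__()
            self.vit = nn.Linear(patch_dim, d_vit)        # toy "ViT": embeds pre-cut image patches
            self.visual_proj = nn.Linear(d_vit, d_model)  # project visual tokens into the LM space
            self.text_embed = nn.Embedding(vocab_size, d_model)
            self.enc_dec = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, patches, prompt_ids, target_ids):
            # patches: (B, num_patches, patch_dim); prompt_ids/target_ids: (B, L)
            visual_tokens = self.visual_proj(self.vit(patches))   # (B, num_patches, d_model)
            prompt_tokens = self.text_embed(prompt_ids)           # (B, L_prompt, d_model)
            encoder_input = torch.cat([visual_tokens, prompt_tokens], dim=1)
            decoder_input = self.text_embed(target_ids)
            hidden = self.enc_dec(encoder_input, decoder_input)
            return self.lm_head(hidden)                           # next-token logits

    # Example: one image as 14 x 14 = 196 patches plus a short text prompt.
    model = PaLIStyleModel()
    patches = torch.randn(1, 196, 16 * 16 * 3)
    prompt_ids = torch.randint(0, 32000, (1, 8))
    target_ids = torch.randint(0, 32000, (1, 12))
    logits = model(patches, prompt_ids, target_ids)   # shape (1, 12, 32000)

In this sketch the projection layer is what keeps the design modular: either side can be swapped for a larger pre-trained model without changing the text-generation interface.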

Training and Data

To enable effective training for multilingual scenarios, the authors employ WebLI: a large-scale dataset containing 10 billion images with texts in over 100 languages. The mixture used for training includes several multimodal objectives like text span corruption, split-captioning, OCR tasks, and visual question answering (VQA), ensuring broad task coverage and robust pre-training.
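
As an illustration of how such a mixture can be driven through a single text-to-text interface, the sketch below samples (prompt, target) training pairs for a few task types. The templates, sampling weights, and field names are hypothetical and do not reproduce the paper's exact prompts or mixing ratios:

    import random

    # Hypothetical prompt templates: every task is cast as text generation
    # conditioned on the image and an input text.
    TASKS = {
        "caption":   {"weight": 0.4, "prompt": "Generate the alt_text in {lang}."},
        "ocr":       {"weight": 0.2, "prompt": "OCR: list the text written on the image."},
        "vqa":       {"weight": 0.3, "prompt": "Answer in {lang}: {question}"},
        "span_fill": {"weight": 0.1, "prompt": "Fill in the missing span: {masked_text}"},
    }

    def sample_example(record, lang="en"):
        """record: dict holding an image reference plus per-task targets."""
        names = list(TASKS)
        weights = [TASKS[n]["weight"] for n in names]
        task = random.choices(names, weights=weights, k=1)[0]
        prompt = TASKS[task]["prompt"].format(lang=lang,
                                              question=record.get("question", ""),
                                              masked_text=record.get("masked_text", ""))
        return {"image": record["image"], "input_text": prompt,
                "target_text": record["targets"][task]}

    # Usage with a single WebLI-style record carrying per-task targets.
    record = {"image": "img_0001.jpg",
              "question": "What color is the bus?",
              "masked_text": "A red <X> parked near the station.",
              "targets": {"caption": "A red bus parked near the station.",
                          "ocr": "CITY LINE 42",
                          "vqa": "red",
                          "span_fill": "bus"}}
    print(sample_example(record))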

Numerical Results and Performance

PaLI's performance is evaluated over multiple tasks, achieving state-of-the-art results in both monolingual and multilingual settings. Key benchmarks include:

  • Image Captioning: On COCO Captions, PaLI-17B sets a new state of the art with a CIDEr score of 149.1, and it also performs strongly on the out-of-domain portion of NoCaps.
  • Visual Question Answering: PaLI achieves state-of-the-art accuracy on VQAv2 using open-vocabulary text generation, surpassing models that rely on fixed-vocabulary classification heads (the accuracy metric behind this comparison is sketched after this list).
  • Zero-shot Image Classification: PaLI shows compelling results on ImageNet and its out-of-distribution variants without any fine-tuning on these datasets.
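
For context on the VQA comparison above, open-ended generated answers are scored with the standard VQAv2 accuracy metric, which credits a prediction according to how many of the ten human annotators gave the same answer, capped at full credit. Below is a minimal sketch of the commonly used simplified form; the official evaluator additionally normalizes answers and averages over annotator subsets:

    def vqa_accuracy(predicted, human_answers):
        """Simplified VQAv2 accuracy: min(#matching human answers / 3, 1).

        Assumes `predicted` and `human_answers` are already lower-cased and
        normalized (the official script also strips punctuation and articles).
        """
        matches = sum(1 for ans in human_answers if ans == predicted)
        return min(matches / 3.0, 1.0)

    # Example: 10 annotators answered "What color is the bus?"
    answers = ["red"] * 7 + ["maroon"] * 2 + ["dark red"]
    print(vqa_accuracy("red", answers))     # 1.0  (7 matches, capped)
    print(vqa_accuracy("maroon", answers))  # 0.666...  (2 matches)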

Implications and Future Directions

The joint scaling results suggest a strategic direction for future multimodal models: scale the vision and language components in tandem rather than letting the language model dominate the parameter budget. The empirical gains from ViT-e support the value of larger vision backbones for vision-heavy multimodal tasks, while the effective reuse of mT5 shows that large language models retain their capabilities when extended to multimodal domains.

Future research may build on this work by exploring even larger vision models or refining the pre-training mixture to further improve downstream performance. The consistent gains on multilingual benchmarks reinforce the potential of models like PaLI in global, multi-language settings, a step towards more broadly applicable AI systems.

Authors (29)
  1. Xi Chen (1035 papers)
  2. Xiao Wang (507 papers)
  3. Soravit Changpinyo (24 papers)
  4. AJ Piergiovanni (40 papers)
  5. Piotr Padlewski (9 papers)
  6. Daniel Salz (8 papers)
  7. Sebastian Goodman (12 papers)
  8. Adam Grycner (2 papers)
  9. Basil Mustafa (32 papers)
  10. Lucas Beyer (46 papers)
  11. Alexander Kolesnikov (44 papers)
  12. Joan Puigcerver (20 papers)
  13. Nan Ding (57 papers)
  14. Keran Rong (9 papers)
  15. Hassan Akbari (8 papers)
  16. Gaurav Mishra (14 papers)
  17. Linting Xue (9 papers)
  18. Ashish Thapliyal (2 papers)
  19. James Bradbury (20 papers)
  20. Weicheng Kuo (23 papers)
Citations (598)