BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2301.12597v3)

Published 30 Jan 2023 in cs.CV

Abstract: The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen LLMs. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen LLM. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

This paper introduces BLIP-2, a novel vision-language pre-training (VLP) framework designed to be efficient and effective by leveraging readily available pre-trained unimodal models. The core challenge it addresses is the prohibitive computational cost of traditional end-to-end VLP methods that train large vision and language models from scratch.

BLIP-2 proposes using frozen pre-trained image encoders and frozen LLMs, significantly reducing the number of trainable parameters and pre-training time. The key innovation is the Querying Transformer (Q-Former), a lightweight module designed to bridge the modality gap between the frozen image encoder and the frozen LLM.

The Q-Former employs a set of learnable query embeddings to interact with the frozen image encoder via cross-attention, extracting visual features relevant to the language modality. It acts as an information bottleneck, feeding concise and relevant visual information to the LLM.
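
To illustrate this mechanism, here is a minimal PyTorch sketch of a Q-Former-style module: a fixed set of learnable query embeddings self-attends, cross-attends to frozen image-encoder features, and passes through a feed-forward network. Layer counts, dimensions, and the stand-in image features are illustrative assumptions, not the authors' implementation (which initializes the Q-Former from BERT-base and also shares its self-attention layers with a text branch).

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """One simplified Q-Former layer: self-attention over the queries,
    cross-attention from the queries to frozen image features, then an FFN."""

    def __init__(self, dim=768, num_heads=12, img_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True,
                                                kdim=img_dim, vdim=img_dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, image_feats):
        # queries: (B, 32, dim); image_feats: (B, num_patches, img_dim) from the frozen encoder
        q = self.norm1(queries)
        queries = queries + self.self_attn(q, q, q, need_weights=False)[0]
        q = self.norm2(queries)
        queries = queries + self.cross_attn(q, image_feats, image_feats, need_weights=False)[0]
        return queries + self.ffn(self.norm3(queries))

class QFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, img_dim=1024, depth=2):
        super().__init__()
        # The 32 learnable queries are the information bottleneck described above.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(QFormerBlock(dim, 12, img_dim) for _ in range(depth))

    def forward(self, image_feats):
        q = self.queries.expand(image_feats.size(0), -1, -1)
        for blk in self.blocks:
            q = blk(q, image_feats)
        return q  # (B, 32, dim): a compact visual representation for the LLM

# Example: 257 patch tokens from a frozen ViT are distilled into 32 query outputs.
image_feats = torch.randn(2, 257, 1024)   # stand-in for frozen image-encoder output
query_out = QFormer()(image_feats)        # shape: (2, 32, 768)
```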

BLIP-2's pre-training strategy for the Q-Former consists of two stages:

  1. Vision-Language Representation Learning: The Q-Former is connected to the frozen image encoder and trained on image-text pairs. It optimizes three objectives simultaneously:
    • Image-Text Contrastive Learning (ITC): Aligns image and text representations by maximizing their mutual information with a contrastive loss.
    • Image-grounded Text Generation (ITG): Trains the Q-Former to output visual representations that enable the generation of the associated text. This forces the queries to capture text-relevant visual information.
    • Image-Text Matching (ITM): A binary classification task to predict if an image-text pair matches, learning fine-grained alignment. This stage teaches the Q-Former to extract visual features that are most informative for the corresponding text.
  2. Vision-to-Language Generative Learning: The Q-Former (together with the frozen image encoder it is attached to) is connected to a frozen LLM. A linear layer projects the Q-Former's output query embeddings to the LLM's input dimension, and the projected embeddings are prepended to the text input as soft visual prompts that condition the LLM's generation. This stage trains the Q-Former (and the projection layer) so that its visual representations can be interpreted by the frozen LLM, using standard language modeling objectives (causal LM for decoder-only models such as OPT, prefix LM for encoder-decoder models such as FlanT5). Minimal sketches of both stages follow this list.
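
To make the two stages concrete, the following are hedged PyTorch sketches under assumed shapes and module names; they simplify the actual implementation. The first covers the stage-1 image-text contrastive (ITC) objective: the image-text similarity is the maximum similarity between any of the 32 query outputs and the text representation, computed over in-batch negatives (the paper's learnable temperature and the ITG/ITM objectives are omitted here).

```python
import torch
import torch.nn.functional as F

def itc_loss(query_out, text_cls, temperature=0.07):
    """Simplified stage-1 image-text contrastive loss.
    query_out: (B, 32, D) Q-Former query outputs; text_cls: (B, D) text features."""
    q = F.normalize(query_out, dim=-1)
    t = F.normalize(text_cls, dim=-1)
    # Pairwise similarity of every image (rows) with every text (columns);
    # for each pair, keep the best-matching query.
    sim = torch.einsum("iqd,jd->ijq", q, t).max(dim=-1).values / temperature  # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    # Symmetric InfoNCE over in-batch negatives.
    return (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2
```

The second sketch shows the stage-2 wiring: a trainable linear layer maps the query outputs into the frozen LLM's embedding space, the projected vectors are prepended to the text token embeddings as soft visual prompts, and a causal language-modeling loss is applied to the text tokens only. A small OPT checkpoint loaded via Hugging Face transformers serves as a stand-in for the much larger frozen decoders used in the paper; treat the checkpoint name and shapes as assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, OPTForCausalLM

# Frozen decoder LLM; only the Q-Former and this projection layer receive gradients.
tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
llm = OPTForCausalLM.from_pretrained("facebook/opt-125m")
llm.requires_grad_(False)

# Trainable projection from the Q-Former output dim (768) to the LLM embedding dim.
proj = nn.Linear(768, llm.get_input_embeddings().embedding_dim)

def stage2_loss(query_out, captions):
    """query_out: (B, 32, 768) from the Q-Former; captions: list of B strings."""
    batch = tok(captions, return_tensors="pt", padding=True)
    text_embeds = llm.get_input_embeddings()(batch.input_ids)        # (B, T, D_llm)
    visual_prompt = proj(query_out)                                  # (B, 32, D_llm)
    inputs_embeds = torch.cat([visual_prompt, text_embeds], dim=1)   # soft prompt + text

    prompt_mask = torch.ones(visual_prompt.shape[:2], dtype=torch.long)
    attention_mask = torch.cat([prompt_mask, batch.attention_mask], dim=1)

    # Language-modeling loss on the text positions only (-100 masks prompt and padding).
    labels = batch.input_ids.masked_fill(batch.attention_mask == 0, -100)
    labels = torch.cat([torch.full_like(prompt_mask, -100), labels], dim=1)

    out = llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)
    return out.loss

# Usage: loss = stage2_loss(q_former(image_feats), ["a cat on a sofa", "two dogs playing"])
```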

Key Contributions and Findings:

  • Efficiency: BLIP-2 drastically reduces the number of trainable parameters compared to end-to-end methods (e.g., 54x fewer than Flamingo80B while outperforming it on zero-shot VQAv2). Pre-training is significantly faster.
  • Performance: Achieves state-of-the-art results on various vision-language tasks, including zero-shot VQA, image captioning (especially on out-of-domain datasets like NoCaps), and image-text retrieval.
  • Emerging Capabilities: Enables zero-shot instructed image-to-text generation by prompting the frozen LLM with both the visual prompt (from the Q-Former) and a text instruction. This allows for tasks like visual conversation, visual knowledge reasoning, and personalized captioning (see the usage sketch after this list).
  • Modularity and Scalability: Demonstrates that performance improves when using stronger frozen image encoders (e.g., ViT-g vs. ViT-L) or larger/better LLMs (e.g., FlanT5 vs. OPT, larger variants within families), validating its ability to leverage future advances in unimodal models.
  • Importance of Representation Learning: Ablation studies show that the first pre-training stage is crucial for effective learning and preventing catastrophic forgetting in the LLM during the second stage.
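
As a usage-level illustration of zero-shot instructed generation, the sketch below loads a released BLIP-2 checkpoint through the Hugging Face transformers integration and prompts it with an image plus a natural-language instruction. The integration and checkpoint name are not described in this summary, so treat them as assumptions; the image URL is a placeholder.

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed public checkpoint; other BLIP-2 OPT / FlanT5 variants work the same way.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)  # placeholder URL

# The text instruction is consumed by the frozen LLM together with the Q-Former's visual prompt.
prompt = "Question: what is the animal in the picture doing? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```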

Implementation Details:

  • The Q-Former (188M parameters in total) is initialized from pre-trained BERT-base weights, with its cross-attention layers randomly initialized, and uses 32 learnable query embeddings.
  • Experiments use ViT-L/14 and ViT-g/14 as frozen image encoders and the OPT and FlanT5 model families as frozen LLMs (a minimal freezing sketch follows this list).
  • Training uses standard large-scale image-text datasets (COCO, Visual Genome, Conceptual Captions, SBU, LAION) with caption filtering via BLIP's CapFilt method.
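
As referenced above, here is a minimal sketch of the freezing pattern for the image encoder, using a CLIP ViT-L/14 checkpoint from Hugging Face transformers as a stand-in (the paper also uses EVA ViT-g/14; the checkpoint name and preprocessing are assumptions):

```python
import torch
from transformers import CLIPVisionModel

# Frozen image encoder: its parameters are never updated in either pre-training stage.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
vision.requires_grad_(False)
vision.eval()

pixel_values = torch.randn(2, 3, 224, 224)  # placeholder batch of preprocessed images
with torch.no_grad():
    # BLIP-2 feeds the second-to-last ViT layer's output features to the Q-Former.
    outputs = vision(pixel_values=pixel_values, output_hidden_states=True)
    image_feats = outputs.hidden_states[-2]  # (2, 257, 1024)
# image_feats is consumed by the Q-Former sketch shown earlier.
```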

Limitations:

  • The model did not exhibit strong in-context few-shot learning capabilities, potentially due to the pre-training data format (single image-text pairs).
  • It inherits potential risks from the frozen LLMs, such as generating inaccurate, biased, or harmful content.

In conclusion, BLIP-2 presents an efficient and effective method for vision-language pre-training by bootstrapping from frozen unimodal models via a lightweight Q-Former trained in two stages, achieving strong performance and enabling novel zero-shot instructed generation capabilities.

Authors (4)
  1. Junnan Li
  2. Dongxu Li
  3. Silvio Savarese
  4. Steven Hoi