Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Published 13 Jul 2023 in cs.CV and cs.LG | (2307.07063v4)

Abstract: We present a novel methodology aimed at optimizing the application of frozen LLMs for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide LLMs, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts, which is trained exclusively on linguistic data, bypassing the need for image-text pairings. This strategy subtly bifurcates the end-to-end VL training process into an additional, separate stage. Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively narrows the performance gap between models trained with either 4M or 129M image-text pairs. Importantly, our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task using varied base modules. The code will be made available at https://github.com/yiren-jian/BLIText.

Abstract PDF HTML Upgrade to Chat

Citations (16)

View on Semantic Scholar

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Continue Learning

We haven't generated follow-up questions for this paper yet.

Generate Now

Authors (3)

Collections

GitHub

GitHub - yiren-jian/BLIText: [NeurIPS 2023] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training (27 stars)

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Summary

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Related Papers

Authors (3)

Collections

GitHub