MLLMs-Augmented Visual-Language Representation Learning (2311.18765v3)
Abstract: Visual-language pre-training has achieved remarkable success on many multi-modal tasks, largely owing to the availability of large-scale image-text datasets. In this work, we demonstrate that multi-modal large language models (MLLMs) can enhance visual-language representation learning by establishing richer image-text associations for image-text datasets. Our approach is simple: we use MLLMs to generate multiple diverse captions for each image. To prevent the bias introduced by MLLMs' hallucinations and monotonous language styles, we propose "text shearing" to maintain the quality and availability of the extended captions. On image-text retrieval, without introducing additional training cost, our method consistently obtains improvements of 5.6 to 35.0 and 16.8 to 46.1 points on Recall@1 under the fine-tuning and zero-shot settings, respectively. Notably, our zero-shot results are comparable to fine-tuning on the target datasets, which encourages further exploration of the versatile use of MLLMs.
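As a rough illustration of the pipeline the abstract describes, the sketch below generates several diverse MLLM captions per image and then "shears" each one to roughly the length of the original web caption. The `generate_caption` wrapper, the prompt set, and the word-level truncation rule are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of MLLM caption extension plus "text shearing" (assumed rule:
# truncate extended captions to about the original caption's length).

from typing import Callable, List


def extend_and_shear(
    image,                       # image in whatever format the MLLM wrapper expects
    raw_caption: str,            # original web-crawled caption
    generate_caption: Callable,  # hypothetical MLLM wrapper: (image, prompt) -> str
    num_captions: int = 4,       # number of diverse captions to add per image
) -> List[str]:
    """Return the raw caption plus num_captions sheared MLLM captions."""
    prompts = [
        "Describe the image briefly.",
        "Describe the main object in the image.",
        "Describe the scene and background.",
        "Describe the actions taking place in the image.",
    ][:num_captions]

    shear_len = len(raw_caption.split())  # shear target: original caption length (assumed)
    sheared = []
    for prompt in prompts:
        caption = generate_caption(image, prompt)
        # "Text shearing": cut the extended caption so hallucinated or stylistically
        # monotonous tails do not dominate training (assumed word-level truncation).
        sheared.append(" ".join(caption.split()[:shear_len]))
    return [raw_caption] + sheared
```

In training, each of the returned captions could then be paired with the image as an independent image-text pair, which keeps the per-step cost unchanged while diversifying the text supervision.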
Authors: Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You