SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models (2407.20756v4)
Abstract: Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which raises challenges in efficiency, effectiveness, data quality, and privacy when relying on web data. In this paper, we introduce SynthVLM, a novel data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to automatically synthesize and select high-resolution images from text descriptions, thereby creating precisely aligned image-text pairs. To demonstrate the power of SynthVLM, we introduce SynthVLM-100K, a high-quality dataset consisting of 100,000 curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of multimodal large language models (MLLMs), SynthVLM-7B and SynthVLM-13B, which achieve state-of-the-art (SOTA) performance on various visual question-answering (VQA) tasks. Notably, our models outperform LLaVA across most metrics with only 18% of the pretraining data. Furthermore, SynthVLM-7B and SynthVLM-13B attain SOTA performance on the MMLU benchmark, demonstrating that the high-quality SynthVLM-100K dataset preserves language abilities. To facilitate future research, our dataset and the complete data generation and curation methods are open-sourced at https://github.com/starriver030515/SynthVLM.
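The generate-then-select pipeline the abstract describes can be sketched as a small, model-agnostic curation loop. This is a minimal illustration, not the paper's implementation: the `generate` and `score` callables are hypothetical stand-ins for a text-to-image diffusion model (e.g. SDXL) and a CLIPScore-style image-text alignment metric, injected so the selection logic itself stays self-contained.

```python
from typing import Callable, List, Tuple

def curate_pairs(
    captions: List[str],
    generate: Callable[[str], object],
    score: Callable[[object, str], float],
    keep_top: int,
) -> List[Tuple[object, str]]:
    """Synthesize one image per caption, then keep the best-aligned pairs.

    `generate` would wrap a text-to-image diffusion model and `score`
    an image-text alignment metric; both are assumptions here, injected
    as callables so the curation step is independent of any model.
    """
    # Step 1: synthesize an image for every high-quality caption.
    pairs = [(generate(c), c) for c in captions]
    # Step 2: rank image-caption pairs by alignment score, best first,
    # and keep only the top-scoring subset for training.
    ranked = sorted(pairs, key=lambda p: score(p[0], p[1]), reverse=True)
    return ranked[:keep_top]

if __name__ == "__main__":
    # Toy demonstration with placeholder generator/scorer (no real models).
    caps = ["a red apple", "a blue car", "a green tree"]
    fake_generate = lambda c: f"<image for: {c}>"
    fake_score = lambda img, c: float(len(c))  # placeholder alignment score
    best = curate_pairs(caps, fake_generate, fake_score, keep_top=2)
    print([c for _, c in best])
```

In the real pipeline the scoring pass is what makes the data "curated" rather than merely synthetic: only pairs whose generated image aligns tightly with the source caption survive into the final 100K set.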