SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models (2407.20756v4)
Abstract: Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which raises challenges in efficiency, effectiveness, data quality, and privacy when relying on web data. In this paper, we introduce SynthVLM, a novel data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to automatically synthesize and select high-resolution images from text descriptions, thereby creating precisely aligned image-text pairs. To demonstrate the power of SynthVLM, we introduce SynthVLM-100K, a high-quality dataset consisting of 100,000 curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of multimodal large language models (MLLMs), SynthVLM-7B and SynthVLM-13B, which achieve state-of-the-art (SOTA) performance on various visual question-answering (VQA) tasks. Notably, our models outperform LLaVA across most metrics with only 18% of the pretraining data. Furthermore, SynthVLM-7B and SynthVLM-13B attain SOTA performance on the MMLU benchmark, demonstrating that the high-quality SynthVLM-100K dataset preserves language abilities. To facilitate future research, our dataset and the complete data generation and curation methods are open-sourced at https://github.com/starriver030515/SynthVLM.
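The generate-then-select pipeline the abstract describes can be sketched as a small, model-agnostic curation loop. This is a minimal illustration, not the paper's implementation: the `generate` and `score` callables are hypothetical stand-ins for a text-to-image diffusion model (e.g. SDXL) and a CLIPScore-style image-text alignment metric, injected so the selection logic itself stays self-contained.

```python
from typing import Callable, List, Tuple

def curate_pairs(
    captions: List[str],
    generate: Callable[[str], object],
    score: Callable[[object, str], float],
    keep_top: int,
) -> List[Tuple[object, str]]:
    """Synthesize one image per caption, then keep the best-aligned pairs.

    `generate` would wrap a text-to-image diffusion model and `score`
    an image-text alignment metric; both are assumptions here, injected
    as callables so the curation step is independent of any model.
    """
    # Step 1: synthesize an image for every high-quality caption.
    pairs = [(generate(c), c) for c in captions]
    # Step 2: rank image-caption pairs by alignment score, best first,
    # and keep only the top-scoring subset for training.
    ranked = sorted(pairs, key=lambda p: score(p[0], p[1]), reverse=True)
    return ranked[:keep_top]

if __name__ == "__main__":
    # Toy demonstration with placeholder generator/scorer (no real models).
    caps = ["a red apple", "a blue car", "a green tree"]
    fake_generate = lambda c: f"<image for: {c}>"
    fake_score = lambda img, c: float(len(c))  # placeholder alignment score
    best = curate_pairs(caps, fake_generate, fake_score, keep_top=2)
    print([c for _, c in best])
```

In the real pipeline the scoring pass is what makes the data "curated" rather than merely synthetic: only pairs whose generated image aligns tightly with the source caption survive into the final 100K set.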