SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models (2407.20756v4)

Published 30 Jul 2024 in cs.CV and cs.CL

Abstract: Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which raises challenges related to the efficiency, effectiveness, quality, and privacy of web data. In this paper, we introduce SynthVLM, a novel data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to automatically synthesize and select high-resolution images from text descriptions, thereby creating precisely aligned image-text pairs. To demonstrate the power of SynthVLM, we introduce SynthVLM-100K, a high-quality dataset consisting of 100,000 curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of multimodal large language models (MLLMs), SynthVLM-7B and SynthVLM-13B, which achieve state-of-the-art (SOTA) performance on various visual question-answering (VQA) tasks. Notably, our models outperform LLaVA across most metrics with only 18% of the pretraining data. Furthermore, SynthVLM-7B and SynthVLM-13B attain SOTA performance on the MMLU benchmark, demonstrating that the high-quality SynthVLM-100K dataset preserves language abilities. To facilitate future research, our dataset and the complete data generation and curation methods are open-sourced at https://github.com/starriver030515/SynthVLM.
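The abstract describes a two-stage pipeline: synthesize images from high-quality captions with a diffusion model, then curate the resulting pairs by image-text alignment. Below is a minimal sketch of that loop, assuming SDXL via the diffusers library and a CLIPScore-style filter via transformers; the model checkpoints, the keep_top_frac parameter, and the synthesize_and_curate helper are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of caption-to-image synthesis plus alignment-based curation.
# Model choices and the selection fraction are assumptions for illustration.
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def synthesize_and_curate(captions, keep_top_frac=0.5):
    """Generate one image per caption, score image-text alignment with
    CLIP, and keep only the best-aligned pairs (hypothetical helper)."""
    scored = []
    for caption in captions:
        image = pipe(caption).images[0]  # text -> high-resolution image
        inputs = proc(
            text=[caption], images=image, return_tensors="pt", padding=True
        ).to("cuda")
        with torch.no_grad():
            out = clip(**inputs)
            # Cosine similarity of image and text embeddings, analogous
            # to a CLIPScore-style alignment metric.
            score = torch.nn.functional.cosine_similarity(
                out.image_embeds, out.text_embeds
            ).item()
        scored.append((score, caption, image))
    # Rank by alignment and keep the top fraction as the curated dataset.
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[: int(len(scored) * keep_top_frac)]
```

Ranking the synthesized pairs by alignment score and truncating, as above, is one simple way to realize the paper's selection step; since the caption is the generation prompt, the surviving pairs are precisely aligned by construction rather than by post-hoc captioning.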
