
Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings (2403.07750v2)

Published 12 Mar 2024 in cs.CV and cs.AI

Abstract: The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). In this work, we investigate an approach that leverages the strengths of LLMs and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Despite the text-to-image model and VLM initially being trained on the same data, our approach leverages the image generator's ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data, achieves comparable performance to models trained solely on human-annotated data, while requiring significantly less data. Furthermore, we perform a set of analyses on captions which reveals that semantic diversity and balance are key aspects for better downstream performance. Finally, we show that synthesizing images in the image embedding space is 25% faster than in the pixel space. We believe our work not only addresses a significant challenge in VLM training but also opens up promising avenues for the development of self-improving multi-modal models.

Enhancing Visual-Language Models with Synthetic Data Generation

Introduction

The development of Visual-Language Models (VLMs) has been significantly constrained by the limited availability and high cost of human-labeled image-caption datasets. In this research, we propose an approach that circumvents this bottleneck by leveraging the strengths of LLMs and image generation models to produce synthetic image-text pairs efficiently. We demonstrate that this approach facilitates VLM training, offering a pipeline for generating synthetic datasets that is customizable and broadly applicable.

Synthetic Data Creation

Our method introduces a mechanism for generating both text and images synthetically, removing the dependence on exhaustive real-world data collection. The process employs an LLM to produce captions for specified classes, which then inform the generation of corresponding image embeddings via a pretrained text-to-image model. Care is taken to train this image generator only on a specific human-annotated image-caption dataset, ensuring that the VLM is trained in a controlled setting without knowledge transfer from extensive external sources.
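
The two-stage flow can be pictured with the minimal sketch below. Everything here is an illustrative stand-in rather than the paper's implementation: the class list, the `caption_llm` and `text_to_image_encoder` callables, and the embedding size are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np


@dataclass
class SyntheticPair:
    """A synthetic caption plus the image embedding generated from it."""
    caption: str
    image_embedding: np.ndarray


def generate_synthetic_pairs(
    class_names: Sequence[str],
    caption_llm: Callable[[str], List[str]],
    text_to_image_encoder: Callable[[str], np.ndarray],
    captions_per_class: int = 2,
) -> List[SyntheticPair]:
    """Sketch of the two-stage pipeline: an LLM writes captions for each
    class, then a pretrained text-to-image model maps every caption directly
    to an image embedding (no pixel rendering)."""
    pairs: List[SyntheticPair] = []
    for name in class_names:
        for caption in caption_llm(name)[:captions_per_class]:
            pairs.append(SyntheticPair(caption, text_to_image_encoder(caption)))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    def toy_llm(class_name: str) -> List[str]:
        return [f"a photo of a {class_name} on a table",
                f"a close-up of a {class_name} outdoors"]

    def toy_generator(caption: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(caption)) % (2**32))
        return rng.standard_normal(256).astype(np.float32)

    pairs = generate_synthetic_pairs(["bicycle", "teapot"], toy_llm, toy_generator)
    print(len(pairs), pairs[0].caption, pairs[0].image_embedding.shape)
```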

Efficiency in Embedding Space

A notable innovation of our approach is that it operates in the image embedding space rather than relying on computationally heavy pixel-space rendering. By aligning the vision encoder of the VLM with the image generator's VQ-GAN backbone, we bypass the decoding and re-encoding steps, significantly streamlining and accelerating training without sacrificing performance.
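
The efficiency argument can be illustrated with the toy comparison below; the layer shapes and module names are assumptions made for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real model uses a VQ-GAN-based generator whose
# token embeddings are aligned with the VLM's vision encoder.
EMB_DIM, NUM_TOKENS, PIXELS = 64, 16, 3 * 32 * 32

text_to_image = nn.Linear(512, NUM_TOKENS * EMB_DIM)     # caption features -> image tokens
vqgan_decoder = nn.Linear(NUM_TOKENS * EMB_DIM, PIXELS)  # tokens -> pixels
pixel_encoder = nn.Linear(PIXELS, NUM_TOKENS * EMB_DIM)  # pixels -> tokens again
vlm_visual_head = nn.Linear(EMB_DIM, EMB_DIM)            # the VLM consumes tokens

captions = torch.randn(8, 512)  # pretend text features for a batch of captions
tokens = text_to_image(captions)

# Pixel-space route: decode the generated tokens to pixels, then re-encode
# them for the VLM -- the round trip that embedding-space training avoids.
pixel_route = vlm_visual_head(
    pixel_encoder(vqgan_decoder(tokens)).view(8, NUM_TOKENS, EMB_DIM))

# Embedding-space route: feed the generated embeddings to the VLM directly.
embedding_route = vlm_visual_head(tokens.view(8, NUM_TOKENS, EMB_DIM))

print(pixel_route.shape, embedding_route.shape)
```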

Evaluation and Performance

The efficacy of the proposed method is underpinned by comprehensive experiments. When the VLM is trained on a combination of human-annotated and synthetic data, it demonstrates a considerable performance increase over models trained exclusively on human-annotated datasets. More specifically, we observed a 17% performance improvement from integrating the synthetic dataset, validating that synthetic data can effectively augment the learning process of VLMs.
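
One way to picture training on a combination of human-annotated and synthetic pairs is the batch-mixing sketch below; the 50/50 mixing ratio and the toy dataset stand-ins are hypothetical and are not values reported in the paper.

```python
import random
from itertools import islice


def mixed_batches(human_pairs, synthetic_pairs, batch_size=8,
                  synthetic_fraction=0.5, seed=0):
    """Yield batches that interleave human-annotated and synthetic pairs.
    The 50/50 split is an illustrative default, not the paper's ratio."""
    rng = random.Random(seed)
    n_syn = int(batch_size * synthetic_fraction)
    while True:
        batch = (rng.sample(human_pairs, batch_size - n_syn)
                 + rng.sample(synthetic_pairs, n_syn))
        rng.shuffle(batch)
        yield batch


# Toy pools of (caption, image-embedding placeholder) pairs.
human = [(f"human caption {i}", f"real image {i}") for i in range(100)]
synthetic = [(f"llm caption {i}", f"generated embedding {i}") for i in range(100)]
for batch in islice(mixed_batches(human, synthetic), 3):
    print(len(batch))
```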

Theoretical and Practical Implications

This research not only tackles the practical limitations related to data availability and resource consumption but also opens new vistas for theoretical advancement in VLM training methodologies. The introduction of a workflow that integrates synthetic data generation effectively expands the horizon for creating large-scale, customized image-text pairs, enhancing the model's learning dynamics and applicability across various domains.

Future Prospects in AI

The implications of this paper extend beyond immediate applications in VLM training, proposing a framework that might accelerate advancements across multiple areas within AI. Looking ahead, it invites further exploration into the scalability of synthetic data creation, the potential for bias mitigation in generative models, and the exploration of diverse, domain-specific text sources. This research marks a pivotal step toward realizing the vast potential of generative AI in the effective training of complex models with reduced dependency on large-scale, real-world datasets.

Conclusion

In summary, this paper introduced a groundbreaking approach for enhancing VLM training efficiency and effectiveness through the generation of synthetic data. By leveraging the generative capacities of LLMs and image generation models, it provides a viable solution to the prevailing challenges of data scarcity, high curation costs, and computational inefficiency. The resulting performance gains and the promise of customizable, scalable datasets highlight the significant potential of this method to push the boundaries of what is possible in AI research and applications.

Authors (8)
  1. Sahand Sharifzadeh (18 papers)
  2. Christos Kaplanis (12 papers)
  3. Shreya Pathak (12 papers)
  4. Dharshan Kumaran (9 papers)
  5. Jovana Mitrovic (15 papers)
  6. Charles Blundell (54 papers)
  7. Andrea Banino (11 papers)
  8. Anastasija Ilic (2 papers)
Citations (4)