Little Giants: Synthesizing High-Quality Embedding Data at Scale (2410.18634v2)
Abstract: Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples, significantly reducing the cost of human annotation. However, most current approaches rely heavily on proprietary models like GPT-4, which are expensive and inefficient for generating large-scale embedding data. In this paper, we introduce SPEED, a framework that aligns small open-source models (8B) to efficiently generate large-scale synthetic embedding data. Through supervised fine-tuning, preference optimization, and self-improvement, SPEED enables small open-source models to produce high-quality data. Remarkably, SPEED uses less than one-tenth of the GPT API calls while outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data. Using this efficient generator, we conduct a comprehensive study of how various factors within the alignment pipeline impact data quality and reveal the scaling law for synthetic embedding data.
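To make the pipeline the abstract describes concrete, below is a minimal sketch of the kind of prompt-driven synthetic embedding-data generation it refers to: a small aligned generator is prompted once per task and asked to emit a (query, positive, hard negative) triplet. The prompt wording, field names, task list, and the `generate` callable are illustrative assumptions, not the paper's actual implementation.

```python
import json
from typing import Callable

# Hypothetical prompt template; the paper's real prompts and JSON schema may differ.
PROMPT_TEMPLATE = (
    "You are generating training data for a text-embedding model.\n"
    "Task: {task}\n"
    "Write a JSON object with exactly these fields: "
    '"user_query", "positive_document", "hard_negative_document". '
    "Return only the JSON object."
)

def synthesize_examples(tasks: list[str], generate: Callable[[str], str]) -> list[dict]:
    """Call a small open-source generator once per task and collect parsed triplets."""
    required = {"user_query", "positive_document", "hard_negative_document"}
    examples = []
    for task in tasks:
        raw = generate(PROMPT_TEMPLATE.format(task=task))
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed generations instead of failing the whole batch
        if required <= record.keys():
            examples.append(record)
    return examples

# Usage: pass any text-generation function, e.g. a wrapper around an 8B open model.
# triplets = synthesize_examples(["scientific paper retrieval"], my_generate_fn)
```

The parsed triplets would then feed a standard contrastive training setup for the embedding model; the filtering here (dropping unparseable or incomplete outputs) stands in for the quality controls that supervised fine-tuning, preference optimization, and self-improvement are meant to provide.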