Little Giants: Synthesizing High-Quality Embedding Data at Scale (2410.18634v2)
Abstract: Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples, significantly reducing the cost of human annotation. However, most current approaches rely heavily on proprietary models like GPT-4, which are expensive and inefficient for generating large-scale embedding data. In this paper, we introduce SPEED, a framework that aligns small open-source models (8B) to efficiently generate large-scale synthetic embedding data. Through supervised fine-tuning, preference optimization, and self-improvement, SPEED enables small open-source models to produce high-quality data. Remarkably, SPEED uses less than one-tenth of the GPT API calls while outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data. Using this efficient generator, we conduct a comprehensive study of how various factors within the alignment pipeline impact data quality and reveal the scaling law for synthetic embedding data.
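To make the pipeline the abstract describes concrete, below is a minimal sketch of the kind of prompt-driven synthetic embedding-data generation it refers to: a small aligned generator is prompted once per task and asked to emit a (query, positive, hard negative) triplet. The prompt wording, field names, task list, and the `generate` callable are illustrative assumptions, not the paper's actual implementation.

```python
import json
from typing import Callable

# Hypothetical prompt template; the paper's real prompts and JSON schema may differ.
PROMPT_TEMPLATE = (
    "You are generating training data for a text-embedding model.\n"
    "Task: {task}\n"
    "Write a JSON object with exactly these fields: "
    '"user_query", "positive_document", "hard_negative_document". '
    "Return only the JSON object."
)

def synthesize_examples(tasks: list[str], generate: Callable[[str], str]) -> list[dict]:
    """Call a small open-source generator once per task and collect parsed triplets."""
    required = {"user_query", "positive_document", "hard_negative_document"}
    examples = []
    for task in tasks:
        raw = generate(PROMPT_TEMPLATE.format(task=task))
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed generations instead of failing the whole batch
        if required <= record.keys():
            examples.append(record)
    return examples

# Usage: pass any text-generation function, e.g. a wrapper around an 8B open model.
# triplets = synthesize_examples(["scientific paper retrieval"], my_generate_fn)
```

The parsed triplets would then feed a standard contrastive training setup for the embedding model; the filtering here (dropping unparseable or incomplete outputs) stands in for the quality controls that supervised fine-tuning, preference optimization, and self-improvement are meant to provide.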