Unknown Training Procedures for Proprietary API Embedding Models

Ascertain the training procedures of proprietary API embedding models, specifically OpenAI Text-Embedding-v3-Large, Cohere v3 English, and Google Gecko, including whether these models were trained on instruction-following data, to accurately categorize and evaluate their instruction-following capabilities in information retrieval settings.

Background

The FollowIR paper (Weller et al., 2024) evaluates several API-based embedding models alongside open-source systems to measure instruction-following in information retrieval. The authors note that key details of the API models' training are not publicly disclosed, including whether instruction tuning was used. This uncertainty leads them to place these models in a distinct evaluation category, and it complicates interpreting their performance on instruction-following benchmarks.

Clarifying whether and how these models were trained on instructions would help the community understand their capabilities, compare them fairly with open-source instruction-tuned models, and draw reliable conclusions about instruction-following behavior in IR. The authors note one exception: according to Google's technical report, Gecko was explicitly trained with instructions. Comparable details for the other API models remain undisclosed.

References

It is mostly unknown what these models' training procedures were---including if they were trained on instructions or not---thus we place them in a distinct category. However, we note that Google's model did explicitly train with instructions, as mentioned in their technical report.

— FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions (2403.15246 - Weller et al., 2024) in Section 4.1 (Evaluation Settings), API Models paragraph
