Introduction
The development of text embeddings has been instrumental in advancing natural language processing. These embeddings represent text as low-dimensional vectors that support efficient matching between texts and are widely used in retrieval, clustering, and classification. Although pre-trained language models such as BERT and GPT produce transferable text representations, their off-the-shelf outputs are suboptimal when a single vector per text is required. The paper introduces an approach for producing high-quality text embeddings through contrastive pre-training with weak supervision.
Data Curation and Methodology
The cornerstone of this approach is the dataset, termed CCPairs: a large-scale collection of text pairs harvested from semi-structured web sources and filtered for quality with a consistency-based approach. This dataset supports contrastive learning, in which the model learns to distinguish a relevant text pair from the many irrelevant pairings available within a large batch. Leveraging weak supervision from heterogeneous sources such as CommunityQA, Common Crawl, and scientific papers, the resulting model, E5, is trained contrastively with in-batch negatives.
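To make the training objective concrete, the following is a minimal PyTorch sketch of contrastive learning with in-batch negatives (an InfoNCE-style loss). It is an illustration under stated assumptions, not the authors' implementation: the backbone checkpoint, the mean-pooling choice, and the temperature value are placeholders.

```python
# Minimal sketch of contrastive training with in-batch negatives (InfoNCE).
# Not the authors' exact code: the backbone, pooling, and temperature are
# illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Tokenize and mean-pool the last hidden states into one vector per text.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, H)
    return F.normalize(pooled, dim=-1)

def info_nce_loss(queries, passages, temperature=0.05):
    # Each query is paired with the passage at the same batch index;
    # every other passage in the batch acts as a negative.
    q = embed(queries)                # (B, H)
    p = embed(passages)               # (B, H)
    logits = q @ p.T / temperature    # (B, B) cosine similarities
    labels = torch.arange(q.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce_loss(
    ["how do solar panels work", "capital of france"],
    ["Solar panels convert sunlight into electricity using photovoltaic cells.",
     "Paris is the capital and most populous city of France."],
)
loss.backward()
```

With a batch of B pairs, each query sees one positive and B-1 in-batch negatives, so the number of negatives grows with batch size, which is why training within a large batch of examples matters here.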
Model Performance
E5 is evaluated on the BEIR and MTEB benchmarks. Notably, without relying on any labeled data, E5 outperforms the strong BM25 baseline on the BEIR zero-shot retrieval benchmark. When fine-tuned with labeled data, its performance improves further, surpassing embedding models with far more parameters. The fine-tuning stage mixes several labeled datasets, injecting human supervision that sharpens the embeddings for downstream tasks.
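Benchmark numbers of this kind can be reproduced with the public mteb toolkit. The snippet below is a minimal sketch, assuming a SentenceTransformer-compatible checkpoint (the name intfloat/e5-base follows the public model cards) and two illustrative tasks rather than the full benchmark; the task-list API shown may differ across mteb versions.

```python
# Minimal sketch of benchmark evaluation with the mteb toolkit.
# The checkpoint name and task list are illustrative assumptions, not the
# paper's exact setup; E5 checkpoints also expect "query: "/"passage: "
# prefixes, which are omitted here for brevity.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base")

# One retrieval task (also part of BEIR) and one classification task from MTEB.
evaluation = MTEB(tasks=["SciFact", "Banking77Classification"])
results = evaluation.run(model, output_folder="results/e5-base")
print(results)
```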
Applications and Analysis
The core contribution, the E5 model, is a versatile and efficient encoder for tasks that require a single-vector text representation. It performs well on zero-shot retrieval, few-shot and zero-shot text classification, semantic textual similarity, and text clustering. In summary, E5 sets a strong baseline for general-purpose text embeddings, delivering empirical gains across a wide range of applications while using far fewer parameters than several larger competing models. Whether state-of-the-art embeddings can be achieved from self-supervision alone, however, remains an open question.
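To ground the single-vector usage described above, the sketch below scores passages against a query by cosine similarity, the way the embeddings would serve zero-shot retrieval. The checkpoint name and the "query: "/"passage: " prefixes follow the public E5 model cards and should be treated as assumptions of this example.

```python
# Sketch of zero-shot retrieval with a single-vector embedding model.
# The checkpoint name and the "query: "/"passage: " prefixes are assumptions
# taken from the public E5 model cards, not the paper's code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base")

query = "query: how do solar panels work"
passages = [
    "passage: Solar panels convert sunlight into electricity using photovoltaic cells.",
    "passage: The stock market closed higher on Friday after a volatile week.",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# With normalized vectors, dot product equals cosine similarity.
scores = util.cos_sim(q_emb, p_emb)  # shape (1, 2)
best = scores.argmax().item()
print(passages[best], scores[0, best].item())
```

The same embeddings can be reused unchanged for classification (as features), semantic textual similarity (pairwise cosine scores), or clustering (e.g., k-means over the vectors), which is what makes a single general-purpose encoder attractive.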