- The paper introduces a unified multi-task framework for text embeddings, integrating retrieval, NLI, and classification tasks with specialized data transformation techniques.
- It leverages advanced LLM-powered data synthesis, including paraphrasing, augmentation, and hard negative generation, to enhance semantic discrimination and model robustness.
- The model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, demonstrating scalability and effectiveness for retrieval-augmented generation, question answering, recommendation, and agent systems.
QZhou-Embedding: A Unified Multi-Task Framework for State-of-the-Art Text Embeddings
Introduction
QZhou-Embedding introduces a general-purpose contextual text embedding model built upon the Qwen2.5-7B-Instruct foundation. The model is designed to address the increasing demands for robust, versatile text representations in retrieval-augmented generation, question answering, recommendation, and agent systems. The report details a unified multi-task learning framework, advanced data transformation and synthesis strategies, and a two-stage training paradigm, culminating in state-of-the-art performance on both MTEB and CMTEB benchmarks.
Unified Multi-Task Learning Framework
The framework categorizes training data into three principal task types: retrieval, natural language inference (NLI), and classification. Each task type is supported by customized data transformation pipelines and loss functions, enabling the model to extract features from heterogeneous sources and optimize for multiple downstream tasks.
- Retrieval: Data is transformed from sources such as MS MARCO, news, QA, and claim-evidence datasets. The InfoNCE loss is augmented with query-query negative sampling, increasing the discriminative power of the embeddings.
- NLI: Semantic similarity and entailment datasets are reformulated into (text, text, score) triplets compatible with the CoSENT loss, which leverages ordinal label information for ranking-sensitive optimization.
- Classification: Example-based processing is employed, with in-batch negative sampling and a masking mechanism that prevents same-class samples from being treated as false negatives. The InfoNCE objective is retained, with label-based masking applied during loss computation (both objectives are sketched below).
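The report does not include reference code for these objectives, so the following is a minimal PyTorch sketch of the two loss families, assuming L2-normalized embeddings and an illustrative temperature and scale; the function names and exact masking logic are assumptions, and the query-query negative term used for retrieval is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def info_nce_with_label_mask(q, p, labels=None, temperature=0.05):
    """InfoNCE over in-batch negatives (retrieval and classification tasks).

    q, p: (B, d) L2-normalized query / positive embeddings.
    labels: optional (B,) class ids; off-diagonal same-class pairs are masked
            so they are not treated as false negatives (classification task).
    """
    sim = q @ p.t() / temperature                       # (B, B) scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives sit on the diagonal
    if labels is not None:
        same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
        off_diag = ~torch.eye(q.size(0), dtype=torch.bool, device=q.device)
        sim = sim.masked_fill(same_class & off_diag, float("-inf"))
    return F.cross_entropy(sim, targets)

def cosent_loss(cos_sim, scores, scale=20.0):
    """CoSENT-style ranking loss for (text, text, score) triplets (NLI/STS tasks).

    cos_sim: (N,) cosine similarity of each text pair.
    scores:  (N,) ordinal labels; a pair with a higher label should also
             receive a higher cosine similarity.
    """
    # diff[i, j] = scale * (cos_j - cos_i); it is penalized only where
    # score_i > score_j, i.e. where pair i should outrank pair j.
    diff = scale * (cos_sim.unsqueeze(0) - cos_sim.unsqueeze(1))
    keep = scores.unsqueeze(1) > scores.unsqueeze(0)
    diff = diff.masked_fill(~keep, float("-inf"))
    # log(1 + sum exp(diff)), computed as a logsumexp with an extra zero term.
    return torch.logsumexp(torch.cat([diff.flatten(), diff.new_zeros(1)]), dim=0)
```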
The architecture modifies the Qwen2.5-7B-Instruct model to use bi-directional attention and mean pooling, improving contextual representation; output embeddings are L2-normalized.
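As a concrete illustration, here is a minimal sketch of the mean-pooling and normalization step using the Hugging Face transformers API; the switch from causal to bi-directional attention requires patching the model's attention mask and is not shown, and the loading options and `embed` helper are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # base model named in the report
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (B, T, d) token representations
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1); 0 on padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # mean over real tokens
    return F.normalize(pooled, p=2, dim=-1)         # unit-length output vectors
```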
Data Synthesis Pipeline
To address data scarcity and improve generalization, QZhou-Embedding leverages LLM APIs for automated data synthesis across three dimensions:
- Paraphrasing: LLMs generate structurally diverse variants of queries and positives, ensuring semantic equivalence while introducing syntactic and grammatical variation.
- Augmentation: Semantic diversity is increased by prompting LLMs to expand (query, positive) pairs into different topics, aspects, and viewpoints, anchored in the original context.
- Hard Negative Generation: LLMs synthesize challenging negatives that are structurally and semantically similar to positives but deviate in relevance or aspect, maximizing discriminative challenge.
These strategies are applied selectively based on dataset size and task type, with paraphrasing and augmentation reserved for smaller datasets and hard negatives generated for retrieval tasks.
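For the hard-negative dimension, a sketch of what such a synthesis call might look like is shown below, assuming an OpenAI-compatible chat-completions client; the prompt wording, model name, and post-processing are illustrative, as the report does not disclose its exact prompts or API.

```python
from openai import OpenAI

client = OpenAI()  # API key is read from the environment

PROMPT = (
    "Given a query and a relevant passage, write one passage that is similar "
    "in topic, style, and length to the relevant passage but does NOT answer "
    "the query (it should differ in a key fact or aspect).\n\n"
    "Query: {query}\nRelevant passage: {positive}\nHard negative:"
)

def synthesize_hard_negative(query: str, positive: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM for one hard negative for a (query, positive) pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(query=query, positive=positive)}],
        temperature=0.9,  # higher temperature encourages surface variation across samples
    )
    return response.choices[0].message.content.strip()
```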
Training Optimization
Data Grouping Strategy
Training batches are constructed by sampling exclusively from single datasets, rather than mixing tasks or domains. Sampling weights are computed based on dataset size and an exponential scaling factor, ensuring domain-specific clustering and balanced representation.
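A minimal sketch of this grouping strategy follows, under the assumption that weights scale as a power of dataset size; the report states only that weights depend on dataset size and an exponential scaling factor, so the formula and the alpha value here are illustrative.

```python
import random

def dataset_sampling_weights(sizes: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Weight each dataset by size**alpha, then normalize to probabilities."""
    raw = {name: n ** alpha for name, n in sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

def sample_batch(datasets: dict[str, list], weights: dict[str, float], batch_size: int):
    """Draw every batch from a single dataset, never mixing tasks or domains."""
    name = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
    pool = datasets[name]
    return name, random.sample(pool, k=min(batch_size, len(pool)))
```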
Two-Stage Training Paradigm
- Stage 1: Retrieval-only training establishes a strong foundation for retrieval performance.
- Stage 2: Full-task fine-tuning integrates retrieval, NLI, and classification data, with a global control parameter η regulating the proportion of retrieval data (sketched after this list). This prevents degradation of retrieval performance when expanding to other tasks.
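A rough sketch of how an η-controlled mixture could be assembled is given below; the report states only that η fixes the retrieval share globally, so the mixing mechanics here are an assumption. In practice this pool would be combined with the single-dataset batching above, so η constrains the overall mixture rather than individual batches.

```python
import random

def stage2_mixture(retrieval, nli, classification, eta=0.6, seed=0):
    """Assemble a stage-2 training pool in which roughly a fraction eta is retrieval data."""
    assert 0.0 < eta < 1.0, "eta is the retrieval share of the mixed pool"
    rng = random.Random(seed)
    other = list(nli) + list(classification)
    # Solve n_retr / (n_retr + len(other)) == eta  =>  n_retr = eta / (1 - eta) * len(other)
    n_retr = min(len(retrieval), int(eta / (1 - eta) * len(other)))
    mixed = rng.sample(list(retrieval), n_retr) + other
    rng.shuffle(mixed)
    return mixed
```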
Full-parameter fine-tuning is employed throughout, eschewing LoRA or partial adaptation methods to maximize performance gains.
Experimental Results
QZhou-Embedding is trained on a diverse corpus exceeding 11M quadruples, incorporating major open-source datasets (MS MARCO, SQuAD, NQ, ELI5, MIRACL, etc.), high-quality triplets, and synthetically generated negatives. Data deduplication and contamination exclusion are rigorously applied.
On MTEB and CMTEB leaderboards, QZhou-Embedding achieves top-ranked average scores across all major task types:
- MTEB (English): Mean task score 75.97, mean task-type score 69.52, outperforming all prior models in pair classification and retrieval.
- CMTEB (Chinese): Mean task score 76.99, mean task-type score 78.58, with notable gains in pair classification and reranking.
The model demonstrates robust performance across classification, clustering, semantic similarity, and reranking, validating the effectiveness of the unified framework and data synthesis pipeline.
Implications and Future Directions
QZhou-Embedding establishes that data quality, diversity, and advanced synthesis are pivotal for advancing embedding model capabilities. The unified multi-task approach enables efficient cross-domain and cross-task optimization, while LLM-powered data augmentation and hard negative generation set new standards for training corpora.
Practically, the model is well-suited for deployment in retrieval-augmented generation, agent systems, and knowledge base construction, with strong real-time and long-context capabilities. The full-parameter fine-tuning and bi-directional attention modifications ensure scalability and adaptability to new domains.
Theoretically, the work underscores the importance of multi-task learning, dynamic data transformation, and automated data synthesis in embedding model research. Future developments may focus on multimodal and multilingual extensions, further integration with agent architectures, and exploration of more sophisticated synthesis and mining techniques.
Conclusion
QZhou-Embedding presents a comprehensive solution for general-purpose text embeddings, combining a unified multi-task framework, advanced data synthesis, and optimized training strategies. The model achieves state-of-the-art results on major benchmarks, demonstrating the critical role of data diversity and synthesis in embedding model advancement. Future work will extend these principles to multimodal and multilingual contexts, further enhancing the applicability and performance of embedding models in complex AI systems.