- The paper introduces a unified multi-task framework for text embeddings, integrating retrieval, NLI, and classification tasks with specialized data transformation techniques.
- It leverages advanced LLM-powered data synthesis, including paraphrasing, augmentation, and hard negative generation, to enhance semantic discrimination and model robustness.
- The model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, demonstrating scalability and effectiveness for retrieval-augmented generation, question answering, recommendation, and agent systems.
QZhou-Embedding: A Unified Multi-Task Framework for State-of-the-Art Text Embeddings
Introduction
QZhou-Embedding introduces a general-purpose contextual text embedding model built upon the Qwen2.5-7B-Instruct foundation. The model is designed to address the increasing demands for robust, versatile text representations in retrieval-augmented generation, question answering, recommendation, and agent systems. The report details a unified multi-task learning framework, advanced data transformation and synthesis strategies, and a two-stage training paradigm, culminating in state-of-the-art performance on both MTEB and CMTEB benchmarks.
Unified Multi-Task Learning Framework
The framework categorizes training data into three principal task types: retrieval, natural language inference (NLI), and classification. Each task type is supported by customized data transformation pipelines and loss functions, enabling the model to extract features from heterogeneous sources and optimize for multiple downstream tasks.
- Retrieval: Data is transformed from sources such as MS MARCO, news, QA, and claim-evidence datasets. The InfoNCE loss is augmented with query-query negative sampling, increasing the discriminative power of the embeddings.
- NLI: Semantic similarity and entailment datasets are reformulated into (text, text, score) triplets compatible with the CoSENT loss, which leverages ordinal label information for ranking-sensitive optimization.
- Classification: Example-based processing is employed, with in-batch negative sampling and a masking mechanism that prevents same-class samples from being treated as false negatives. The InfoNCE objective is retained, with label-based masking applied during loss computation (both objectives are sketched below).
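The report does not include reference code for these objectives, so the following is a minimal PyTorch sketch of the two loss families, assuming L2-normalized embeddings and an illustrative temperature and scale; the function names and exact masking logic are assumptions, and the query-query negative term used for retrieval is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def info_nce_with_label_mask(q, p, labels=None, temperature=0.05):
    """InfoNCE over in-batch negatives (retrieval and classification tasks).

    q, p: (B, d) L2-normalized query / positive embeddings.
    labels: optional (B,) class ids; off-diagonal same-class pairs are masked
            so they are not treated as false negatives (classification task).
    """
    sim = q @ p.t() / temperature                       # (B, B) scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives sit on the diagonal
    if labels is not None:
        same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
        off_diag = ~torch.eye(q.size(0), dtype=torch.bool, device=q.device)
        sim = sim.masked_fill(same_class & off_diag, float("-inf"))
    return F.cross_entropy(sim, targets)

def cosent_loss(cos_sim, scores, scale=20.0):
    """CoSENT-style ranking loss for (text, text, score) triplets (NLI/STS tasks).

    cos_sim: (N,) cosine similarity of each text pair.
    scores:  (N,) ordinal labels; a pair with a higher label should also
             receive a higher cosine similarity.
    """
    # diff[i, j] = scale * (cos_j - cos_i); it is penalized only where
    # score_i > score_j, i.e. where pair i should outrank pair j.
    diff = scale * (cos_sim.unsqueeze(0) - cos_sim.unsqueeze(1))
    keep = scores.unsqueeze(1) > scores.unsqueeze(0)
    diff = diff.masked_fill(~keep, float("-inf"))
    # log(1 + sum exp(diff)), computed as a logsumexp with an extra zero term.
    return torch.logsumexp(torch.cat([diff.flatten(), diff.new_zeros(1)]), dim=0)
```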
The architecture modifies the Qwen2.5-7B-Instruct model to use bi-directional attention and mean pooling, improving contextual representation; output embeddings are L2-normalized.
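As a concrete illustration, here is a minimal sketch of the mean-pooling and normalization step using the Hugging Face transformers API; the switch from causal to bi-directional attention requires patching the model's attention mask and is not shown, and the loading options and `embed` helper are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # base model named in the report
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (B, T, d) token representations
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1); 0 on padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # mean over real tokens
    return F.normalize(pooled, p=2, dim=-1)         # unit-length output vectors
```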
Data Synthesis Pipeline
To address data scarcity and improve generalization, QZhou-Embedding leverages LLM APIs for automated data synthesis across three dimensions:
- Paraphrasing: LLMs generate structurally diverse variants of queries and positives, ensuring semantic equivalence while introducing syntactic and grammatical variation.
- Augmentation: Semantic diversity is increased by prompting LLMs to expand (query, positive) pairs into different topics, aspects, and viewpoints, anchored in the original context.
- Hard Negative Generation: LLMs synthesize challenging negatives that are structurally and semantically similar to positives but deviate in relevance or aspect, maximizing discriminative challenge.
These strategies are applied selectively based on dataset size and task type, with paraphrasing and augmentation reserved for smaller datasets and hard negatives generated for retrieval tasks.
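For the hard-negative dimension, a sketch of what such a synthesis call might look like is shown below, assuming an OpenAI-compatible chat-completions client; the prompt wording, model name, and post-processing are illustrative, as the report does not disclose its exact prompts or API.

```python
from openai import OpenAI

client = OpenAI()  # API key is read from the environment

PROMPT = (
    "Given a query and a relevant passage, write one passage that is similar "
    "in topic, style, and length to the relevant passage but does NOT answer "
    "the query (it should differ in a key fact or aspect).\n\n"
    "Query: {query}\nRelevant passage: {positive}\nHard negative:"
)

def synthesize_hard_negative(query: str, positive: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM for one hard negative for a (query, positive) pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(query=query, positive=positive)}],
        temperature=0.9,  # higher temperature encourages surface variation across samples
    )
    return response.choices[0].message.content.strip()
```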
Training Optimization
Data Grouping Strategy
Training batches are constructed by sampling exclusively from single datasets, rather than mixing tasks or domains. Sampling weights are computed based on dataset size and an exponential scaling factor, ensuring domain-specific clustering and balanced representation.
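A minimal sketch of this grouping strategy follows, under the assumption that weights scale as a power of dataset size; the report states only that weights depend on dataset size and an exponential scaling factor, so the formula and the alpha value here are illustrative.

```python
import random

def dataset_sampling_weights(sizes: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Weight each dataset by size**alpha, then normalize to probabilities."""
    raw = {name: n ** alpha for name, n in sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

def sample_batch(datasets: dict[str, list], weights: dict[str, float], batch_size: int):
    """Draw every batch from a single dataset, never mixing tasks or domains."""
    name = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
    pool = datasets[name]
    return name, random.sample(pool, k=min(batch_size, len(pool)))
```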
Two-Stage Training Paradigm
- Stage 1: Retrieval-only training establishes a strong foundation for retrieval performance.
- Stage 2: Full-task fine-tuning integrates retrieval, NLI, and classification data, with a global control parameter η regulating the proportion of retrieval data (sketched after this list). This prevents degradation of retrieval performance when expanding to other tasks.
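A rough sketch of how an η-controlled mixture could be assembled is given below; the report states only that η fixes the retrieval share globally, so the mixing mechanics here are an assumption. In practice this pool would be combined with the single-dataset batching above, so η constrains the overall mixture rather than individual batches.

```python
import random

def stage2_mixture(retrieval, nli, classification, eta=0.6, seed=0):
    """Assemble a stage-2 training pool in which roughly a fraction eta is retrieval data."""
    assert 0.0 < eta < 1.0, "eta is the retrieval share of the mixed pool"
    rng = random.Random(seed)
    other = list(nli) + list(classification)
    # Solve n_retr / (n_retr + len(other)) == eta  =>  n_retr = eta / (1 - eta) * len(other)
    n_retr = min(len(retrieval), int(eta / (1 - eta) * len(other)))
    mixed = rng.sample(list(retrieval), n_retr) + other
    rng.shuffle(mixed)
    return mixed
```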
Full-parameter fine-tuning is employed throughout, eschewing LoRA or partial adaptation methods to maximize performance gains.
Experimental Results
QZhou-Embedding is trained on a diverse corpus exceeding 11M quadruples, incorporating major open-source datasets (MS MARCO, SQuAD, NQ, ELI5, MIRACL, etc.), high-quality triplets, and synthetically generated negatives. Data deduplication and contamination exclusion are rigorously applied.
On MTEB and CMTEB leaderboards, QZhou-Embedding achieves top-ranked average scores across all major task types:
- MTEB (English): Mean task score 75.97, mean task-type score 69.52, outperforming all prior models in pair classification and retrieval.
- CMTEB (Chinese): Mean task score 76.99, mean task-type score 78.58, with notable gains in pair classification and reranking.
The model demonstrates robust performance across classification, clustering, semantic similarity, and reranking, validating the effectiveness of the unified framework and data synthesis pipeline.
Implications and Future Directions
QZhou-Embedding establishes that data quality, diversity, and advanced synthesis are pivotal for advancing embedding model capabilities. The unified multi-task approach enables efficient cross-domain and cross-task optimization, while LLM-powered data augmentation and hard negative generation set new standards for training corpora.
Practically, the model is well-suited for deployment in retrieval-augmented generation, agent systems, and knowledge base construction, with strong real-time and long-context capabilities. The full-parameter fine-tuning and bi-directional attention modifications ensure scalability and adaptability to new domains.
Theoretically, the work underscores the importance of multi-task learning, dynamic data transformation, and automated data synthesis in embedding model research. Future developments may focus on multimodal and multilingual extensions, further integration with agent architectures, and exploration of more sophisticated synthesis and mining techniques.
Conclusion
QZhou-Embedding presents a comprehensive solution for general-purpose text embeddings, combining a unified multi-task framework, advanced data synthesis, and optimized training strategies. The model achieves state-of-the-art results on major benchmarks, demonstrating the critical role of data diversity and synthesis in embedding model advancement. Future work will extend these principles to multimodal and multilingual contexts, further enhancing the applicability and performance of embedding models in complex AI systems.