QZhou-Embedding: Robust Text Representation
- QZhou-Embedding is a state-of-the-art contextual text embedding model that leverages bi-directional attention to capture comprehensive semantic information.
- It employs a unified multi-task training framework integrating retrieval, natural language inference, and classification with innovative LLM-based data synthesis.
- The model achieves leading benchmark results on MTEB and CMTEB, and its reproducible design supports further research and practical deployment.
QZhou-Embedding is a state-of-the-art contextual text embedding model designed for robust, multi-task text representation. Building on the Qwen2.5-7B-Instruct foundation model, it integrates advanced architectural modifications and multi-faceted training strategies to achieve leading performance on diverse benchmarks, notably the MTEB and CMTEB, where it ranks first as of August 27, 2025. Its design enables enhanced retrieval, clustering, reranking, and classification performance, supported by innovative approaches in data transformation and synthesis.
1. Architectural Principles and Embedding Mechanism
QZhou-Embedding modifies the original decoder-only causal attention structure of Qwen2.5-7B-Instruct by introducing bi-directional attention layers, which allow each token to attend to both left and right context within an input sequence. For downstream embedding tasks, token representations are aggregated via mean pooling followed by L2 normalization: $\mathbf{e} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}_i$ and $\hat{\mathbf{e}} = \mathbf{e} / \|\mathbf{e}\|_2$, where $n$ denotes the token count and $\mathbf{h}_i$ the hidden state of the $i$-th token. This architecture ensures that the embedding vector incorporates comprehensive semantic information, extending the context-awareness originally afforded by Qwen2.5-7B-Instruct.
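The pooling step can be sketched in a few lines of plain NumPy; the function name and shapes below are illustrative assumptions, not the released implementation:

```python
import numpy as np

def embed(token_states: np.ndarray) -> np.ndarray:
    """Mean-pool per-token hidden states and L2-normalize the result.

    token_states: (n, d) array of hidden states produced by the
    bi-directional attention layers, where n is the token count.
    """
    pooled = token_states.mean(axis=0)      # e = (1/n) * sum_i h_i
    return pooled / np.linalg.norm(pooled)  # e_hat = e / ||e||_2
```

Because the output is unit-normalized, downstream similarity between two embeddings reduces to a dot product.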
2. Data Transformation and Multi-Task Framework
Within a unified multi-task learning framework, QZhou-Embedding employs specialized data transformation schemes to standardize and enhance training samples for retrieval, natural language inference (NLI), and classification:
- Retrieval Tasks: Heterogeneous samples (e.g., title-body, claim-evidence, QA pairs) are normalized into query–document pairs, with long documents truncated appropriately.
- NLI Tasks: Semantic similarity and entailment data are cast into (sentence pair, label) triplets, mapping labels to ordinal scores (e.g., 1 for a matching pair; 2/1/0 for entailment/neutral/contradiction), and optimized with ranking-sensitive losses such as Cosent.
- Classification Tasks: Samples are grouped by class, with intra-class pairs as positives and inter-class pairs as negatives, supported by an in-batch class masking mechanism to prevent false negatives.
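The in-batch class masking for classification data can be pictured as a boolean mask over batch labels that excludes same-class pairs from the negative pool; this is an illustrative reconstruction, not the released code:

```python
import numpy as np

def negative_mask(labels: np.ndarray) -> np.ndarray:
    """Return a (B, B) boolean mask where entry (i, j) is True iff
    sample j may serve as an in-batch negative for sample i.

    Same-class pairs (including the diagonal) are masked out, so a
    sample from the same class is never treated as a false negative.
    """
    same_class = labels[:, None] == labels[None, :]
    return ~same_class
```

Applied before the contrastive loss, the mask zeroes out logits for same-class pairs so only genuine inter-class pairs contribute as negatives.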
The multi-task loss functions include an InfoNCE-based retrieval loss, $\mathcal{L}_{\text{ret}} = -\log \frac{\exp(\mathrm{sim}(q, d^{+})/\tau)}{\exp(\mathrm{sim}(q, d^{+})/\tau) + \sum_{d^{-}} \exp(\mathrm{sim}(q, d^{-})/\tau)}$, where $\mathrm{sim}(\cdot,\cdot)$ is a similarity function (typically cosine) and $\tau$ a temperature, and the Cosent ranking loss for NLI.
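A minimal NumPy rendering of an InfoNCE retrieval loss for a single query follows; the use of cosine similarity and the temperature value are assumptions for illustration:

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.05):
    """InfoNCE loss for one query against its positive document and a
    list of negative documents (all 1-D embedding vectors)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Logit 0 is the positive; the rest are negatives.
    logits = np.array([cos(query, positive)] +
                      [cos(query, n) for n in negatives]) / tau
    logits -= logits.max()  # subtract max for numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

The loss approaches zero when the query is far closer to its positive than to any negative, and grows as negatives become competitive.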
3. Data Synthesis Pipeline
Recognizing the centrality of data diversity and quality, QZhou-Embedding incorporates a data synthesis pipeline that uses LLM APIs for three augmentation modalities:
- Paraphrasing: Structural diversity is achieved by LLM rewriting of queries and positives, varying syntax and word order while preserving semantics.
- Semantic Augmentation: The LLM expands context, introduces alternative perspectives, or elaborates on related semantic fields as additional positives.
- Hard Negative Generation: Negative samples are produced by LLMs that mimic the structure or topic domain of positives but diverge in meaning, increasing task difficulty.
This approach systematically refines the training corpus for greater semantic richness and sample difficulty, underpinning the model's generalization capacity.
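The three augmentation modalities can be pictured as prompt templates dispatched to an LLM API. The templates and the `build_prompt` helper below are hypothetical illustrations; the actual prompts and API used by the authors are not reproduced here:

```python
# Hypothetical prompt templates, one per augmentation modality.
PROMPTS = {
    "paraphrase": (
        "Rewrite the following query so it keeps the same meaning but "
        "changes its syntax and word order:\n{text}"
    ),
    "semantic_augmentation": (
        "Expand the following passage with additional context or an "
        "alternative perspective on the same topic:\n{text}"
    ),
    "hard_negative": (
        "Write a passage that mimics the structure and topic of the "
        "following positive example but differs in meaning:\n{text}"
    ),
}

def build_prompt(kind: str, text: str) -> str:
    """Fill the template for one augmentation modality."""
    return PROMPTS[kind].format(text=text)
```

Each generated sample would then be paired back with its source query or positive to enrich the training corpus.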
4. Two-Stage Training Strategy
QZhou-Embedding's training proceeds in two distinct phases:
- Stage 1: Retrieval-Focused Pretraining — Training on curated retrieval examples establishes robust mapping from queries to relevant document embeddings.
- Stage 2: Full-Task Fine-Tuning — Combines retrieval, NLI, and classification data with a controlled sampling ratio parameter that preserves retrieval efficacy, facilitating simultaneous optimization for multiple downstream objectives.
This regimen ensures that the model delivers high-quality embeddings both for core retrieval and for broader semantic tasks.
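The Stage 2 mixing can be sketched as follows; `sample_batch`, `retrieval_ratio`, and the default values are illustrative names and settings, not the paper's actual parameters:

```python
import random

def sample_batch(retrieval, nli, classification,
                 retrieval_ratio=0.6, batch_size=32, seed=0):
    """Draw one Stage-2 training batch with a controlled share of
    retrieval samples; the remainder is drawn from NLI and
    classification data."""
    rng = random.Random(seed)
    n_ret = int(batch_size * retrieval_ratio)  # retrieval share preserved
    others = nli + classification              # remaining tasks fill the rest
    return ([rng.choice(retrieval) for _ in range(n_ret)] +
            [rng.choice(others) for _ in range(batch_size - n_ret)])
```

Keeping the retrieval share fixed during full-task fine-tuning is what prevents the Stage 1 retrieval capability from degrading as new objectives are added.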
5. Benchmark Evaluation and Task Performance
Comprehensive evaluation demonstrates that QZhou-Embedding achieves leading results across MTEB and CMTEB, with coverage of subtasks such as classification, clustering, pair classification, reranking, and semantic textual similarity (STS). The report provides quantitative tables showing superior average scores per task and aggregate ranking, validating the model's broad applicability.
6. Design Implications and Future Research Directions
Empirical findings highlight the decisive role of high-quality, diverse data for embedding models and show that integrating generative LLMs into synthesis pipelines yields significant gains. The technical report proposes continuing research along multimodal and multilingual axes, and deploying embedding models as foundational components for agent memory and retrieval augmentation systems.
A plausible implication is that bi-directional architectures paired with synthetic data augmentation will remain central themes in future advances, particularly for agent-oriented and cross-lingual embedding scenarios.
7. Model Accessibility and Reproducibility
QZhou-Embedding is distributed under the Apache 2.0 license via HuggingFace (https://huggingface.co/Kingsoft-LLM/QZhou-Embedding), with full evaluation code and usage instructions available on GitHub (https://github.com/Kingsoft-LLM/QZhou-Embedding). These resources facilitate reproducibility across academic and industrial research contexts, supporting further experimentation and integration.
QZhou-Embedding represents a comprehensive evolution in text representation, encompassing architectural advances, enriched data synthesis, and unified multi-task optimization (Yu et al., 29 Aug 2025). Its release marks a substantial contribution to the embedding model landscape, with reproducibility provisions enabling ongoing research and practical deployment.