Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models (2506.05176v3)

Published 5 Jun 2025 in cs.CL

Abstract: In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.

Summary

  • The paper introduces the Qwen3 Embedding series, which enhances text embedding and reranking through a multi-stage training process on diverse multilingual datasets.
  • It employs dense network architectures with causal attention and supervised fine-tuning, achieving state-of-the-art results on MTEB benchmarks including multilingual and code retrieval.
  • The study highlights the effectiveness of synthetic data and instruction-based model merging, paving the way for further advancements in LLM-driven text processing.

Overview of Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

The paper "Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models" introduces the Qwen3 Embedding series, a novel and substantial improvement over the GTE-Qwen series in text embedding and reranking applications, developed upon the Qwen3 foundation models. This work exemplifies leveraging LLMs' (LLMs) capabilities in multilingual text understanding and generation through a multi-stage training pipeline that combines large-scale unsupervised pre-training and supervised fine-tuning using high-quality datasets. This strategy optimizes the model's robustness and adaptability for diverse scenarios.

Model Architecture and Training

The Qwen3 Embedding series consists of dense models built on the Qwen3 foundation models, available in 0.6B, 4B, and 8B parameter variants. The models are tailored for embedding and reranking tasks via an LLM-driven training protocol in which the Qwen3 LLMs synthesize high-quality training data across multiple domains and languages. Training proceeds in several stages: large-scale unsupervised pre-training, supervised fine-tuning, and model merging. Together these stages yield robust models suited to a range of deployment settings. The embedding models derive representations through a causal attention mechanism, while the reranking models follow an instruction-aware template to assess text relevance.
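As a concrete illustration, the following sketch shows how embeddings can be extracted from a causal-attention model by pooling the hidden state of the final token. The Hugging Face model identifier, the instruction prefix, and the pooling details are assumptions for illustration and may differ from the released models' recommended usage.

```python
# Illustrative sketch (not the official usage): embed texts with a causal
# decoder by taking the hidden state of the last token as the representation.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen3-Embedding-0.6B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(model_name).eval()

texts = [
    "Instruct: Retrieve passages relevant to the query\nQuery: what is text embedding?",
    "Text embedding maps text into dense vectors for retrieval and similarity search.",
]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # [batch, seq_len, dim]
    # With left padding, position -1 is the final real token of every sequence.
    emb = F.normalize(hidden[:, -1, :], p=2, dim=-1)

print(float(emb[0] @ emb[1]))                        # cosine similarity of the pair
```

Prefixing the query (but not the document) with a task instruction mirrors the instruction-aware design described above.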

This configuration lets the text embeddings encode semantic relationships while the rerankers score documents by their relevance to a query. During pre-training, synthetic data generated with the LLMs supports broad generalization across downstream tasks, and the multi-task fine-tuning data further improves embedding quality on tasks such as retrieval and semantic textual similarity (STS).
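To make the reranking side concrete, here is a hedged sketch of one common way an instruction-aware causal LM can score query-document relevance: prompt it to answer "yes" or "no" and compare the next-token probabilities of the two answers. The model identifier and prompt template below are illustrative assumptions, not the released rerankers' exact interface.

```python
# Illustrative sketch: relevance scoring with a causal-LM reranker via
# next-token probabilities of "yes" vs "no". Template and model name assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Reranker-0.6B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def relevance_score(query: str, document: str) -> float:
    prompt = (
        "Judge whether the Document answers the Query. Answer yes or no.\n"
        f"Query: {query}\nDocument: {document}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Assumes "yes"/"no" each start with a single vocabulary token.
    yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("no", add_special_tokens=False)[0]
    probs = torch.softmax(next_logits[[yes_id, no_id]], dim=0)
    return probs[0].item()                           # probability mass on "yes"

print(relevance_score("capital of France", "Paris is the capital of France."))
```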

Empirical Evaluation

Empirical evaluation shows that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, the models excel on the multilingual MTEB evaluation and show significant improvements in retrieval tasks, including code retrieval, cross-lingual retrieval, and multilingual retrieval. For example, the flagship model, Qwen3-Embedding-8B, scores 70.58 on the MTEB Multilingual benchmark and 80.68 on the MTEB Code benchmark, surpassing the previous leading model, Gemini-Embedding.

Discussion and Future Directions

The work has several implications. Theoretically, it illustrates how LLM capabilities can be exploited for text embedding and reranking through model merging and synthetic data. Practically, the models are versatile: they support adjustable embedding dimensions and customizable instructions for adapting to downstream tasks. The improved performance points to broader applicability across NLP tasks and underscores the value of multi-stage training and synthetic data.
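As a small sketch of the adjustable-dimension idea, Matryoshka-style truncation keeps the leading components of a full embedding and renormalizes; the dimensions below are placeholders, and whether quality is preserved under truncation depends on how a given checkpoint was trained.

```python
# Illustrative sketch: reduce embedding dimensionality by truncation plus
# renormalization (Matryoshka-style). Dimensions here are placeholders.
import torch
import torch.nn.functional as F

def truncate_embedding(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components and renormalize to unit length."""
    return F.normalize(emb[..., :dim], p=2, dim=-1)

full = F.normalize(torch.randn(8, 4096), p=2, dim=-1)   # stand-in for model output
small = truncate_embedding(full, 256)                    # cheaper 256-dim variant
print(small.shape)                                       # torch.Size([8, 256])
```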

Future research could explore deeper LLM integration for dataset synthesis, improved model-merging techniques, and more nuanced instruction-based model development. These directions promise further advances in adapting LLMs to emerging needs in text processing and retrieval systems. Overall, the Qwen3 Embedding series represents a significant step towards more efficient and scalable embedding models that meet evolving NLP demands.