KaLM-Embedding-V2
- KaLM-Embedding-V2 is a compact, multilingual text embedding model that achieves state-of-the-art performance on MTEB benchmarks for Chinese and English, often matching much larger models.
- It utilizes architectural innovations like a fully bidirectional Transformer base and advanced training techniques including focal-style loss reweighting and online hard-negative mixing.
- The model's strong performance-to-size ratio and multilingual capability make it suitable for efficient deployment in applications like dense retrieval and RAG systems.
KaLM-Embedding-V2 is a versatile, compact, multilingual text embedding model designed for general-purpose sentence, document, and instruction embedding. It achieves state-of-the-art performance on the Massive Text Embedding Benchmark (MTEB) for both Chinese and English, frequently rivaling models many times larger in parameter count. Developed by the HIT-SCIR/HITsz-TMG group, KaLM-Embedding-V2 introduces innovations at both the architectural and training pipeline levels to enhance semantic representation, generalization, and efficiency.
1. Architectural Modifications
KaLM-Embedding-V2 is built on Qwen2-0.5B, an LLM that originally uses a causal (autoregressive) attention mask. Unlike its predecessors, KaLM-Embedding-V2 removes this causal mask, turning the base model into a fully bidirectional Transformer suited to representation learning, where each token may attend to all other tokens in both directions. This design aligns the model's inductive bias with the requirements of embedding (holistic sequence understanding), in contrast to the unidirectional context used for next-token prediction.
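To make the distinction concrete, the following is a minimal PyTorch sketch (not taken from the model's codebase) contrasting causal and bidirectional self-attention using `torch.nn.functional.scaled_dot_product_attention`; the tensor shapes are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder query/key/value tensors: (batch, heads, seq_len, head_dim).
q = k = v = torch.randn(1, 8, 16, 64)

# Causal attention: token i may only attend to tokens <= i (next-token prediction).
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Bidirectional attention: every token attends to the full sequence,
# the setting used once the causal mask is removed for embedding.
bidirectional_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print(causal_out.shape, bidirectional_out.shape)  # both torch.Size([1, 8, 16, 64])
```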
After token representations $\mathbf{H} = [\mathbf{h}_1, \dots, \mathbf{h}_n] \in \mathbb{R}^{n \times d}$ are generated for an input sequence of length $n$, mean pooling is used to derive the final fixed-length embedding:

$$\mathbf{e} = \mathrm{MeanPool}(\mathbf{H}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i,$$

where $\mathbf{H}$ is a matrix of size $n \times d$ (sequence length × hidden size) and $\mathrm{MeanPool}(\cdot)$ denotes mean pooling. This approach is empirically shown to provide robust performance across diverse tasks.
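As a concrete illustration of the formula above, here is a mask-aware mean-pooling sketch in PyTorch; the function name and signature are illustrative rather than taken from the released code.

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean pooling over token representations.

    hidden_states: (batch, seq_len, hidden_size) last-layer token embeddings.
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (B, L, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # number of real tokens
    return summed / counts                                       # (B, hidden_size)
```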
Typical input formatting includes an explicit instruction, enabling unification of multiple datasets and tasks:
```
Instruct: {task instruction} Query: {query}
```
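A small sketch of how such an instruction-augmented input might be assembled; the exact template and separators used by the released checkpoints are an assumption here.

```python
def format_query(task_instruction: str, query: str) -> str:
    # Illustrative prompt construction; the released model's template may differ.
    return f"Instruct: {task_instruction} Query: {query}"

print(format_query(
    "Given a web search query, retrieve relevant passages that answer it.",
    "how does mean pooling work in embedding models?",
))
```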
2. Data Curation and Preparation
KaLM-Embedding-V2's performance is attributed in large part to the scale and diversity of its training data. A multi-stage data pipeline is constructed:
- Pre-training Data: Spanning over 20 categories, the weakly supervised, open-domain corpora include Amazon Reviews, CC-News, NLLB, Wikipedia, CSL, Wudao, THUCNews, Zhihu-KOL, Reddit, StackExchange, S2ORC, CodeSearchNet, and PAQ. Multiple languages are present, with a focus on Chinese and English.
- Fine-tuning Data: Over 100 high-quality, annotated datasets are used, covering retrieval, reranking, classification, clustering, semantic similarity, and paraphrase tasks. Examples include ELI5, HotpotQA, MSMARCO, DuReader, Multi-CPR, RefGPT, SNLI, PAWS-X, and MIRACL. Dataset curation ensures no leakage into MTEB evaluation splits.
Each instance is augmented with task-specific instructions, supporting the model’s generalization across tasks and datasets.
3. Training Techniques and Loss Functions
KaLM-Embedding-V2 employs a three-stage training process:
3.1. Pre-training
Pre-training leverages the InfoNCE contrastive objective:

$$\mathcal{L} = -\log \frac{\exp\!\big(s(q, d^{+})/\tau\big)}{\exp\!\big(s(q, d^{+})/\tau\big) + \sum_{i=1}^{N} \exp\!\big(s(q, d^{-}_{i})/\tau\big)},$$

where $s(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature hyperparameter, and $N$ is the number of hard negatives per query.
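The sketch below implements this objective in PyTorch under simplifying assumptions: only explicit hard negatives are used (in-batch negatives are omitted), and the temperature value is a placeholder, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor,
             positive: torch.Tensor,
             negatives: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with cosine similarity.

    query:     (B, H) query embeddings.
    positive:  (B, H) positive document embeddings.
    negatives: (B, N, H) hard-negative document embeddings per query.
    """
    q = F.normalize(query, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)

    pos_sim = (q * p).sum(dim=-1, keepdim=True)       # (B, 1) cosine of positive pair
    neg_sim = torch.einsum("bh,bnh->bn", q, n)        # (B, N) cosines of negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```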
3.2. Fine-tuning
Fine-tuning continues contrastive learning, now integrating hard negatives (see below) and objectives tailored to supervised retrieval, reranking, classification, and clustering tasks, further improving representation quality and consistency.
3.3. Model Soup (Parameter Averaging)
Inspired by model soup strategies, weights from several complementary checkpoints are averaged post fine-tuning. Empirically, this procedure provides additional generalization, sometimes outperforming any single constituent model.
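A minimal sketch of uniform parameter averaging over saved checkpoints; the paper may use a weighted or selective soup, and the file paths and helper name here are hypothetical.

```python
import torch

def average_checkpoints(paths):
    """Uniform 'model soup': average the parameters of several fine-tuned checkpoints."""
    soup = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    return {k: v / len(paths) for k, v in soup.items()}

# Hypothetical usage:
# averaged = average_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])
# model.load_state_dict(averaged)
```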
3.4. Focal-Style Reweighting
To address the imbalance in which easy pairs dominate the loss contribution, a focal-style variant of the contrastive loss is introduced:

$$\mathcal{L}_{\text{focal}} = -(1 - p)^{\gamma} \log p, \qquad p = \frac{\exp\!\big(s(q, d^{+})/\tau\big)}{\exp\!\big(s(q, d^{+})/\tau\big) + \sum_{i=1}^{N} \exp\!\big(s(q, d^{-}_{i})/\tau\big)},$$

with focusing parameter $\gamma$. This up-weights hard, informative samples and yields a measured improvement on the MTEB Chinese benchmark (+1.4 points).
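The sketch below applies this focal-style reweighting on top of contrastive logits; the default $\gamma = 2$ is the classic focal-loss choice, not necessarily the value used in the paper.

```python
import torch
import torch.nn.functional as F

def focal_contrastive_loss(logits: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal-style reweighting of a contrastive (InfoNCE-like) loss.

    logits: (B, 1 + N) similarity logits already divided by the temperature;
            column 0 is the positive, the remaining columns are negatives.
    gamma:  focusing parameter; gamma = 0 recovers plain InfoNCE.
    """
    log_p = F.log_softmax(logits, dim=-1)[:, 0]   # log-probability of the positive
    p = log_p.exp()
    loss = -((1.0 - p) ** gamma) * log_p          # down-weight easy (high-p) pairs
    return loss.mean()
```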
3.5. Online Hard-Negative Mixing
Offline mining of hard negatives is resource-intensive and static. KaLM-Embedding-V2 introduces in-training negative mixing by combining negative embeddings from the candidate pool:
- Pairwise mixing: $\tilde{d}^{-} = \lambda\, d^{-}_{i} + (1 - \lambda)\, d^{-}_{j}$, where $\lambda$ is sampled from $\mathrm{Beta}(2, 2)$ and $d^{-}_{i}, d^{-}_{j}$ are negative embeddings drawn from the candidate pool.
- Listwise mixing: $\tilde{d}^{-} = \sum_{k} \lambda_{k}\, d^{-}_{k}$, a convex combination over the list of candidate negatives with normalized mixing weights $\lambda_{k}$.
The resulting hard negatives are incorporated into the denominator of the contrastive loss, enhancing the model’s ability to discriminate among semantically similar inputs with minimal computational cost.
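A minimal sketch of the pairwise variant, assuming negatives are mixed per query with Beta(2, 2) interpolation weights; the number of synthetic negatives and the listwise variant are left out for brevity, and the function name is illustrative.

```python
import torch

def mix_hard_negatives(negatives: torch.Tensor, num_mixed: int = 4) -> torch.Tensor:
    """Create synthetic hard negatives by convexly mixing existing negative embeddings.

    negatives: (B, N, H) negative embeddings per query.
    Returns:   (B, N + num_mixed, H) original plus mixed negatives.
    """
    B, N, _ = negatives.shape
    beta = torch.distributions.Beta(2.0, 2.0)

    mixed = []
    for _ in range(num_mixed):
        # Pairwise mixing: interpolate two randomly chosen negatives per query.
        i = torch.randint(N, (B,))
        j = torch.randint(N, (B,))
        lam = beta.sample((B, 1)).to(negatives.dtype)
        a = negatives[torch.arange(B), i]          # (B, H)
        b = negatives[torch.arange(B), j]          # (B, H)
        mixed.append(lam * a + (1.0 - lam) * b)

    return torch.cat([negatives, torch.stack(mixed, dim=1)], dim=1)
```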
4. Empirical Evaluation and Benchmarks
KaLM-Embedding-V2 is benchmarked on the MTEB suite (Chinese and English), which includes tasks across retrieval, reranking, classification, clustering, semantic textual similarity, and summarization.
Summary of results for models <1B parameters (Section 4, Table 1):
| Model | Params | zh (mean) | en (mean) |
|---|---|---|---|
| KaLM-Embedding-V2 (mini-instruct) | 494M | 68.15 | 67.47 |
| jina-embeddings-v3-base | 800M | 65.41 | 62.01 |
| bge-m3 (Dense) | 560M | 65.45 | 64.62 |
| gte-multilingual-base (Dense) | 305M | 64.75 | 61.42 |
| multilingual-e5-large | 560M | 58.96 | 61.23 |
KaLM-Embedding-V2 not only outperforms all models under 1B parameters; on Chinese tasks it also exceeds bge-multilingual-gemma2 (9B) and performs on par with gte-Qwen2-1.5B-instruct and other models several times its size.
The model achieves best or second-best results across five of six Chinese and five of seven English macro task categories, and demonstrates robust performance for both retrieval-centric and classification-type tasks. The efficiency and competitive accuracy suggest suitability for production scenarios such as retrieval-augmented generation (RAG).
5. Innovative Contributions
KaLM-Embedding-V2 is distinguished by several key advances, which are summarized below:
| Innovation | Purpose/Description | Outcome |
|---|---|---|
| Bidirectional training (no causal mask) | Aligns architecture with embedding tasks | Stronger, more contextualized representations |
| Instruction-augmented queries | Task and domain awareness | Enhanced generalization, multi-task unification |
| Focal-style loss reweighting | Greater emphasis on hard/informative samples | Improved avg. Chinese MTEB (+1.4) |
| Online hard-negative mixing | Low-cost, continual enrichment of negatives | Increased discriminative power |
| Model soup parameter averaging | Post-hoc checkpoint merging for generalization | Outperforms any single checkpoint |
| Broad, multilingual data coverage | Robustness and cross-lingual transfer | Consistent SOTA results for <1B params |
The combination of these practices yields an embedding model that is compact (494M parameters, 896-dimensional embeddings), robust, and deployable in low-resource or production settings without major sacrifices in performance.
6. Practical Implications and Usage
KaLM-Embedding-V2 is designed for deployment across a range of retrieval- and inference-intensive applications, including but not limited to:
- Text and document retrieval in multi- or cross-lingual scenarios
- Knowledge graph completion, entity/relation retrieval, and question answering
- Embedding for clustering, classification, and semantic similarity search (dense retrieval)
- Integration as a plug-and-play sentence embedding service within larger language or RAG systems
The model’s configuration (mean pooling, instruction prompts) and open-source checkpoint availability suggest straightforward integration with modern NLP frameworks and pipelines. Its performance/size ratio renders it suitable for cost- or latency-sensitive environments, especially those requiring strong multilingual capability.
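A hedged usage sketch with sentence-transformers follows; the repository id is inferred from the official resources listed below and should be verified, and the instruction template mirrors the format shown earlier rather than a confirmed prompt.

```python
from sentence_transformers import SentenceTransformer

# Repository id assumed from the official resources; verify the exact name before use.
model = SentenceTransformer(
    "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2",
    trust_remote_code=True,
)

queries = [
    "Instruct: Given a web search query, retrieve relevant passages that answer it. "
    "Query: what is dense retrieval?"
]
documents = [
    "Dense retrieval encodes queries and documents into vectors and ranks documents "
    "by vector similarity."
]

q_emb = model.encode(queries, normalize_embeddings=True)
d_emb = model.encode(documents, normalize_embeddings=True)
print(q_emb @ d_emb.T)  # cosine similarity, since embeddings are L2-normalized
```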
KaLM-Embedding-V2 represents a refinement of both architectural alignment and training methodology for embedding models, substantiated by broad multilingual data curation, effective negative sampling, and a generalization-focused model averaging strategy. It offers a practical and highly competitive solution for diverse embedding needs, frequently equaling or outperforming much larger models on public benchmarks.
Official resources include the KaLM-Embedding-Multilingual-Mini-Instruct-V2 model on Hugging Face (under the HIT-TMG organization) and the project's GitHub repository.