
KaLM-Embedding-V2

Updated 1 July 2025
  • KaLM-Embedding-V2 is a compact, multilingual text embedding model that achieves state-of-the-art performance on MTEB benchmarks for Chinese and English, often matching much larger models.
  • It utilizes architectural innovations like a fully bidirectional Transformer base and advanced training techniques including focal-style loss reweighting and online hard-negative mixing.
  • The model's strong performance-to-size ratio and multilingual capability make it suitable for efficient deployment in applications like dense retrieval and RAG systems.

KaLM-Embedding-V2 is a versatile, compact, multilingual text embedding model designed for general-purpose sentence, document, and instruction embedding. It achieves state-of-the-art performance on the Massive Text Embedding Benchmark (MTEB) for both Chinese and English, frequently rivaling models many times larger in parameter count. Developed by the HIT-SCIR/HITsz-TMG group, KaLM-Embedding-V2 introduces innovations at both the architectural and training pipeline levels to enhance semantic representation, generalization, and efficiency.

1. Architectural Modifications

KaLM-Embedding-V2 is based on Qwen2-0.5B, an LLM originally trained with a causal (autoregressive) attention mask. Unlike its predecessors, KaLM-Embedding-V2 removes this causal mask, transforming the base into a fully bidirectional Transformer suited for representation learning tasks where each token may attend to all others in both directions. This design aligns the model's inductive bias with the requirements of embedding (holistic sequence understanding), in contrast to the unidirectional context required for next-token prediction.

After token embeddings are generated for an input sequence $\mathcal{T}$, mean pooling is used to derive the final fixed-length embedding:

$$\mathbf{E} = \mathcal{P}(\mathbf{T}_{\mathrm{emb}})$$

where $\mathbf{T}_{\mathrm{emb}} = \mathcal{K}(\mathcal{T})$ is a matrix of size $L \times d$ (sequence length × hidden size) and $\mathcal{P}(\cdot)$ denotes mean pooling. This approach is empirically shown to provide robust performance across diverse tasks.
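For illustration, a minimal masked mean-pooling sketch in PyTorch corresponding to $\mathcal{P}(\cdot)$ (tensor names and shapes are assumptions of this sketch, not the released implementation):

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean pooling over the sequence dimension.

    token_embeddings: (batch, L, d) hidden states from the bidirectional encoder
    attention_mask:   (batch, L) with 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, L, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # (batch, d)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero on empty sequences
    return summed / counts                           # (batch, d) fixed-length embeddings
```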

Typical input formatting includes an explicit instruction, enabling unification of multiple datasets and tasks:

Instruct: {task instruction} Query: {query}
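As a usage illustration, the template can be rendered in code before encoding; the instruction string below is a generic example, not one taken from the training data:

```python
def format_query(task_instruction: str, query: str) -> str:
    # Prepend the task instruction so a single encoder can serve many tasks and datasets.
    return f"Instruct: {task_instruction} Query: {query}"

text = format_query(
    "Given a web search query, retrieve relevant passages that answer the query.",
    "What is dense retrieval?",
)
```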

2. Data Curation and Preparation

KaLM-Embedding-V2's performance is strongly attributed to the scale and diversity of its training data. A multi-stage data pipeline is constructed:

  • Pre-training Data: Spanning over 20 categories, the weakly supervised, open-domain corpora include Amazon Reviews, CC-News, NLLB, Wikipedia, CSL, Wudao, THUCNews, Zhihu-KOL, Reddit, StackExchange, S2ORC, CodeSearchNet, and PAQ. Multiple languages are present, with a focus on Chinese and English.
  • Fine-tuning Data: Over 100 high-quality, annotated datasets are used, covering retrieval, reranking, classification, clustering, semantic similarity, and paraphrase tasks. Examples include ELI5, HotpotQA, MSMARCO, DuReader, Multi-CPR, RefGPT, SNLI, PAWS-X, and MIRACL. Dataset curation ensures no leakage into MTEB evaluation splits.

Each instance is augmented with task-specific instructions, supporting the model’s generalization across tasks and datasets.

3. Training Techniques and Loss Functions

KaLM-Embedding-V2 employs a three-stage training process:

3.1. Pre-training

Pre-training leverages the InfoNCE contrastive objective:

$$\mathcal{L} = \mathbb{E}_{i \in N} \left[ -\log \frac{e^{s(\mathbf{q}_i, \mathbf{p}_i^+)/\tau}}{Z_i} \right]$$

with

$$Z_i = e^{s(\mathbf{q}_i, \mathbf{p}_i^+)/\tau} + \sum_{j \neq i}^{N} e^{s(\mathbf{q}_i, \mathbf{p}_j^+)/\tau} + \sum_{j}^{N} \sum_{k}^{M} e^{s(\mathbf{q}_i, \mathbf{p}_{j,k}^-)/\tau}$$

where $s(\cdot)$ is cosine similarity, $\tau$ is a temperature hyperparameter, and $M$ is the number of hard negatives per query.
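A compact PyTorch sketch of this objective, using unit-normalized embeddings so the dot product equals cosine similarity (function and variable names are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def info_nce(q, p_pos, p_neg, tau: float = 0.05):
    """InfoNCE with in-batch negatives plus mined hard negatives.

    q:     (N, d) query embeddings
    p_pos: (N, d) positive passage embeddings
    p_neg: (N, M, d) hard-negative passage embeddings
    """
    q = F.normalize(q, dim=-1)
    p_pos = F.normalize(p_pos, dim=-1)
    p_neg = F.normalize(p_neg, dim=-1)

    sim_pos = q @ p_pos.T                            # (N, N): diagonal holds s(q_i, p_i^+)
    sim_neg = q @ p_neg.reshape(-1, q.size(-1)).T    # (N, N*M): hard negatives of all queries

    logits = torch.cat([sim_pos, sim_neg], dim=1) / tau
    targets = torch.arange(q.size(0), device=q.device)  # positive sits in column i for query i
    return F.cross_entropy(logits, targets)
```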

3.2. Fine-tuning

Fine-tuning continues contrastive learning, now integrating hard negatives (see below) and refined objectives tailored for supervised retrieval, reranking, classification, and clustering tasks, further increasing representation quality and consistency.

3.3. Model Soup (Parameter Averaging)

Inspired by model soup strategies, weights from several complementary checkpoints are averaged post fine-tuning. Empirically, this procedure provides additional generalization, sometimes outperforming any single constituent model.
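A minimal sketch of uniform parameter averaging, assuming the checkpoints share an identical architecture (the actual checkpoint selection and any non-uniform weighting may differ):

```python
import torch

def average_checkpoints(paths):
    """Uniformly average the parameters of several fine-tuned checkpoints."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.float().clone() for k, v in state.items()}
        else:
            for k in avg_state:
                avg_state[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg_state.items()}

# Example (hypothetical file names):
# soup = average_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])
# model.load_state_dict(soup)
```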

3.4. Focal-Style Reweighting

To address imbalance where easy pairs dominate loss contribution, a focal loss variant is introduced:

$$w_i = \left(1 - \frac{e^{s(\mathbf{q}_i, \mathbf{p}_i^+)/\tau}}{Z_i}\right)^{\gamma}$$

$$\mathcal{L} = \mathbb{E}_{i \in N} \left[ -\, w_i \log \frac{e^{s(\mathbf{q}_i, \mathbf{p}_i^+)/\tau}}{Z_i} \right]$$

with focusing parameter $\gamma = 0.5$. This up-weights hard, informative samples and yields a measured improvement on MTEB Chinese benchmarks (+1.4 points).
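A sketch of how such a focal weight can be attached to the InfoNCE objective (whether the weight is detached from the gradient is an assumption of this illustration):

```python
import torch
import torch.nn.functional as F

def focal_info_nce(logits, targets, gamma: float = 0.5):
    """InfoNCE with focal-style per-example reweighting.

    logits:  (N, C) similarity scores already divided by the temperature
    targets: (N,) index of the positive column for each query
    """
    log_probs = F.log_softmax(logits, dim=-1)
    pos_log_prob = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log of p_i = e^{s+/tau} / Z_i
    p = pos_log_prob.exp()
    weights = (1.0 - p).detach() ** gamma   # down-weight easy pairs; detaching is an assumed choice
    return (-(weights * pos_log_prob)).mean()
```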

3.5. Online Hard-Negative Mixing

Offline mining of hard negatives is resource-intensive and static. KaLM-Embedding-V2 introduces in-training negative mixing by combining negative embeddings from the candidate pool:

  • Pairwise mixing:

$$\tilde{\mathbf{h}}_{i}^- = \lambda \mathbf{p}_{i,j}^- + (1-\lambda)\, \mathbf{p}_{i,k}^-$$

where $\lambda$ is sampled from a $\mathrm{Beta}(2, 2)$ distribution.

  • Listwise mixing:

$$\tilde{\mathbf{s}}_{i}^- = \lambda_1 \mathbf{p}_{i,1}^- + \ldots + \lambda_M \mathbf{p}_{i,M}^- \quad \left(\textstyle\sum_k \lambda_k = 1\right)$$

The resulting hard negatives are incorporated into the denominator of the contrastive loss, enhancing the model’s ability to discriminate among semantically similar inputs with minimal computational cost.
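A possible in-batch implementation of both mixing schemes is sketched below; how the listwise coefficients $\lambda_k$ are drawn is not pinned down here, so uniformly sampled and normalized weights are an assumption:

```python
import torch

def mix_hard_negatives(p_neg: torch.Tensor, num_mixed: int = 4) -> torch.Tensor:
    """Create synthetic hard negatives by convexly mixing existing ones.

    p_neg: (N, M, d) hard-negative embeddings per query
    Returns (N, num_mixed + 1, d) mixed negatives to append to the candidate pool.
    """
    N, M, d = p_neg.shape
    beta = torch.distributions.Beta(2.0, 2.0)
    batch_idx = torch.arange(N).unsqueeze(1)

    # Pairwise mixing: lambda * p_j + (1 - lambda) * p_k, with lambda ~ Beta(2, 2).
    j = torch.randint(M, (N, num_mixed))
    k = torch.randint(M, (N, num_mixed))
    lam = beta.sample((N, num_mixed, 1))
    pairwise = lam * p_neg[batch_idx, j] + (1 - lam) * p_neg[batch_idx, k]

    # Listwise mixing: a convex combination over all M negatives (weights sum to 1).
    weights = torch.rand(N, 1, M)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    listwise = weights @ p_neg                           # (N, 1, d)

    return torch.cat([pairwise, listwise], dim=1)
```

The mixed embeddings are then concatenated with the original negatives before computing the contrastive denominator, which keeps the extra cost to a few tensor operations per batch.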

4. Empirical Evaluation and Benchmarks

KaLM-Embedding-V2 is benchmarked on the MTEB suite (Chinese and English), which includes tasks across retrieval, reranking, classification, clustering, semantic textual similarity, and summarization.

Summary of results for models <1B parameters (Section 4, Table 1):

| Model | Params | zh (mean) | en (mean) |
|---|---|---|---|
| KaLM-Embedding-V2 (mini-instruct) | 494M | 68.15 | 67.47 |
| jina-embeddings-v3-base | 800M | 65.41 | 62.01 |
| bge-m3 (Dense) | 560M | 65.45 | 64.62 |
| gte-multilingual-base (Dense) | 305M | 64.75 | 61.42 |
| multilingual-e5-large | 560M | 58.96 | 61.23 |

KaLM-Embedding-V2 not only outperforms all models under 1B parameters but, on Chinese tasks, also exceeds bge-multilingual-gemma2 (9B) and performs comparably to gte-Qwen2-1.5B-instruct and other models several times its size.

The model achieves best or second-best results across five of six Chinese and five of seven English macro task categories, and demonstrates robust performance for both retrieval-centric and classification-type tasks. The efficiency and competitive accuracy suggest suitability for production scenarios such as retrieval-augmented generation (RAG).

5. Innovative Contributions

KaLM-Embedding-V2 is distinguished by several key advances, which are summarized below:

| Innovation | Purpose/Description | Outcome |
|---|---|---|
| Bidirectional training (no causal mask) | Aligning architecture with embedding tasks | Stronger, more contextualized representations |
| Instruction-augmented queries | Task- and domain-awareness | Enhanced generalization, multi-task unification |
| Focal-style loss reweighting | Greater emphasis on hard/informative samples | Improved avg. Chinese MTEB (+1.4) |
| Online hard-negative mixing | Low-cost, continual enrichment of negatives | Increased discriminative power |
| Model soup parameter averaging | Post-hoc model merging for generalization | Can surpass any single constituent checkpoint |
| Broad, multilingual data coverage | Robustness and cross-lingual transfer | Consistent SOTA results for <1B params |

The combination of these practices leads to an embedding model that is compact (494M parameters, 896-dim vector), robust, and deployable in low-resource or production settings without major sacrifices in performance.

6. Practical Implications and Usage

KaLM-Embedding-V2 is designed for deployment across a range of retrieval- and inference-intensive applications, including but not limited to:

  • Text and document retrieval in multi- or cross-lingual scenarios
  • Knowledge graph completion, entity/relation retrieval, and question answering
  • Embedding for clustering, classification, and semantic similarity search (dense retrieval)
  • Integration as a plug-and-play sentence embedding service within larger language or RAG systems

The model’s configuration (mean pooling, instruction prompts) and open-source checkpoint availability suggest straightforward integration with modern NLP frameworks and pipelines. Its performance/size ratio renders it suitable for cost- or latency-sensitive environments, especially those requiring strong multilingual capability.
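For example, a minimal retrieval-style usage sketch with sentence-transformers; the checkpoint ID is assumed from the Hugging Face release named below and should be verified against the model card:

```python
from sentence_transformers import SentenceTransformer

# Assumed repository ID based on the project's Hugging Face release; check the model card.
model = SentenceTransformer("HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2")

instruction = "Given a web search query, retrieve relevant passages that answer the query."
queries = [f"Instruct: {instruction} Query: {q}"
           for q in ["什么是稠密检索?", "What is dense retrieval?"]]
passages = ["Dense retrieval encodes queries and documents as vectors and ranks them by similarity."]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T   # cosine similarity, since embeddings are normalized
print(scores)
```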


KaLM-Embedding-V2 represents a refinement of both architectural alignment and training methodology for embedding models, substantiated by broad multilingual data curation, effective negative sampling, and a generalization-focused model averaging strategy. It offers a practical and highly competitive solution for diverse embedding needs, frequently equaling or outperforming much larger models on public benchmarks.

Official resources include the HIT-TMG/Hugging Face KaLM-Embedding-Multilingual-Mini-Instruct-V2 model and project GitHub repository.