KaLM-Embedding-V2
- KaLM-Embedding-V2 is a compact, multilingual text embedding model that achieves state-of-the-art performance on MTEB benchmarks for Chinese and English, often matching much larger models.
- It utilizes architectural innovations like a fully bidirectional Transformer base and advanced training techniques including focal-style loss reweighting and online hard-negative mixing.
- The model's strong performance-to-size ratio and multilingual capability make it suitable for efficient deployment in applications like dense retrieval and RAG systems.
KaLM-Embedding-V2 is a versatile, compact, multilingual text embedding model designed for general-purpose sentence, document, and instruction embedding. It achieves state-of-the-art performance on the Massive Text Embedding Benchmark (MTEB) for both Chinese and English, frequently rivaling models many times larger in parameter count. Developed by the HIT-SCIR/HITsz-TMG group, KaLM-Embedding-V2 introduces innovations at both the architectural and training pipeline levels to enhance semantic representation, generalization, and efficiency.
1. Architectural Modifications
KaLM-Embedding-V2 is built on Qwen2-0.5B, an LLM that originally uses a causal (autoregressive) attention mask. Unlike its predecessors, KaLM-Embedding-V2 removes this causal mask, turning the base model into a fully bidirectional Transformer suited to representation learning, where each token may attend to all other tokens in both directions. This design aligns the model's inductive bias with the requirements of embedding (holistic sequence understanding), in contrast to the unidirectional context used for next-token prediction.
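To make the distinction concrete, the following is a minimal PyTorch sketch (not taken from the model's codebase) contrasting causal and bidirectional self-attention using `torch.nn.functional.scaled_dot_product_attention`; the tensor shapes are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder query/key/value tensors: (batch, heads, seq_len, head_dim).
q = k = v = torch.randn(1, 8, 16, 64)

# Causal attention: token i may only attend to tokens <= i (next-token prediction).
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Bidirectional attention: every token attends to the full sequence,
# the setting used once the causal mask is removed for embedding.
bidirectional_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print(causal_out.shape, bidirectional_out.shape)  # both torch.Size([1, 8, 16, 64])
```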
After token representations $\mathbf{H} = [\mathbf{h}_1, \dots, \mathbf{h}_n] \in \mathbb{R}^{n \times d}$ are generated for an input sequence of length $n$, mean pooling is used to derive the final fixed-length embedding:

$$\mathbf{e} = \mathrm{MeanPool}(\mathbf{H}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i,$$

where $\mathbf{H}$ is a matrix of size $n \times d$ (sequence length × hidden size) and $\mathrm{MeanPool}(\cdot)$ denotes mean pooling. This approach is empirically shown to provide robust performance across diverse tasks.
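As a concrete illustration of the formula above, here is a mask-aware mean-pooling sketch in PyTorch; the function name and signature are illustrative rather than taken from the released code.

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean pooling over token representations.

    hidden_states: (batch, seq_len, hidden_size) last-layer token embeddings.
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (B, L, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # number of real tokens
    return summed / counts                                       # (B, hidden_size)
```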
Typical input formatting includes an explicit instruction, enabling unification of multiple datasets and tasks:
```
Instruct: {task instruction} Query: {query}
```
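A small sketch of how such an instruction-augmented input might be assembled; the exact template and separators used by the released checkpoints are an assumption here.

```python
def format_query(task_instruction: str, query: str) -> str:
    # Illustrative prompt construction; the released model's template may differ.
    return f"Instruct: {task_instruction} Query: {query}"

print(format_query(
    "Given a web search query, retrieve relevant passages that answer it.",
    "how does mean pooling work in embedding models?",
))
```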
2. Data Curation and Preparation
KaLM-Embedding-V2's performance is attributed in large part to the scale and diversity of its training data. A multi-stage data pipeline is constructed:
- Pre-training Data: Spanning over 20 categories, the weakly supervised, open-domain corpora include Amazon Reviews, CC-News, NLLB, Wikipedia, CSL, Wudao, THUCNews, Zhihu-KOL, Reddit, StackExchange, S2ORC, CodeSearchNet, and PAQ. Multiple languages are present, with a focus on Chinese and English.
- Fine-tuning Data: Over 100 high-quality, annotated datasets are used, covering retrieval, reranking, classification, clustering, semantic similarity, and paraphrase tasks. Examples include ELI5, HotpotQA, MSMARCO, DuReader, Multi-CPR, RefGPT, SNLI, PAWS-X, and MIRACL. Dataset curation ensures no leakage into MTEB evaluation splits.
Each instance is augmented with task-specific instructions, supporting the model’s generalization across tasks and datasets.
3. Training Techniques and Loss Functions
KaLM-Embedding-V2 employs a three-stage training process:
3.1. Pre-training
Pre-training leverages the InfoNCE contrastive objective:

$$\mathcal{L} = -\log \frac{\exp\!\big(s(q, d^{+})/\tau\big)}{\exp\!\big(s(q, d^{+})/\tau\big) + \sum_{i=1}^{N} \exp\!\big(s(q, d^{-}_{i})/\tau\big)},$$

where $s(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature hyperparameter, and $N$ is the number of hard negatives per query.
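The sketch below implements this objective in PyTorch under simplifying assumptions: only explicit hard negatives are used (in-batch negatives are omitted), and the temperature value is a placeholder, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor,
             positive: torch.Tensor,
             negatives: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with cosine similarity.

    query:     (B, H) query embeddings.
    positive:  (B, H) positive document embeddings.
    negatives: (B, N, H) hard-negative document embeddings per query.
    """
    q = F.normalize(query, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)

    pos_sim = (q * p).sum(dim=-1, keepdim=True)       # (B, 1) cosine of positive pair
    neg_sim = torch.einsum("bh,bnh->bn", q, n)        # (B, N) cosines of negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```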
3.2. Fine-tuning
Fine-tuning continues contrastive learning, now integrating hard negatives (see below) and objectives tailored to supervised retrieval, reranking, classification, and clustering tasks, further improving representation quality and consistency.
3.3. Model Soup (Parameter Averaging)
Inspired by model soup strategies, weights from several complementary checkpoints are averaged post fine-tuning. Empirically, this procedure provides additional generalization, sometimes outperforming any single constituent model.
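A minimal sketch of uniform parameter averaging over saved checkpoints; the paper may use a weighted or selective soup, and the file paths and helper name here are hypothetical.

```python
import torch

def average_checkpoints(paths):
    """Uniform 'model soup': average the parameters of several fine-tuned checkpoints."""
    soup = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    return {k: v / len(paths) for k, v in soup.items()}

# Hypothetical usage:
# averaged = average_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])
# model.load_state_dict(averaged)
```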
3.4. Focal-Style Reweighting
To address the imbalance in which easy pairs dominate the loss contribution, a focal-style variant of the contrastive loss is introduced:

$$\mathcal{L}_{\text{focal}} = -(1 - p)^{\gamma} \log p, \qquad p = \frac{\exp\!\big(s(q, d^{+})/\tau\big)}{\exp\!\big(s(q, d^{+})/\tau\big) + \sum_{i=1}^{N} \exp\!\big(s(q, d^{-}_{i})/\tau\big)},$$

with focusing parameter $\gamma$. This up-weights hard, informative samples and yields a measured improvement on the MTEB Chinese benchmark (+1.4 points).
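The sketch below applies this focal-style reweighting on top of contrastive logits; the default $\gamma = 2$ is the classic focal-loss choice, not necessarily the value used in the paper.

```python
import torch
import torch.nn.functional as F

def focal_contrastive_loss(logits: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal-style reweighting of a contrastive (InfoNCE-like) loss.

    logits: (B, 1 + N) similarity logits already divided by the temperature;
            column 0 is the positive, the remaining columns are negatives.
    gamma:  focusing parameter; gamma = 0 recovers plain InfoNCE.
    """
    log_p = F.log_softmax(logits, dim=-1)[:, 0]   # log-probability of the positive
    p = log_p.exp()
    loss = -((1.0 - p) ** gamma) * log_p          # down-weight easy (high-p) pairs
    return loss.mean()
```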
3.5. Online Hard-Negative Mixing
Offline mining of hard negatives is resource-intensive and static. KaLM-Embedding-V2 introduces in-training negative mixing by combining negative embeddings from the candidate pool:
- Pairwise mixing: $\tilde{d}^{-} = \lambda\, d^{-}_{i} + (1 - \lambda)\, d^{-}_{j}$, where $\lambda$ is sampled from $\mathrm{Beta}(2, 2)$ and $d^{-}_{i}, d^{-}_{j}$ are negative embeddings drawn from the candidate pool.
- Listwise mixing: $\tilde{d}^{-} = \sum_{k} \lambda_{k}\, d^{-}_{k}$, a convex combination over the list of candidate negatives with normalized mixing weights $\lambda_{k}$.
The resulting hard negatives are incorporated into the denominator of the contrastive loss, enhancing the model’s ability to discriminate among semantically similar inputs with minimal computational cost.
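A minimal sketch of the pairwise variant, assuming negatives are mixed per query with Beta(2, 2) interpolation weights; the number of synthetic negatives and the listwise variant are left out for brevity, and the function name is illustrative.

```python
import torch

def mix_hard_negatives(negatives: torch.Tensor, num_mixed: int = 4) -> torch.Tensor:
    """Create synthetic hard negatives by convexly mixing existing negative embeddings.

    negatives: (B, N, H) negative embeddings per query.
    Returns:   (B, N + num_mixed, H) original plus mixed negatives.
    """
    B, N, _ = negatives.shape
    beta = torch.distributions.Beta(2.0, 2.0)

    mixed = []
    for _ in range(num_mixed):
        # Pairwise mixing: interpolate two randomly chosen negatives per query.
        i = torch.randint(N, (B,))
        j = torch.randint(N, (B,))
        lam = beta.sample((B, 1)).to(negatives.dtype)
        a = negatives[torch.arange(B), i]          # (B, H)
        b = negatives[torch.arange(B), j]          # (B, H)
        mixed.append(lam * a + (1.0 - lam) * b)

    return torch.cat([negatives, torch.stack(mixed, dim=1)], dim=1)
```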
4. Empirical Evaluation and Benchmarks
KaLM-Embedding-V2 is benchmarked on the MTEB suite (Chinese and English), which includes tasks across retrieval, reranking, classification, clustering, semantic textual similarity, and summarization.
Summary of results for models <1B parameters (Section 4, Table 1):
| Model | Params | zh (mean) | en (mean) |
|---|---|---|---|
| KaLM-Embedding-V2 (mini-instruct) | 494M | 68.15 | 67.47 |
| jina-embeddings-v3-base | 800M | 65.41 | 62.01 |
| bge-m3 (Dense) | 560M | 65.45 | 64.62 |
| gte-multilingual-base (Dense) | 305M | 64.75 | 61.42 |
| multilingual-e5-large | 560M | 58.96 | 61.23 |
KaLM-Embedding-V2 not only outperforms all models under 1B parameters; on Chinese tasks it also exceeds bge-multilingual-gemma2 (9B) and performs on par with gte-Qwen2-1.5B-instruct and other models several times its size.
The model achieves best or second-best results across five of six Chinese and five of seven English macro task categories, and demonstrates robust performance for both retrieval-centric and classification-type tasks. The efficiency and competitive accuracy suggest suitability for production scenarios such as retrieval-augmented generation (RAG).
5. Innovative Contributions
KaLM-Embedding-V2 is distinguished by several key advances, which are summarized below:
| Innovation | Purpose/Description | Outcome |
|---|---|---|
| Bidirectional training (no causal mask) | Aligns architecture with embedding tasks | Stronger, more contextualized representations |
| Instruction-augmented queries | Task and domain awareness | Enhanced generalization, multi-task unification |
| Focal-style loss reweighting | Greater emphasis on hard/informative samples | Improved avg. Chinese MTEB (+1.4) |
| Online hard-negative mixing | Low-cost, continual enrichment of negatives | Increased discriminative power |
| Model soup parameter averaging | Post-hoc checkpoint merging for generalization | Outperforms any single checkpoint |
| Broad, multilingual data coverage | Robustness and cross-lingual transfer | Consistent SOTA results for <1B params |
The combination of these practices yields an embedding model that is compact (494M parameters, 896-dimensional embeddings), robust, and deployable in low-resource or production settings without major sacrifices in performance.
6. Practical Implications and Usage
KaLM-Embedding-V2 is designed for deployment across a range of retrieval- and inference-intensive applications, including but not limited to:
- Text and document retrieval in multi- or cross-lingual scenarios
- Knowledge graph completion, entity/relation retrieval, and question answering
- Embedding for clustering, classification, and semantic similarity search (dense retrieval)
- Integration as a plug-and-play sentence embedding service within larger language or RAG systems
The model’s configuration (mean pooling, instruction prompts) and open-source checkpoint availability suggest straightforward integration with modern NLP frameworks and pipelines. Its performance/size ratio renders it suitable for cost- or latency-sensitive environments, especially those requiring strong multilingual capability.
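A hedged usage sketch with sentence-transformers follows; the repository id is inferred from the official resources listed below and should be verified, and the instruction template mirrors the format shown earlier rather than a confirmed prompt.

```python
from sentence_transformers import SentenceTransformer

# Repository id assumed from the official resources; verify the exact name before use.
model = SentenceTransformer(
    "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2",
    trust_remote_code=True,
)

queries = [
    "Instruct: Given a web search query, retrieve relevant passages that answer it. "
    "Query: what is dense retrieval?"
]
documents = [
    "Dense retrieval encodes queries and documents into vectors and ranks documents "
    "by vector similarity."
]

q_emb = model.encode(queries, normalize_embeddings=True)
d_emb = model.encode(documents, normalize_embeddings=True)
print(q_emb @ d_emb.T)  # cosine similarity, since embeddings are L2-normalized
```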
KaLM-Embedding-V2 represents a refinement of both architectural alignment and training methodology for embedding models, substantiated by broad multilingual data curation, effective negative sampling, and a generalization-focused model averaging strategy. It offers a practical and highly competitive solution for diverse embedding needs, frequently equaling or outperforming much larger models on public benchmarks.
Official resources include the KaLM-Embedding-Multilingual-Mini-Instruct-V2 model on Hugging Face (under the HIT-TMG organization) and the project's GitHub repository.