
Granite Embedding Models

Published 27 Feb 2025 in cs.IR and cs.CL | (2502.20204v1)

Abstract: We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse retrieval architectures, with both English and Multilingual capabilities. This report provides the technical details of training these highly effective 12 layer embedding models, along with their efficient 6 layer distilled counterparts. Extensive evaluations show that the models, developed with techniques like retrieval oriented pretraining, contrastive finetuning, knowledge distillation, and model merging significantly outperform publicly available models of similar sizes on both internal IBM retrieval and search tasks, and have equivalent performance on widely used information retrieval benchmarks, while being trained on high-quality data suitable for enterprise use. We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use at https://huggingface.co/collections/ibm-granite.

Summary

  • The paper presents Granite Embedding models designed for enterprise text retrieval using a bi-encoder architecture, achieving competitive performance on key benchmarks.
  • The paper details a training regimen that combines contrastive learning with knowledge distillation, self-distillation, and model merging to enhance model efficiency.
  • The paper demonstrates that distilled smaller models achieve lower retrieval latency while maintaining accuracy, making them ideal for latency-constrained enterprise applications.

IBM Research AI introduces the Granite Embedding models, a family of encoder-based text embedding models specifically designed for retrieval tasks in enterprise settings. These models support both dense and sparse retrieval architectures and are available for English and multilingual applications. A key aspect of the Granite models is their training on high-quality, curated data with commercial-use permissions, released under the Apache 2.0 license.

The Granite family includes five models:

  • Dense English Models: granite-embedding-30m-english (6 layers, 384 embedding size) and granite-embedding-125m-english (12 layers, 768 embedding size). These are based on a RoBERTa-like architecture.
  • Dense Multilingual Models: granite-embedding-107m-multilingual (6 layers, 384 embedding size) and granite-embedding-278m-multilingual (12 layers, 768 embedding size). These use an XLM-RoBERTa-like architecture and are finetuned for 12 target languages, with potential applicability to approximately 100 languages covered by the XLM-RoBERTa vocabulary.
  • Sparse English Model: granite-embedding-30m-sparse (6 layers, 384 embedding size). This model is adapted for sparse retrieval capabilities in English.

All models use a bi-encoder architecture to generate embeddings. The dense models take the final hidden state of the [CLS] token as the embedding representation, which the authors found more effective than mean-pooling. The sparse model max-pools across sequence tokens to produce a variable-length, weighted bag-of-words-like output. All models have a maximum context length of 512 tokens.
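The two pooling strategies can be sketched in plain NumPy. This is an illustrative sketch, not the released models' code: random hidden states stand in for encoder output, and in the actual sparse model the term weights are typically derived from a vocabulary projection rather than raw hidden states.

```python
import numpy as np

def cls_pool(hidden):
    # hidden: (batch, seq_len, dim) final-layer hidden states.
    # Dense Granite models take the [CLS] (first-token) state as the embedding,
    # here unit-normalized so dot products are cosine similarities.
    emb = hidden[:, 0, :]
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)

def max_pool(hidden, mask):
    # Sparse model: max over sequence positions, ignoring padded tokens.
    masked = np.where(mask[:, :, None].astype(bool), hidden, -np.inf)
    return masked.max(axis=1)

hidden = np.random.randn(2, 5, 384)              # 2 texts, 5 tokens, hidden size 384
mask = np.array([[1, 1, 1, 0, 0],                # first text has 2 padding tokens
                 [1, 1, 1, 1, 1]])
dense = cls_pool(hidden)                         # (2, 384) unit-norm dense embeddings
sparse = max_pool(hidden, mask)                  # (2, 384) max-pooled term scores
```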

The training methodology for Granite Embedding models is based on contrastive learning, aiming to minimize the distance between embeddings of a query and relevant passages while maximizing distance from non-relevant ones. They employ an enhanced contrastive loss similar to GTE (Li et al., 2023), incorporating bidirectional negative signals.
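The bidirectional in-batch form of this objective can be sketched as below. Note this is a simplified sketch: the GTE-style loss used in the paper additionally includes query-query and passage-passage negatives, omitted here, and the temperature value is illustrative, not the paper's setting.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax along the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def bidirectional_contrastive_loss(q, p, temperature=0.05):
    """InfoNCE over in-batch negatives, in both directions.

    q, p: (batch, dim) L2-normalized query / passage embeddings;
    the positives sit on the diagonal of the similarity matrix.
    """
    sim = q @ p.T / temperature                       # (batch, batch) scores
    idx = np.arange(sim.shape[0])
    loss_qp = -log_softmax(sim)[idx, idx].mean()      # query -> passage direction
    loss_pq = -log_softmax(sim.T)[idx, idx].mean()    # passage -> query direction
    return 0.5 * (loss_qp + loss_pq)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))
p = q + 0.1 * rng.standard_normal((4, 16))            # paired "relevant passages"
q /= np.linalg.norm(q, axis=-1, keepdims=True)
p /= np.linalg.norm(p, axis=-1, keepdims=True)
loss = bidirectional_contrastive_loss(q, p)
```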

Training data consists of:

  1. Weakly paired data: Mined from diverse web sources such as Wikipedia, StackExchange, Semantic Scholar, arXiv, PubMed, mC4, multilingual Wikipedia, and Webhose. This data uses in-batch negatives.
  2. High-quality, annotated data: Includes publicly available datasets (NQ, SQuAD, HotpotQA, FEVER, MIRACL, TyDiQA, Sadeem QA) and IBM-internal data targeting technical domains. This data often includes mined or annotated hard negatives. Notably, the widely used MS-MARCO dataset is excluded due to licensing restrictions.
  3. Synthetic data: High-quality multilingual query-passage pairs and hard negatives generated using LLMs for Wikipedia paragraphs. Prompts are designed to create diverse query types.

Hard negatives are mined from high-quality datasets using a pre-existing lightweight embedding model, sampling from the top-100 non-relevant passages and filtering potential false positives based on similarity to the positive passage. Data sampling during training uses a stratified approach, sampling batches from each dataset with probability proportional to its size raised to a power α.
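The size-proportional sampling scheme can be sketched as follows. The value of α here is illustrative only (the paper's actual setting is not reproduced in this summary); α < 1 flattens the distribution, up-weighting small datasets relative to pure proportional sampling.

```python
import numpy as np

def dataset_sampling_probs(sizes, alpha=0.5):
    """Probability of drawing the next batch from each dataset,
    proportional to size**alpha (stratified sampling).
    alpha=0.5 is an illustrative value, not the paper's setting."""
    w = np.asarray(sizes, dtype=float) ** alpha
    return w / w.sum()

# A large web-mined corpus, a mid-size QA set, and a small annotated set.
probs = dataset_sampling_probs([1_000_000, 10_000, 1_000])
```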

Key techniques used in training include:

  • Retrieval Oriented Pretraining: Applied to English models, following RetroMAE (Xiao et al., 2022). This involves a masked auto-encoding task using an asymmetric encoder-decoder structure.
  • Knowledge Distillation: Crucial for improving performance, especially for smaller models. The method distills the distribution of similarity scores from a larger teacher model (or via self-distillation) to a smaller student model using a cross-entropy loss on temperature-scaled scores. This technique is effective for transferring knowledge even when teacher and student have different embedding sizes. For datasets without hard negatives, perturbing the positive passage is used to create a richer score distribution for distillation.
  • Self-Distillation: Used for the granite-embedding-278m-multilingual model to further improve performance without external data or models.
  • Model Merging: Used for the granite-embedding-30m-english model to adapt to specific enterprise domains without degrading general performance (Xiao et al., 2023).
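The score-distribution distillation described above can be sketched as below. Because only scalar similarity scores are compared, the teacher and student may have different embedding sizes. The temperature defaults and example score values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def score_distillation_loss(student_scores, teacher_scores, t_student=1.0, t_teacher=1.0):
    """Cross-entropy between the teacher's and student's distributions over
    the similarity scores of one query against its candidate passages.
    Temperature scaling softens/sharpens the distributions before matching."""
    p_teacher = softmax(teacher_scores / t_teacher)
    log_p_student = log_softmax(student_scores / t_student)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Hypothetical scores for one query against a positive and 3 hard negatives.
teacher = np.array([[4.0, 1.0, 0.5, 0.2]])
student = np.array([[2.5, 1.2, 0.3, 0.1]])
loss = score_distillation_loss(student, teacher)
```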

Specific model training flows:

  • granite-embedding-125m-english: Starts with an in-house WatBERT base, followed by retrieval-oriented pretraining, and then knowledge distillation from a fine-tuned decoder-based LLM (Mistral-7B-Instruct (Jiang et al., 2023)).
  • granite-embedding-30m-english: Based on a 6-layer WatBERT, trained with RetroMAE-based distillation, then contrastive distillation from a larger encoder teacher, and finally domain adaptation via model merging.
  • granite-embedding-278m-multilingual: Starts with an in-house XLM-RoBERTa base, undergoes stage-wise contrastive training (6 languages then 12 languages), and concludes with self-distillation.
  • granite-embedding-107m-multilingual: Distilled from the larger 278M multilingual model in stages (6 languages then 12 languages) using contrastive distillation.
  • granite-embedding-30m-sparse: Starts with a 6-layer WatBERT, applies RetroMAE-based distillation, and then contrastive KD from the 125M dense teacher. It uses max-pooling for term weights and introduces a NORM loss (L_NORM) alongside the standard FLOPS loss as regularization to encourage sparsity. The overall loss is a combination of the KD, FLOPS, and NORM losses.
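The standard FLOPS regularizer used for sparsity (Paria et al., 2020, as in SPLADE-style models) can be sketched as below; the exact form of the paper's additional NORM loss is not reproduced in this summary, so only the FLOPS term is shown.

```python
import numpy as np

def flops_loss(weights):
    """FLOPS sparsity regularizer: the sum over vocabulary terms of the
    squared mean activation across the batch. Penalizing the mean per term
    pushes rarely-useful terms toward exactly zero weight.

    weights: (batch, vocab) non-negative sparse term weights."""
    return (weights.mean(axis=0) ** 2).sum()

# Hypothetical term-weight matrices for a batch of 8 passages, vocab of 100.
dense_w = np.ones((8, 100))        # every vocabulary term active everywhere
sparse_w = np.zeros((8, 100))
sparse_w[:, :3] = 1.0              # only 3 active vocabulary terms
```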

Extensive evaluations were conducted on various benchmarks:

  • English: BEIR (Thakur et al., 2021), MTEB (Muennighoff et al., 2022), code retrieval (COIR; Li et al., 2024), and IBM internal benchmarks (ClapNQ (Rosenthal et al., 2025), Red Hat, Unified Search).
  • Multilingual: MIRACL (Zhang et al., 2023), Mintaka Retrieval (Sen et al., 2022), and multilingual retrieval subsets of MTEB.

Results indicate that Granite Embedding models achieve competitive or superior performance compared to publicly available models of similar sizes, particularly on internal IBM tasks and zero-shot code retrieval benchmarks, despite not being trained on datasets like MS-MARCO or specific code retrieval data. The smaller, distilled models (granite-embedding-30m-english, granite-embedding-107m-multilingual) demonstrate significantly lower retrieval latency while maintaining good performance, making them suitable for latency-constrained applications. Performance on Chinese multilingual tasks is noted as lower compared to some competitors, attributed to differences in specific in-domain training data.

The paper concludes by emphasizing the Granite Embedding models' suitability for enterprise retrieval applications due to their performance, efficiency, high-quality data training, and permissive Apache 2.0 license. The authors plan to continuously update these models with performance improvements and new features.
