TurkEmbed4Retrieval: Turkish Retrieval Embedding Model
- TurkEmbed4Retrieval is a retrieval-specialized embedding model for Turkish that leverages Matryoshka representation learning and large-batch cached negatives.
- It modifies the GTE-multilingual-base architecture by freezing early layers and fine-tuning deeper layers to enhance semantic discrimination for retrieval tasks.
- Empirical evaluations on datasets like SciFact-TR and MS-Marco-TR demonstrate significant improvements in recall and overall retrieval performance.
TurkEmbed4Retrieval is a retrieval-specialized embedding model for the Turkish language, derived by fine-tuning the GTE-multilingual-base encoder. It extends the TurkEmbed family, originally developed for tasks such as Natural Language Inference (NLI) and Semantic Textual Similarity (STS). Leveraging Matryoshka representation learning and a cached multiple negatives ranking loss, TurkEmbed4Retrieval establishes new performance benchmarks for Turkish information retrieval, significantly surpassing prior models such as Turkish-colBERT on datasets including SciFact-TR. The model is structured to support scalable deployment and adaptation to a variety of retrieval and retrieval-augmented generation (RAG) scenarios in Turkish.
1. Model Architecture and Retrieval-Specific Modifications
TurkEmbed4Retrieval is built upon GTE-multilingual-base, a 305M-parameter encoder with 12 Transformer layers, each with a hidden size of 768, 12 attention heads, and a feed-forward dimension of 3,072. The base model is pretrained with the Matryoshka Representation Learning approach to generate nested, multi-level embeddings.
Key retrieval-specific architectural modifications include:
- Removal of task-specific classification heads, replaced by a single dense projection of dimension 768, followed by normalization of output vectors.
- Fine-tuning with a "Cached Multiple Negatives Ranking" (MNR) strategy, enabling large effective batch sizes (cache size up to 1,024) on compatible hardware (such as a single A100, 40GB GPU).
- Freezing of Transformer layers 1–3 to maintain multilingual morphological representations, while jointly fine-tuning layers 4–12 and the final projection head, optimizing deeper layers for retrieval discrimination.
This modular configuration enables flexible adaptation of the encoder for retrieval-focused applications, preserving both coarse linguistic and fine-grained semantic features.
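As a minimal sketch of these modifications, one might freeze the lower blocks and attach a normalized projection head as follows. This assumes the base checkpoint is loaded through Hugging Face transformers with BERT-style parameter names (the actual GTE implementation may expose different attribute paths); the pooling choice and name prefixes are illustrative, not the authors' exact code:

```python
# Sketch: freeze the embeddings and the first three Transformer blocks (layers 1-3),
# keep deeper blocks plus a dense 768-d projection trainable, and L2-normalize outputs.
# Parameter-name prefixes assume a BERT-style layout and may differ for GTE-multilingual-base.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Alibaba-NLP/gte-multilingual-base"  # public base checkpoint (path assumed here)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

FROZEN_PREFIXES = ("embeddings.", "encoder.layer.0.", "encoder.layer.1.", "encoder.layer.2.")
for name, param in encoder.named_parameters():
    if name.startswith(FROZEN_PREFIXES):
        param.requires_grad = False  # preserve multilingual morphological features

class RetrievalEncoder(nn.Module):
    """Mean-pooled encoder with a 768-d projection and L2-normalized outputs."""
    def __init__(self, encoder, dim=768):
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Linear(encoder.config.hidden_size, dim)

    def forward(self, **batch):
        hidden = self.encoder(**batch).last_hidden_state        # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()    # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return nn.functional.normalize(self.proj(pooled), p=2, dim=-1)
```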
2. Matryoshka Representation Learning
The model employs Matryoshka Representation Learning, which trains embeddings at multiple levels of network depth ("nested representations"). Let $z^{(m)}$ denote the sub-embedding at level $m$. The objective optimizes discrimination between positive and negative pairs at every level:

$$\mathcal{L}_{\text{MRL}} = \sum_{m=1}^{M} \alpha_m \, \mathcal{L}_m\!\left(d\big(z_q^{(m)}, z_p^{(m)}\big),\ d\big(z_q^{(m)}, z_n^{(m)}\big)\right),$$

where $M$ is the number of representation levels, $\alpha_m$ weights early (coarse) versus late (fine) layer contributions, $d(\cdot,\cdot)$ is a distance metric such as squared Euclidean or cosine distance, and each per-level term $\mathcal{L}_m$ drives query–positive distances below query–negative distances.
At inference, representations are either concatenated across levels or taken from the final layer, depending on downstream requirements. Early layers encode broad-topic similarities, while higher layers capture lexical and syntactic nuances. The inclusion of Matryoshka objectives leads to measurable improvement in downstream retrieval; ablation removing Matryoshka reduced Recall@10 by approximately 2 percentage points.
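A compact PyTorch sketch of a multi-level contrastive objective of this form is given below, treating pooled outputs from three depths as the levels; the level count, weights, and temperature are illustrative assumptions, not the paper's settings:

```python
# Sketch: weighted sum of per-level contrastive (InfoNCE-style) terms, one per
# representation level, with in-batch passages serving as negatives.
import torch
import torch.nn.functional as F

def matryoshka_contrastive_loss(query_levels, passage_levels, weights, tau=0.05):
    """query_levels / passage_levels: lists of (B, D) tensors, one per level m."""
    total = 0.0
    for alpha, q, p in zip(weights, query_levels, passage_levels):
        q, p = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
        sims = q @ p.T / tau                                   # (B, B) cosine similarities
        labels = torch.arange(q.size(0), device=q.device)      # positives on the diagonal
        total = total + alpha * F.cross_entropy(sims, labels)
    return total

# Usage: three levels (e.g., pooled outputs from shallow, middle, and final layers),
# weighted toward the finest representation.
B = 8
levels_q = [torch.randn(B, 768) for _ in range(3)]
levels_p = [torch.randn(B, 768) for _ in range(3)]
loss = matryoshka_contrastive_loss(levels_q, levels_p, weights=(0.25, 0.25, 0.5))
```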
3. Large-Batch Cached Multiple Negatives Ranking Loss
Fine-tuning is performed with a batch-contrastive MNR loss, utilizing in-batch negatives (other positives in the batch) and cached negatives (embeddings from prior batches), scaling the pool of negatives up to 1,024 per step:

$$\mathcal{L}_{\text{MNR}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\!\big(s(q_i, p_i)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(s(q_i, p_j)/\tau\big)},$$

where $B$ is the batch size, $N$ is the number of candidate passages including the cached negatives, $s(q, p) = q^{\top} p$ (cosine similarity, as all vectors are $\ell_2$-normalized), and the temperature $\tau$ lies in the 0.05–0.07 range found empirically optimal for Turkish retrieval in these experiments.
Cached negatives enable effective large-batch training without corresponding hardware memory costs, stabilizing contrastive learning signals. Reducing the cache size from 1,024 to 256 leads to increased loss variance and an approximately 3-point drop in MRR@10, highlighting the importance of large negative pools.
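The sketch below illustrates one way to realize such a cached negative pool: a FIFO memory bank of detached passage embeddings from prior steps appended to the in-batch candidates. The cache mechanics (FIFO eviction, detached negatives) and the temperature are assumptions for illustration, not the authors' exact implementation, which could equally rely on gradient caching:

```python
# Sketch: MNR/InfoNCE loss over in-batch positives plus a FIFO cache of passage
# embeddings from earlier steps, growing the negative pool toward ~1,024 entries.
import collections
import torch
import torch.nn.functional as F

class CachedMNRLoss:
    def __init__(self, cache_size=1024, tau=0.05):
        self.cache = collections.deque(maxlen=cache_size)  # FIFO pool of (1, D) tensors
        self.tau = tau

    def __call__(self, q, p):
        """q, p: (B, D) L2-normalized query / positive-passage embeddings."""
        cached = list(self.cache)
        candidates = torch.cat([p] + cached, dim=0) if cached else p
        sims = q @ candidates.T / self.tau                 # (B, B + |cache|)
        labels = torch.arange(q.size(0), device=q.device)  # positive of query i is row i
        loss = F.cross_entropy(sims, labels)
        # Cache detached copies so stored negatives carry no gradient.
        self.cache.extend(p.detach().split(1, dim=0))
        return loss
```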
4. Training Procedure and Datasets
Training is conducted in three stages:
- Stage 1: NLI pre-fine-tuning on ALL-NLI-TR (482K pairs) with the MNR loss, batch size 128, for 3 epochs.
- Stage 2: STS pre-fine-tuning on STSb-TR (5.7K pairs) with the CoSENT loss, batch size 32, for 10 epochs.
- Stage 3: Retrieval fine-tuning on MS-Marco-TR (query–positive–negative triples) with the cached MNR loss, an effective batch size of 1,024 via gradient accumulation, a linearly decayed learning rate with 10% warm-up, and weight decay 0.01, for three epochs. No data augmentation is applied beyond hard-negative sampling from the cache.
This sequential regimen ensures both robust language understanding and retrieval discrimination.
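A condensed sketch of the three stages using the sentence-transformers `fit` API appears below. The toy `InputExample` lists stand in for ALL-NLI-TR, STSb-TR, and MS-Marco-TR, and the warm-up steps and learning-rate schedule are placeholders rather than the values used in the paper:

```python
# Sketch of the three-stage regimen with sentence-transformers losses.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

# Toy placeholder data; real training uses ALL-NLI-TR, STSb-TR, and MS-Marco-TR.
nli_pairs = [InputExample(texts=["bir öncül", "onu destekleyen bir hipotez"])]
sts_pairs = [InputExample(texts=["cümle A", "cümle B"], label=0.8)]   # similarity in [0, 1]
triples = [InputExample(texts=["sorgu", "pozitif pasaj", "zor negatif pasaj"])]

# Stage 1: NLI pre-fine-tuning with the in-batch MNR loss.
model.fit(train_objectives=[(DataLoader(nli_pairs, shuffle=True, batch_size=128),
                             losses.MultipleNegativesRankingLoss(model))],
          epochs=3, warmup_steps=100)

# Stage 2: STS pre-fine-tuning with the CoSENT loss on scored sentence pairs.
model.fit(train_objectives=[(DataLoader(sts_pairs, shuffle=True, batch_size=32),
                             losses.CoSENTLoss(model))],
          epochs=10, warmup_steps=100)

# Stage 3: retrieval fine-tuning with cached MNR over (query, positive, hard-negative) triples.
cached_mnr = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)
model.fit(train_objectives=[(DataLoader(triples, shuffle=True, batch_size=1024), cached_mnr)],
          epochs=3, weight_decay=0.01, warmup_steps=100)
```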
5. End-to-End Retrieval Pipeline
The deployment workflow is as follows:
- Offline embedding: Encode the target corpus (MS-Marco-TR or SciFact-TR) using TurkEmbed4Retrieval, resulting in 768-dimensional $\ell_2$-normalized passage embeddings.
- Index construction: Build a FAISS Flat-IP (inner product) index over all embeddings for efficient large-scale similarity search.
- Query processing: At inference, encode the user query to obtain its $\ell_2$-normalized embedding and perform nearest-neighbor search in the FAISS index for the top-K matches.
- Retrieval: Return documents corresponding to highest inner product similarity.
For production, storing embeddings in 16-bit floating point and leveraging FAISS's HNSW option can further improve memory and latency efficiency.
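A minimal end-to-end sketch of this pipeline with sentence-transformers and FAISS is shown below; the model identifier, the two-passage corpus, and K are placeholders:

```python
# Sketch: encode a corpus offline, build a FAISS Flat-IP index, and answer a query online.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("TurkEmbed4Retrieval")  # placeholder model id

corpus = ["Ankara Türkiye'nin başkentidir.",
          "FAISS yoğun vektörler üzerinde benzerlik araması yapar."]
doc_emb = model.encode(corpus, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(doc_emb.shape[1])  # exact inner product == cosine on normalized vectors
index.add(doc_emb)

query_emb = model.encode(["Türkiye'nin başkenti neresidir?"],
                         normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query_emb, 2)     # top-K passages (K = 2 here)
print([(corpus[i], float(s)) for i, s in zip(ids[0], scores[0])])
```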
6. Empirical Outcomes and Ablation Insights
On the SciFact-TR test set (6.3K queries), TurkEmbed4Retrieval achieves substantial gains over Turkish-colBERT, establishing new benchmarks:
| Metric | TurkEmbed4Retrieval | Turkish-colBERT | Improvement (%) |
|---|---|---|---|
| Recall@1 | 0.7116 | 0.4838 | +19.26 |
| Recall@5 | 0.9003 | 0.6785 | +32.68 |
| Recall@10 | 0.9417 | 0.7552 | +24.73 |
| MRR@10 | 0.8221 | 0.5688 | +28.25 |
| NDCG@10 | 0.8484 | — | — |
Ablation studies:
- Removing Matryoshka multi-layer training yields an approximately 2-point drop in Recall@10.
- Reducing cache size from 1,024 to 256 increases training instability and decreases MRR@10 by approximately 3 points.
On MS-Marco-TR, fine-tuning further improves retrieval metrics:
| Metric | Pre-fine-tune | Post-fine-tune |
|---|---|---|
| Recall@1 | 0.7400 | 0.7533 |
| Recall@5 | 0.8700 | 0.9067 |
| Recall@10 | 0.8967 | 0.9433 |
| MRR@10 | 0.7957 | 0.8221 |
| NDCG@10 | 0.8169 | 0.8484 |
| MAP@100 | 0.7937 | 0.8188 |
These results underline the importance of both large-batch contrastive loss and Matryoshka training for retrieval effectiveness.
7. Best Practices, Limitations, and Future Directions
For optimal deployment in Turkish retrieval settings:
- Use batch sizes (cache) ≥ 512 with a temperature in the range 0.05–0.07 during contrastive fine-tuning.
- Freeze early Transformer layers to retain cross-lingual morphology, fine-tuning upper layers and dense projection for retrieval signals.
- Store embeddings using low-precision formats; use FAISS Flat-IP or HNSW for scalable similarity search (a brief sketch follows this list).
- Refresh negative caches with live hard negatives drawn from recent query feedback.
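The sketch below illustrates the low-precision storage and HNSW indexing options; the HNSW parameters (M = 32, efSearch = 64) are illustrative defaults, not tuned values:

```python
# Sketch: store embeddings in float16 on disk and serve them from an HNSW index.
import faiss
import numpy as np

dim = 768
doc_emb = np.random.rand(10_000, dim).astype(np.float32)   # stand-in for real passage embeddings
faiss.normalize_L2(doc_emb)                                 # inner product == cosine similarity

np.save("passages_fp16.npy", doc_emb.astype(np.float16))    # roughly halves on-disk storage

index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efSearch = 64                                    # query-time recall/latency trade-off
index.add(doc_emb)

scores, ids = index.search(doc_emb[:1], 10)                 # top-10 neighbors of the first passage
```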
Potential avenues for extension include domain-adaptive pretraining on Turkish scientific or news corpora, integration of late interaction models or cross-encoders for reranking, utilization of synthetic query generation to amplify data in low-resource domains, and integration into multi-stage retrieval pipelines for RAG architectures targeting Turkish open-domain question answering.
TurkEmbed4Retrieval defines a new state-of-the-art in Turkish passage retrieval, driven by Matryoshka multi-scale embeddings and scalable large-batch negative sampling strategies, with an architecture and training paradigm readily adaptable for future advances in Turkish language retrieval and question answering.