TurkEmbed4STS: Turkish STS & Hallucination Model
- TurkEmbed4STS is a Turkish-specific encoder that combines matryoshka representation learning for STS embeddings with a token-level classification head for hallucination detection.
- It employs sequential NLI pretraining on Turkish datasets followed by STS fine-tuning with CoSENT loss, reaching state-of-the-art results on Turkish STS benchmarks.
- The model supports efficient long-context processing with dynamic embedding projections, enabling robust retrieval and generative evaluation in Turkish NLP.
TurkEmbed4STS is a Turkish-specific sentence and token encoder derived from the TurkEmbed family, optimized for Semantic Textual Similarity (STS) and further adapted for token-level hallucination detection in Retrieval-Augmented Generation (RAG) systems for Turkish. Developed on top of the gte-multilingual-base architecture, it integrates specialized sequential training and matryoshka representation learning, providing both high-accuracy embeddings and efficient, interpretable, long-context processing. TurkEmbed4STS achieves state-of-the-art performance on Turkish STS benchmarks and supports robust hallucination detection for real-world Turkish RAG applications (Ezerceli et al., 11 Nov 2025, Taş et al., 22 Sep 2025).
1. Model Architecture and Representation Learning
TurkEmbed4STS is based on the gte-multilingual-base encoder (12-layer transformer, hidden size 768, 305 M parameters), supporting sequences up to 8,192 tokens via rotary embeddings and local-global attention mechanisms (Ezerceli et al., 11 Nov 2025). The standard use involves mean-pooling the final hidden states to obtain a fixed 768-dimensional vector for each sentence. For scalability, TurkEmbed4STS introduces matryoshka representation learning, which allows dynamic projection into lower-dimensional subspaces by optimizing a set of linear projections $W_d$ for $d \in \{64, 128, 256, 512, 768\}$, each applied to $h$, where $h$ is the pooled top-layer hidden state. All projections are trained jointly, enabling inference-time truncation with graceful degradation in retrieval and similarity tasks.
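The pooling and dimension-reduction steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the common matryoshka convention of keeping the first `dim` coordinates and renormalizing, whereas the text speaks of learned linear projections. The function names and the 4-dimensional toy vectors (standing in for 768-dimensional hidden states) are hypothetical.

```python
import math

# Subspace sizes listed in the text.
MATRYOSHKA_DIMS = (64, 128, 256, 512, 768)

def mean_pool(token_vectors, attention_mask):
    """Average the vectors of non-padding tokens (mask value 1)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates, then rescale to unit length.

    Assumption: prefix truncation stands in for the learned projections.
    """
    if dim > len(embedding):
        raise ValueError("dim exceeds embedding size")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy example: three tokens, the last one is padding and is ignored.
tokens = [[1.0, 2.0, 0.0, 4.0], [3.0, 0.0, 2.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)            # [2.0, 1.0, 1.0, 2.0]
small = truncate_and_normalize(pooled, 2)   # unit vector along [2, 1]
```

Because the truncated vector is renormalized, cosine similarities at different dimensions stay on the same scale, which is what makes a smaller index interchangeable with the full one at some cost in accuracy.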
For hallucination detection, TurkEmbed4STS diverges from sentence-level regression models by (i) retaining all token-level representations and (ii) appending a lightweight linear classification head over the output of each token to assign a binary label (supported/hallucinated). This design enables token granularity in downstream detection, allowing interpretable span-level outputs adaptable to various structured and generative Turkish NLP tasks (Taş et al., 22 Sep 2025).
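A rough sketch of how a per-token linear head yields span-level output is given here. It is illustrative only: the weight matrix, bias, and 2-dimensional toy token vectors are made up, and the span-merging rule (joining consecutive hallucinated tokens) is an assumption about post-processing rather than something the source specifies.

```python
def token_logits(hidden, weight, bias):
    """Apply one linear layer to a token vector.

    Returns [logit_supported, logit_hallucinated].
    """
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]

def predict_labels(token_vectors, weight, bias):
    """Label a token 1 (hallucinated) if its class-1 logit is larger, else 0."""
    labels = []
    for hidden in token_vectors:
        z0, z1 = token_logits(hidden, weight, bias)
        labels.append(1 if z1 > z0 else 0)
    return labels

def merge_spans(labels):
    """Merge runs of consecutive 1s into (start, end) index pairs, end exclusive."""
    spans, start = [], None
    for i, y in enumerate(labels):
        if y == 1 and start is None:
            start = i
        elif y != 1 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

# Hypothetical head parameters and token representations.
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
vectors = [[2.0, 0.1], [0.2, 1.5], [0.1, 0.9], [3.0, 0.0], [0.0, 2.0]]

labels = predict_labels(vectors, weight, bias)
spans = merge_spans(labels)
```

Returning token index ranges rather than a single sentence score is what lets a downstream application highlight the unsupported part of an answer.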
2. Training Procedure and Datasets
The training pipeline is two-stage for embedding applications (Ezerceli et al., 11 Nov 2025):
- Natural Language Inference (NLI) Pretraining: Uses Turkish translations of SNLI and MultiNLI (All-NLI-TR, 482,091 train, 6,802 dev, 6,827 test), with Multiple Negatives Ranking Loss (MNRL) applied per embedding dimension.
- Semantic Textual Similarity (STS) Fine-Tuning: Uses the Turkish STS-Benchmark (STSB-TR, 5,749 train, 1,500 val, 1,379 test), optimizing the CoSENT loss and matryoshka losses per subspace.
Training uses the AdamW optimizer (learning rate [value missing]), batch sizes of 64 (NLI) and 32 (STS), and a maximum sequence length of 128 for embedding tasks.
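The STS fine-tuning objective can be sketched as a CoSENT loss evaluated on each matryoshka subspace and summed. This is a minimal sketch under stated assumptions: the subspaces are taken as vector prefixes, all dimensions are weighted equally, and the scale of 20 is a commonly used default rather than a value given in the source. The toy vectors and dimensions `(2, 4)` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cosent_loss(pred_cos, gold, scale=20.0):
    """CoSENT: log(1 + sum exp(scale * (cos_j - cos_i))) over pairs with gold_i > gold_j."""
    terms = [scale * (pred_cos[j] - pred_cos[i])
             for i in range(len(gold))
             for j in range(len(gold))
             if gold[i] > gold[j]]
    # Numerically stable log(1 + sum(exp(t))).
    m = max([0.0] + terms)
    return m + math.log(math.exp(-m) + sum(math.exp(t - m) for t in terms))

def matryoshka_cosent(pairs, gold, dims, scale=20.0):
    """Sum the CoSENT loss over each prefix length in `dims` (equal weights assumed)."""
    total = 0.0
    for d in dims:
        cos_d = [cosine(a[:d], b[:d]) for a, b in pairs]
        total += cosent_loss(cos_d, gold, scale)
    return total

# Two sentence pairs: the first is identical, the second is orthogonal.
pairs = [
    ([1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0]),
    ([1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]),
]
loss_consistent = matryoshka_cosent(pairs, [5.0, 0.0], dims=(2, 4))
loss_reversed = matryoshka_cosent(pairs, [0.0, 5.0], dims=(2, 4))
```

The loss only depends on the ordering of predicted similarities relative to the gold scores, so it is near zero when the ranking agrees with the labels and large when it is inverted.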
For hallucination detection, TurkEmbed4STS is fine-tuned from the STS-specialized checkpoint on a machine-translated Turkish RAGTruth dataset (17,790 train, 2,700 test instances; average input 801 tokens, up to 2,632) (Taş et al., 22 Sep 2025). The input concatenates “[CLS] question [SEP] context documents [SEP] generated answer [SEP],” with only answer tokens labeled (supported=0, hallucinated=1, background=-100). Training uses AdamW (learning rate [value missing]), batch size 4, and 6 epochs on an A100-40GB GPU.
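The input layout and labeling scheme can be illustrated as follows. This is a simplified sketch: whitespace-level tokens stand in for the model's subword tokenizer, and the helper name `build_example` and the Turkish example sentences are hypothetical.

```python
IGNORE = -100  # label for tokens excluded from the loss (question, context, specials)

def build_example(question, context, answer, answer_flags):
    """Concatenate the three segments and align labels with the token sequence.

    `answer_flags` holds one 0 (supported) or 1 (hallucinated) per answer token.
    """
    if len(answer) != len(answer_flags):
        raise ValueError("one flag per answer token is required")
    tokens = (["[CLS]"] + question + ["[SEP]"]
              + context + ["[SEP]"]
              + answer + ["[SEP]"])
    prefix_len = 1 + len(question) + 1 + len(context) + 1
    labels = [IGNORE] * prefix_len + list(answer_flags) + [IGNORE]
    return tokens, labels

question = ["Türkiye'nin", "başkenti", "neresidir", "?"]
context = ["Türkiye'nin", "başkenti", "Ankara'dır", "."]
answer = ["Başkent", "İstanbul'dur", "."]

tokens, labels = build_example(question, context, answer, [0, 1, 0])
```

With a real subword tokenizer, each word-level flag would additionally need to be propagated to every subword piece of that word.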
3. Token-Level Hallucination Detection Mechanism
Hallucination detection is framed as a token-level classification problem. Each generated answer token receives a label:
- 0: “supported” if the token can be grounded in the retrieved context.
- 1: “hallucinated” if not supported by any context.
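With $p_i(c)$ denoting the predicted probability of class $c \in \{0, 1\}$ for answer token $i$ and $y_i$ its gold label, the loss for each answer token is standard cross-entropy:
$$\ell_i = -\log p_i(y_i),$$
with total loss over the set $A$ of answer tokens:
$$\mathcal{L} = \sum_{i \in A} \ell_i.$$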
Input masking ensures no loss is computed over question or context tokens. This enables focused learning on generative outputs, supporting accurate span identification for hallucination analysis.
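A minimal sketch of the masked objective follows, assuming two logits per token and the label -100 for positions that must not contribute. Whether the original training sums or averages over answer tokens is not stated here, so the `average` flag is an illustrative option.

```python
import math

IGNORE = -100

def token_cross_entropy(logits, label):
    """Cross-entropy of one token: -log softmax(logits)[label], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

def masked_loss(all_logits, labels, average=False):
    """Sum (or mean) of per-token losses over positions whose label is not IGNORE."""
    losses = [token_cross_entropy(z, y)
              for z, y in zip(all_logits, labels)
              if y != IGNORE]
    if not losses:
        return 0.0
    total = sum(losses)
    return total / len(losses) if average else total

# First position is a context token (ignored); the other two are answer tokens.
logits = [[0.0, 0.0], [5.0, -5.0], [0.0, 0.0]]
labels = [IGNORE, 0, 1]
loss = masked_loss(logits, labels)

# Changing the logits of an ignored position leaves the loss unchanged.
loss_perturbed = masked_loss([[9.0, -9.0], [5.0, -5.0], [0.0, 0.0]], labels)
```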
4. Empirical Results and Comparative Performance
TurkEmbed4STS exhibits strong and balanced performance across multiple Turkish generation and retrieval tasks. On token-level hallucination detection (RAGTruth test set), performance metrics are as follows (Taş et al., 22 Sep 2025):
| Task Type | Precision | Recall | F1-Score | AUROC |
|---|---|---|---|---|
| Summary | 0.6325 | 0.5656 | 0.5862 | 0.5656 |
| Data2Text | 0.7397 | 0.7333 | 0.7365 | 0.7333 |
| QA | 0.7378 | 0.7382 | 0.7380 | 0.7382 |
| Whole set | 0.7268 | 0.7014 | 0.7132 | 0.7014 |
Compared to alternative Turkish hallucination detectors:
- ModernBERT: F1 = 0.7266 (slightly higher on the full set, especially in QA precision: 0.7583).
- EuroBERT: F1 = 0.7163 (competes in data-to-text, lower overall recall).
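Token-level precision, recall, and F1 of the kind reported above can be computed as in the sketch below. It scores only the hallucinated class and skips positions labeled -100; the averaging convention used for the published figures is not stated here and may differ, so this is an illustration of the metric definitions rather than a reproduction of the evaluation.

```python
def precision_recall_f1(gold, pred, positive=1, ignore=-100):
    """Precision, recall and F1 for the `positive` class over non-ignored tokens."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == ignore:
            continue
        if p == positive and g == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif g == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels: the first position is a context token and is skipped.
gold = [-100, 0, 1, 1, 0, 1]
pred = [1, 0, 1, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(gold, pred)
```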
For STS evaluation on STSB-TR (Ezerceli et al., 11 Nov 2025):
| Model | Emb. Dim | Pearson | Spearman |
|---|---|---|---|
| Emrecan’s Model | 768 | 0.834 | 0.830 |
| nomic-embed-text-v2-moe | 768 | 0.828 | 0.834 |
| multilingual-E5-large | 1024 | 0.846 | 0.854 |
| TurkEmbed4STS | 64–768 | 0.845 | 0.853 |
Relative to Emrecan’s model, TurkEmbed4STS improves Pearson correlation by 1.1 points and Spearman correlation by 2.3 points (absolute) on STSB-TR. On cross-lingual STS22-TR, TurkEmbed4STS attains 0.646 Pearson/0.668 Spearman (Emrecan: 0.540/0.563).
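For reference, the two correlation measures used in the STS table can be computed as sketched below, followed by the arithmetic behind the gains over Emrecan’s model on STSB-TR. The implementation is a plain standard-library version with average ranks for ties; the toy gold and predicted scores are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy gold similarity scores and predicted cosine similarities.
gold = [0.0, 1.0, 2.0, 3.0]
pred = [0.1, 0.2, 0.9, 0.8]
rho = spearman(gold, pred)

# Gains over Emrecan's model, from the STSB-TR table values.
pearson_gain = round(0.845 - 0.834, 3)
spearman_gain = round(0.853 - 0.830, 3)
```

Spearman depends only on the ordering of the predictions, which is why it is often reported alongside Pearson for cosine-based similarity scores whose scale is not calibrated to the gold range.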
5. Computational Efficiency and Scalability
TurkEmbed4STS is designed for efficient inference on contemporary hardware and long-context scenarios. For hallucination detection and STS:
- Model size: reported as 210–305 M parameters (the difference is attributed to classification head integration).
- Inference memory: ~10 GB for 8,192-token max-length on GPU.
- Latency: Tens of milliseconds per 800-token sequence at batch size 4 (hallucination detection, A100-40GB), ~2 ms/sentence (embedding, FP16, batch=1, T4 GPU) (Taş et al., 22 Sep 2025, Ezerceli et al., 11 Nov 2025).
- Matryoshka mechanism: Embedding dimension can be dynamically reduced at inference for improved speed, with modest accuracy loss.
This architecture is more resource-efficient than prompt-based hallucination detection with 70B+ parameter LLMs, especially for real-time or large-batch Turkish RAG applications.
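The storage side of the matryoshka trade-off is simple arithmetic. The sketch below assumes a hypothetical corpus of one million vectors stored as 32-bit floats; neither figure comes from the source.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a dense index: vectors x dimensions x bytes per value."""
    return num_vectors * dim * bytes_per_value

NUM_VECTORS = 1_000_000            # hypothetical corpus size
DIMS = (64, 128, 256, 512, 768)    # subspace sizes listed in the text

sizes = {d: index_size_bytes(NUM_VECTORS, d) for d in DIMS}
for d in DIMS:
    print(f"dim={d:>3}: {sizes[d] / 1e6:,.0f} MB")

# Ratio between the full and the smallest embedding.
reduction = sizes[768] // sizes[64]
```

Under these assumptions the full 768-dimensional index takes about 3.07 GB against 256 MB at 64 dimensions, a 12x reduction, before any index overhead or quantization is considered.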
6. Strengths, Limitations, and Error Characterization
Strengths:
- Precise token-level supervision enables interpretable hallucination span detection.
- Consistent, well-balanced precision and recall in QA and data-to-text; robust across various generative Turkish tasks.
- Maintains competitive or superior STS correlation compared to prior art.
- Supports extremely long contexts (up to 8,192 tokens).
- Embedding output adaptable to application resource constraints.
Limitations:
- Summarization remains the hardest detection challenge (F1 ≈ 0.59).
- Slightly lower performance than ModernBERT in some QA cases and EuroBERT in data-to-text.
- Embedding inference is slower and uses more GPU memory than smaller models (e.g., Emrecan’s model), which suggests room for further distillation.
Error Analysis: Missed hallucinations are concentrated in subtle introductions of baseless information, where micro-level context drift is difficult for the model to detect. False positives typically appear for technical terms absent from the retrieved context in data-to-text. There is no dedicated ablation for TurkEmbed4STS within hallucination detection, but its balance between precision and recall is notable.
7. Applications, Impact, and Future Directions
TurkEmbed4STS supports a broad suite of Turkish NLP tasks, including:
- Token-level hallucination detection in RAG pipelines (Taş et al., 22 Sep 2025).
- Semantic search, clustering, passage retrieval, paraphrase mining, and metrics for generative evaluation (Ezerceli et al., 11 Nov 2025).
- Downstream adaptation for real-time, resource-constrained, or large-scale Turkish language applications.
The design demonstrates that sequential NLI → STS training, matryoshka embedding, and explicit token-level objectives provide a stable foundation for Turkish-specific reliability in both generative evaluation and embedding retrieval contexts. Further improvement opportunities include model distillation for speed, integration of human-annotated parallel data for robustness, and more dynamic dimension selection in deployment (Ezerceli et al., 11 Nov 2025).