TurkEmbed4STS: Turkish STS & Hallucination Model

Updated 1 January 2026

The paper introduces TurkEmbed4STS, a Turkish-specific encoder that combines matryoshka representation learning for STS with token-level hallucination detection.
It employs sequential NLI pretraining on Turkish datasets followed by STS fine-tuning with CoSENT loss to achieve state-of-the-art results on Turkish STS benchmarks.
The model supports efficient long-context processing with dynamic embedding projections, enabling robust retrieval and generative evaluation in Turkish NLP.

TurkEmbed4STS is a Turkish-specific sentence and token encoder derived from the TurkEmbed family, optimized for Semantic Textual Similarity (STS) and further adapted for token-level hallucination detection in Retrieval-Augmented Generation (RAG) systems for Turkish. Developed on top of the gte-multilingual-base architecture, it integrates specialized sequential training and matryoshka representation learning, providing both high-accuracy embeddings and efficient, interpretable, long-context processing. TurkEmbed4STS achieves state-of-the-art performance on Turkish STS benchmarks and supports robust hallucination detection for real-world Turkish RAG applications (Ezerceli et al., 11 Nov 2025, Taş et al., 22 Sep 2025).

1. Model Architecture and Representation Learning

TurkEmbed4STS is based on the gte-multilingual-base encoder (12-layer transformer, hidden size 768, 305 M parameters), which supports sequences up to 8,192 tokens via rotary embeddings and local-global attention mechanisms (Ezerceli et al., 11 Nov 2025). In standard use, the final hidden states are mean-pooled to obtain a fixed 768-dimensional vector for each sentence. For scalability, TurkEmbed4STS introduces matryoshka representation learning, which allows dynamic projection into lower-dimensional subspaces (64, 128, 256, 512, or 768 dimensions) by optimizing a set of linear projections $z_d = P_d(h_L)$ for $d \in \{64, 128, 256, 512, 768\}$, where $h_L$ is the pooled top-layer hidden state. All projections are trained jointly, enabling inference-time truncation with graceful degradation in retrieval and similarity tasks.
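The pooling and truncation steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes that each projection $P_d$ simply keeps the first $d$ coordinates of the pooled vector and rescales it to unit length, which is how matryoshka embeddings are commonly consumed at inference time. The function names `mean_pool` and `truncate_and_normalize` are hypothetical.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention_mask entry is 1."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]

def truncate_and_normalize(embedding, dim):
    """Assumed form of P_d: keep the first `dim` coordinates, then L2-normalize."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

# Toy example with 4-dimensional "hidden states"; the last token is padding.
hidden = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [100.0, 100.0, 100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(hidden, mask)           # padding token is ignored
small = truncate_and_normalize(pooled, 2)  # 2-dimensional unit vector
```

With the real model the pooled vector would have 768 coordinates and `dim` would be one of 64, 128, 256, 512, or 768.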

For hallucination detection, TurkEmbed4STS diverges from sentence-level regression models by (i) retaining all token-level representations and (ii) appending a lightweight linear classification head over the output of each token to assign a binary label (supported/hallucinated). This design enables token granularity in downstream detection, allowing interpretable span-level outputs adaptable to various structured and generative Turkish NLP tasks (Taş et al., 22 Sep 2025).
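To illustrate how per-token labels can become span-level outputs, the sketch below merges consecutive tokens flagged as hallucinated into character spans. It is a hypothetical post-processing step, not taken from the paper: the 0.5 threshold, the `hallucinated_spans` name, and the word-level offsets are all assumptions made for the example.

```python
def hallucinated_spans(offsets, probs, threshold=0.5):
    """Merge consecutive tokens with probability >= threshold into (start, end) spans.

    offsets: (start, end) character offsets of each answer token
    probs:   predicted hallucination probability of each answer token
    """
    spans = []
    current = None
    for (start, end), p in zip(offsets, probs):
        if p >= threshold:
            if current is None:
                current = [start, end]   # open a new span
            else:
                current[1] = end         # extend the open span
        elif current is not None:
            spans.append(tuple(current)) # close the span at a supported token
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans

answer = "Ankara 1923 yılında başkent oldu"
offsets = [(0, 6), (7, 11), (12, 19), (20, 27), (28, 32)]
probs = [0.1, 0.9, 0.8, 0.2, 0.1]  # made-up model outputs

spans = hallucinated_spans(offsets, probs)
flagged = [answer[s:e] for s, e in spans]
```

Here the two adjacent flagged tokens are reported as a single span, which is easier to show to a user than isolated token labels.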

2. Training Procedure and Datasets

The training pipeline is two-stage for embedding applications (Ezerceli et al., 11 Nov 2025):

Natural Language Inference (NLI) Pretraining: Uses Turkish translations of SNLI and MultiNLI (All-NLI-TR, 482,091 train, 6,802 dev, 6,827 test), with Multiple Negatives Ranking Loss (MNRL) applied per embedding dimension.
Semantic Textual Similarity (STS) Fine-Tuning: Uses the Turkish STS-Benchmark (STSB-TR, 5,749 train, 1,500 val, 1,379 test), optimizing the CoSENT loss and matryoshka losses per subspace.

Training parameters include AdamW optimizer (learning rate $2 \times 10^{-5}$), batch sizes 64 (NLI) and 32 (STS), and maximum sequence length 128 for embedding tasks.
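The STS fine-tuning objective can be sketched as follows. The CoSENT loss penalizes every pair of examples whose predicted cosine similarities are ordered differently from their gold scores: $\log\big(1 + \sum_{s_i > s_j} \exp(\lambda(\cos_j - \cos_i))\big)$. The sketch then sums this loss over several embedding prefixes. The scale $\lambda = 20$, the equal weighting of dimensions, and the use of prefix truncation for each subspace are assumptions for illustration, not settings confirmed by the source.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cosent_loss(pred_cos, gold, scale=20.0):
    """log(1 + sum over pairs with gold[i] > gold[j] of exp(scale * (cos_j - cos_i)))."""
    terms = [0.0]  # exp(0) supplies the "1" inside the logarithm
    for i in range(len(gold)):
        for j in range(len(gold)):
            if gold[i] > gold[j]:
                terms.append(scale * (pred_cos[j] - pred_cos[i]))
    m = max(terms)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def matryoshka_cosent(pairs, gold, dims, scale=20.0):
    """Sum the CoSENT loss over embedding prefixes of each size in `dims` (assumed weighting)."""
    total = 0.0
    for d in dims:
        cos_d = [cosine(u[:d], v[:d]) for u, v in pairs]
        total += cosent_loss(cos_d, gold, scale)
    return total

# Correctly ordered predictions give a near-zero loss; reversed ones are penalized.
good = cosent_loss([0.9, 0.1], [1.0, 0.0])
bad = cosent_loss([0.1, 0.9], [1.0, 0.0])

# Toy 4-dimensional sentence pairs evaluated at prefix sizes 2 and 4.
pairs = [([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]),
         ([1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0])]
total = matryoshka_cosent(pairs, [1.0, 0.0], dims=(2, 4))
```

In a real training run the prefix sizes would be 64, 128, 256, 512, and 768, and the loss would be computed on batches of model outputs rather than hand-written vectors.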

For hallucination detection, TurkEmbed4STS is fine-tuned from the STS-specialized checkpoint on a machine-translated Turkish RAGTruth dataset (17,790 train, 2,700 test instances; average input 801 tokens, up to 2,632) (Taş et al., 22 Sep 2025). The input concatenates “[CLS] question [SEP] context documents [SEP] generated answer [SEP],” with only answer tokens labeled (supported=0, hallucinated=1, background=-100). Training uses AdamW (learning rate $1 \times 10^{-5}$), batch size 4, 6 epochs, on an A100-40GB GPU.
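The labeling scheme for this input layout can be sketched as below. It is a simplified illustration: the helper `build_labels` is hypothetical, the special tokens are assumed to occupy exactly one position each, and the segment lengths are made-up token counts rather than real tokenizer output.

```python
IGNORE = -100  # label for positions that must not contribute to the loss

def build_labels(question_len, context_len, answer_flags):
    """Label vector for: [CLS] question [SEP] context [SEP] answer [SEP].

    answer_flags holds one entry per answer token: 0 = supported, 1 = hallucinated.
    """
    # [CLS] + question tokens + [SEP] + context tokens + [SEP]
    prefix_len = 1 + question_len + 1 + context_len + 1
    labels = [IGNORE] * prefix_len
    labels += [int(flag) for flag in answer_flags]
    labels.append(IGNORE)  # final [SEP]
    return labels

# 2 question tokens, 3 context tokens, 3 answer tokens (the last two hallucinated).
labels = build_labels(2, 3, [0, 1, 1])
```

Only the three answer positions carry a 0 or 1; every other position is set to -100 so that a loss function configured to ignore that index skips it.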

3. Token-Level Hallucination Detection Mechanism

Hallucination detection is framed as a token-level classification problem. Each generated answer token receives a label:

0: “supported” if the token can be grounded in the retrieved context.
1: “hallucinated” if not supported by any context.

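The loss for each answer token $i$ is binary cross-entropy, where $y_i \in \{0, 1\}$ is the gold label and $p_i$ is the predicted probability that the token is hallucinated:

$L_i = -[y_i \log p_i + (1-y_i) \log(1 - p_i)]$

with total loss averaged over the $N$ answer tokens:

$\text{Loss} = \frac{1}{N} \sum_{i=1}^{N} L_i$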

Input masking ensures no loss is computed over question or context tokens. This enables focused learning on generative outputs, supporting accurate span identification for hallucination analysis.
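A minimal sketch of the masked loss, assuming the -100 ignore label used for question and context tokens and a small clipping constant `eps` (an assumption added here to avoid taking the logarithm of zero):

```python
import math

def masked_bce(probs, labels, ignore_index=-100, eps=1e-12):
    """Mean binary cross-entropy over positions whose label is not ignore_index."""
    total = 0.0
    n = 0
    for p, y in zip(probs, labels):
        if y == ignore_index:
            continue  # question, context, and special tokens are skipped
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        n += 1
    return total / n if n else 0.0

# Two answer tokens (one hallucinated, one supported), two masked positions.
probs = [0.9, 0.5, 0.5, 0.2]
labels = [-100, 1, 0, -100]
loss = masked_bce(probs, labels)
```

With both answer tokens predicted at probability 0.5, each contributes $\log 2$, so the mean over $N = 2$ tokens is also $\log 2 \approx 0.693$; the masked positions have no effect on the result.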

4. Empirical Results and Comparative Performance

TurkEmbed4STS exhibits strong and balanced performance across multiple Turkish generation and retrieval tasks. On token-level hallucination detection (RAGTruth test set), performance metrics are as follows (Taş et al., 22 Sep 2025):

Task Type	Precision	Recall	F1-Score	AUROC
Summary	0.6325	0.5656	0.5862	0.5656
Data2Text	0.7397	0.7333	0.7365	0.7333
QA	0.7378	0.7382	0.7380	0.7382
Whole set	0.7268	0.7014	0.7132	0.7014

Compared to alternative Turkish hallucination detectors:

ModernBERT: F1 = 0.7266 (slightly higher on the full set, especially in QA precision: 0.7583).
EuroBERT: F1 = 0.7163 (competes in data-to-text, lower overall recall).

For STS evaluation on STSB-TR (Ezerceli et al., 11 Nov 2025):

Model	Emb. Dim	Pearson	Spearman
Emrecan’s Model	768	0.834	0.830
nomic-embed-text-v2-moe	768	0.828	0.834
multilingual-E5-large	1024	0.846	0.854
TurkEmbed4STS	64–768	0.845	0.853

Relative to Emrecan’s model, TurkEmbed4STS improves Pearson by 0.011 and Spearman by 0.023 in absolute terms (1.1 and 2.3 percentage points) on STSB-TR. On cross-lingual STS22-TR, TurkEmbed4STS attains 0.646 Pearson/0.668 Spearman (Emrecan: 0.540/0.563).
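For readers unfamiliar with the two correlation metrics, the sketch below computes them between predicted cosine similarities and gold similarity scores. It is a generic illustration with made-up numbers, not the evaluation script used in the paper. Spearman is obtained as the Pearson correlation of the rank vectors, with tied values sharing their average rank.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

predicted = [0.10, 0.40, 0.35, 0.80]  # hypothetical cosine similarities
gold = [1.0, 2.5, 3.0, 5.0]           # hypothetical gold scores on a 0-5 scale
rho = spearman(predicted, gold)
r = pearson(predicted, gold)
```

Pearson measures linear agreement with the gold scores, while Spearman only checks that sentence pairs are ranked in the same order, which is why the two values in the table can differ slightly.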

5. Computational Efficiency and Scalability

TurkEmbed4STS is designed for efficient inference on contemporary hardware and long-context scenarios. For hallucination detection and STS:

Model size: 210–305 M parameters (slight size difference due to classification head integration).
Inference memory: ~10 GB for 8,192-token max-length on GPU.
Latency: Tens of milliseconds per 800-token sequence at batch size 4 (hallucination detection, A100-40GB), ~2 ms/sentence (embedding, FP16, batch=1, T4 GPU) (Taş et al., 22 Sep 2025, Ezerceli et al., 11 Nov 2025).
Matryoshka mechanism: Embedding dimension can be dynamically reduced at inference for improved speed, with modest accuracy loss.

This architecture is more resource-efficient than prompt-based hallucination detection with 70B+ parameter LLMs, especially for real-time or large-batch Turkish RAG applications.
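To make the matryoshka trade-off concrete, the following back-of-the-envelope calculation estimates the raw storage needed for a vector index at each supported dimension. The corpus size of one million sentences and the 4-byte float32 storage format are assumptions chosen for illustration; index overhead is ignored.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for num_vectors dense embeddings of width dim."""
    return num_vectors * dim * bytes_per_value

NUM_VECTORS = 1_000_000  # hypothetical corpus size

sizes_mb = {}
for dim in (64, 128, 256, 512, 768):
    sizes_mb[dim] = index_size_bytes(NUM_VECTORS, dim) / 1e6
    print(f"{dim:>3} dims: {sizes_mb[dim]:,.0f} MB")
```

Under these assumptions the 64-dimensional index is twelve times smaller than the 768-dimensional one, and dot-product search cost shrinks by the same factor, since both scale linearly with the embedding width.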

6. Strengths, Limitations, and Error Characterization

Strengths:

Precise token-level supervision enables interpretable hallucination span detection.
Consistent, well-balanced precision and recall in QA and data-to-text; robust across various generative Turkish tasks.
Maintains competitive or superior STS correlation compared to prior art.
Supports extremely long contexts (up to 8,192 tokens).
Embedding output adaptable to application resource constraints.

Limitations:

Summarization remains the hardest detection challenge (F1 ≈ 0.59).
Slightly lower performance than ModernBERT in some QA cases and EuroBERT in data-to-text.
Embedding inference is slower and uses more GPU memory than with smaller models (e.g., Emrecan’s), which suggests room for further distillation.

Error Analysis: Missed hallucinations are concentrated in subtle baseless introductions, where micro-level context drift is difficult for the model to detect. False positives typically appear for technical terms absent from the retrieved context in data-to-text. No dedicated ablation isolates TurkEmbed4STS within the hallucination detection study, but its balance between precision and recall is notable.

7. Applications, Impact, and Future Directions

TurkEmbed4STS supports a broad suite of Turkish NLP tasks, including:

Token-level hallucination detection in RAG pipelines (Taş et al., 22 Sep 2025).
Semantic search, clustering, passage retrieval, paraphrase mining, and metrics for generative evaluation (Ezerceli et al., 11 Nov 2025).
Downstream adaptation for real-time, resource-constrained, or large-scale Turkish language applications.

The design demonstrates that sequential NLI → STS training, matryoshka embedding, and explicit token-level objectives provide a stable foundation for Turkish-specific reliability in both generative evaluation and embedding retrieval contexts. Further improvement opportunities include model distillation for speed, integration of human-annotated parallel data for robustness, and more dynamic dimension selection in deployment (Ezerceli et al., 11 Nov 2025).