GEMBA: LLM-Based MT Evaluation Metrics
- GEMBA is a family of large language model-based evaluation metrics that provide both reference-based and reference-free assessments for machine translation quality.
- The methodology employs zero-shot prompting, document-level reranking, and prompt compression to achieve state-of-the-art performance and cost efficiency.
- Extensions like GEMBA-MQM enable detailed error span identification and applications in diverse fields such as sign language translation, driving advances in MT evaluation.
GEMBA is a family of LLM-based evaluation metrics designed for fine-grained, reference-based and reference-free assessment of machine translation (MT) quality. Originating from the GPT family but now encompassing various LLM architectures, GEMBA metrics have demonstrated state-of-the-art performance on system-level ranking, quality estimation, and error detection tasks through prompting-based, zero-shot evaluation. Multiple variants—including GEMBA-DA (Direct Assessment), GEMBA-MQM (MQM error annotation), and document-level adaptations—enable both segment-level and document-level analysis. GEMBA metrics underpin many of the latest advances in quality estimation reranking and cost-efficient evaluation.
1. Core GEMBA Formulation and Methodology
At its foundation, GEMBA is a zero-shot, per-segment or per-document prompting pipeline. Given a source segment (or full document) $s$, a translation hypothesis $h$, and optionally a human reference $r$, a backbone LLM is prompted to output either a continuous quality score (typically in $[0, 100]$) or error annotations. The principal prompt variants are:
- Reference-Based (Quality Metric): The LLM receives $s$, $h$, and $r$, producing a scalar score indicating translation adequacy and fluency.
- Reference-Free (Quality Estimation): The LLM receives only $s$ and $h$, estimating a score or error spans without any ground-truth reference.
The typical direct assessment prompt is:
Score the following translation from {source_lang} to {target_lang} on a continuous scale from 0 to 100, where 0 means 'no meaning preserved' and 100 means 'perfect meaning and grammar'.
{source_lang} source: "{source_seg}"
{target_lang} translation: "{target_seg}"
Score:
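GEMBA enforces numeric outputs within the valid range: when the LLM returns an invalid answer, sampling is retried with a gradually raised temperature to work around unstable outputs. Segment scores are aggregated system-wise by averaging:
$$\mathrm{score}_{\mathrm{sys}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{score}_i,$$
where $\mathrm{score}_i$ is the score of the $i$-th segment and $N$ is the number of segments.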
In GEMBA-DA, this direct-assessment scoring is applied either at the segment or document level, with document-level scores facilitating candidate reranking for long-context MT scenarios (Mrozinski et al., 10 Oct 2025, Kocmi et al., 2023).
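The prompting, validation, and aggregation steps described in this section can be sketched as follows. This is a minimal illustration, not the reference implementation: the `llm` callable, the temperature schedule `(0.0, 0.5, 1.0)`, the regex-based parser, and the choice to skip unscored segments when averaging are all assumptions made for the example.

```python
import re
from statistics import mean
from typing import Callable, Optional, Sequence

# Direct-assessment template, following the prompt shown above.
DA_TEMPLATE = (
    "Score the following translation from {source_lang} to {target_lang} "
    "on a continuous scale from 0 to 100, where 0 means 'no meaning preserved' "
    "and 100 means 'perfect meaning and grammar'.\n"
    '{source_lang} source: "{source_seg}"\n'
    '{target_lang} translation: "{target_seg}"\n'
    "Score: "
)


def parse_score(text: str) -> Optional[float]:
    """Return the first number in `text` if it lies in [0, 100], else None."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 100.0 else None


def score_segment(
    llm: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> reply
    source_lang: str,
    target_lang: str,
    source_seg: str,
    target_seg: str,
    temperatures: Sequence[float] = (0.0, 0.5, 1.0),  # assumed retry schedule
) -> Optional[float]:
    """Query the LLM, retrying at higher temperatures until a valid score appears."""
    prompt = DA_TEMPLATE.format(
        source_lang=source_lang,
        target_lang=target_lang,
        source_seg=source_seg,
        target_seg=target_seg,
    )
    for temperature in temperatures:
        score = parse_score(llm(prompt, temperature))
        if score is not None:
            return score
    return None  # no valid answer at any temperature


def system_score(segment_scores: Sequence[Optional[float]]) -> Optional[float]:
    """Average the valid segment scores (assumption: unscored segments are skipped)."""
    valid = [s for s in segment_scores if s is not None]
    return mean(valid) if valid else None
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; a real deployment would wrap its own client in a function with this shape.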
2. GEMBA-MQM: Error Span Identification and MQM Scoring
GEMBA-MQM extends the core framework by instructing the LLM to behave as an MQM (Multidimensional Quality Metrics) annotator, marking error spans according to the standard MQM taxonomy: an error category (accuracy, fluency, style, terminology, locale) and a severity class (critical, major, minor). The default three-shot prompt structure is invariant across language pairs and includes:
- A system instruction framing the LLM as a quality annotator.
- Three few-shot annotations demonstrating how to list error spans, MQM types, and severities.
- The candidate source-target pair to be evaluated.
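The score for a segment, given the set of identified error spans $E$, is the negated, severity-weighted error count:
$$\mathrm{MQM}(E) = -\sum_{e \in E} w_{\mathrm{sev}(e)},$$
with standard coefficients $w_{\mathrm{critical}} = 25$, $w_{\mathrm{major}} = 5$, $w_{\mathrm{minor}} = 1$.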
System-level and segment-level quality are evaluated by pairwise system-ranking agreement with human MQM judgments and by metrics such as system-level Pearson correlation and segment-level Kendall's $\tau$ (Kocmi et al., 2023).
| MQM Severity | Weight Assignment |
|---|---|
| critical | 25 |
| major | 5 |
| minor | 1 |
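As a rough sketch of how the severity weights in the table turn annotated errors into a segment score, and how system rankings can be compared against human judgments, consider the following. The function names, the negated sign convention (so that 0 is the best score), and the treatment of ties as disagreements are assumptions made for illustration.

```python
from itertools import combinations
from typing import Mapping, Sequence

# Severity weights from the table above.
SEVERITY_WEIGHTS = {"critical": 25, "major": 5, "minor": 1}


def mqm_score(severities: Sequence[str]) -> int:
    """Negated weighted error count; 0 means no errors were annotated."""
    return -sum(SEVERITY_WEIGHTS[severity] for severity in severities)


def pairwise_accuracy(
    metric: Mapping[str, float], human: Mapping[str, float]
) -> float:
    """Fraction of system pairs that the metric orders the same way as humans.

    Both mappings go from system name to system-level score. Ties in either
    ranking are counted as disagreements (an assumption of this sketch).
    """
    pairs = list(combinations(sorted(metric), 2))
    agreements = sum(
        (metric[a] - metric[b]) * (human[a] - human[b]) > 0 for a, b in pairs
    )
    return agreements / len(pairs)
```

For example, a segment annotated with one major and two minor errors receives `mqm_score(["major", "minor", "minor"])`, that is, a penalty of 5 + 1 + 1 expressed as a negative score.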
3. GEMBA in Document-Level and Quality Estimation Reranking
GEMBA-DA has been applied to document-level MT reranking, particularly in scenarios requiring selection from $k$ translation candidates per source document. Here, the scoring function for a pair (source document $d$, candidate $c_j$) is:
$$\mathrm{score}(d, c_j) = \mathrm{GEMBA\text{-}DA}(d, c_j), \qquad j = 1, \dots, k,$$
implemented as a zero-shot, direct assessment-style prompt sent to a long-context LLM such as Gemma 3 27B. GEMBA-DA enables efficient reranking without any additional fine-tuning, leveraging the LLM's contextual capabilities. Experimentally, reranking with GEMBA-DA produces significant BLEURT-20 improvements, up to +4.30 BLEURT-20 for a pool of 32 document candidates, with runtime overhead consistently under 20% of the total translation+QE pipeline for modest pool sizes (Mrozinski et al., 10 Oct 2025).
| Pool Size | BLEURT-20 Gain (GEMBA-DA) |
|---|---|
| 2 | +1.63 |
| 32 | +4.30 |
Document-level GEMBA-DA is competitive with top learned metrics while retaining advantages such as training-free deployment, native handling of long contexts, and minimal generation overhead.
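The reranking step itself reduces to scoring every candidate and keeping the best one. The sketch below assumes a hypothetical `scorer(source_doc, candidate)` callable that wraps the document-level direct-assessment prompt and returns a number; the tie-breaking rule (first highest-scoring candidate wins) is also an assumption.

```python
from typing import Callable, Sequence, Tuple


def rerank(
    source_doc: str,
    candidates: Sequence[str],
    scorer: Callable[[str, str], float],  # hypothetical GEMBA-DA wrapper
) -> Tuple[int, float]:
    """Return (index, score) of the highest-scoring candidate translation."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    scores = [scorer(source_doc, candidate) for candidate in candidates]
    # max() keeps the first index among equal scores, so ties favor earlier candidates.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]
```

Because each candidate requires one LLM call, the cost of this loop grows linearly with the pool size, which is why the overhead figures above are reported for modest pools.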
4. Cost-Efficient Evaluation: Prompt Compression and Batch Processing
The token and API cost of prompting-based LLM metrics is a critical consideration for large-scale evaluation. Several approaches have been introduced to mitigate these costs:
- PromptOptMe: A two-stage pipeline that employs a smaller, fine-tuned LM to compress prompt inputs for GEMBA-MQM, preserving error-relevant spans and context. Preference optimization (ORPO objective) is used to maintain evaluation quality after compression. This achieves a token reduction for GPT-4o with no loss of system-level ranking or segment-level correlation (Larionov et al., 2024).
- BatchGEMBA-MQM: Aggregates multiple translation examples into a single batch-prompt, reducing token usage by 2–4× depending on batch size, with prompt-compression models delivering an additional 13–15% token savings. Compression mitigates the evaluation degradation seen with large batches and retains over 90% of baseline Pearson at batch size 4 (Larionov et al., 4 Mar 2025).
| Model (token usage) | B=1 (ref) | B=2 | B=4 | B=8 |
|---|---|---|---|---|
| GPT-4o | 5.4M | 4.1M | 2.8M | 2.1M |
| +Compression | 4.7M | 3.8M | 2.6M | 1.9M |
Prompt compression, when paired with batching, is essential for scalable deployment of GEMBA-MQM in production-class evaluation pipelines.
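The token savings from batching come from sending the shared instructions once per batch rather than once per example. A minimal sketch of that idea follows; the prompt layout (`Example i`, `Source:`, `Translation:`) and the helper names are illustrative assumptions, not the format used by BatchGEMBA-MQM.

```python
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (source segment, translation hypothesis)


def make_batches(examples: Sequence[Example], batch_size: int) -> List[List[Example]]:
    """Split examples into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(examples[i : i + batch_size])
        for i in range(0, len(examples), batch_size)
    ]


def build_batch_prompt(instruction: str, batch: Sequence[Example]) -> str:
    """Emit the instruction once, followed by numbered source/translation pairs."""
    parts = [instruction]
    for index, (source, hypothesis) in enumerate(batch, start=1):
        parts.append(f"Example {index}\nSource: {source}\nTranslation: {hypothesis}")
    return "\n\n".join(parts)
```

Numbering the examples gives the model a handle for reporting errors per example, which the caller then has to parse back into per-segment results.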
5. Applications Beyond Text MT: Sign Language Translation and Semantic Evaluation
GEMBA’s prompting paradigm has been leveraged to assess sign language translation (SLT) outputs, addressing the limitations of traditional lexical-overlap metrics such as BLEU or chrF. GEMBA’s LLM backbone enables semantic scoring robust to paraphrasing and hallucination, ranking models consistently even under variation of sentence length. Notably, GEMBA scores increase under LLM-paraphrased outputs (paraphrase bias), and, while detecting major hallucinations, can under-penalize fluent but subtly incorrect outputs (Yazdani et al., 29 Oct 2025).
GEMBA correlates strongly with BLEURT (Pearson ≈ 0.85–0.9) but only weakly with BLEU, chrF, and ROUGE, which indicates that it measures semantic similarity rather than surface-level lexical overlap.
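Correlations of this kind are computed over paired scores that two metrics assign to the same outputs. As a small self-contained sketch, the Pearson coefficient can be written out directly; the function name and the error handling for constant inputs are choices made for this example.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / sqrt(var_x * var_y)
```

Here `x` would hold, say, GEMBA scores and `y` the BLEURT scores for the same translations.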
6. Limitations, Biases, and Recommendations
While GEMBA offers demonstrable gains, several caveats exist:
- Prompt/Model Dependence: Output distributions are sensitive to prompt wording. Minimal prompts have empirically proven robust, but systematic tuning is often necessary per LLM and task.
- Reproducibility: With proprietary APIs (e.g., GPT-4), unannounced model updates can cause scores to drift over time. All experiments to date have been in high-resource languages; low-resource generalization is untested (Kocmi et al., 2023).
- Cost: Evaluating large corpora with full prompts is computationally expensive; batching and prompt compression should be adopted where feasible (Larionov et al., 2024, Larionov et al., 4 Mar 2025).
- Biases: GEMBA and similar LLM-based metrics consistently reward paraphrased, fluent outputs—even when surface changes do not correspond to improved semantics. This LLM bias motivates continued research on counteracting over-estimation and achieving more transparent error attribution (Yazdani et al., 29 Oct 2025).
Best practices for GEMBA in evaluation pipelines include prompt compression, moderate-size batching, fallback integration with learned QE encoders for robust coverage, and careful cross-validation of output stability. Open-source implementations and prompt templates are available for many GEMBA variants (Kocmi et al., 2023).
7. Impact and Future Directions
GEMBA metrics have established a new reference point for MT evaluation, achieving state-of-the-art system-level accuracy compared to human MQM rankings—e.g., 96.5% pairwise ranking accuracy with GPT-4 in GEMBA-MQM (Kocmi et al., 2023), and substantial BLEURT-20 gains in document-level and reranking contexts (Mrozinski et al., 10 Oct 2025). Ongoing research focuses on efficiency, transparency, and extensions to new modalities (e.g., SLT), as well as developing open-source LLM alternatives to proprietary black-box evaluators (Larionov et al., 2024, Larionov et al., 4 Mar 2025).
A plausible implication is that as LLM backbones evolve with longer context, multilingual coverage, and cost-conscious deployment (compression, batching), GEMBA will continue to serve as a foundational architecture for reference-based and reference-free MT evaluation, error detection, and reranking in production and research settings.