GEMBA Metric: LLM Evaluation for Machine Translation

Updated 4 September 2025
  • GEMBA is a GPT-based automated evaluation metric that uses zero-shot prompts to elicit continuous or discrete quality judgments from LLMs for machine translation.
  • It employs four template variants, such as GEMBA-DA and GEMBA-stars, whose aggregated segment-level scores align closely with human assessments.
  • GEMBA drives efficient, explainable translation evaluation by integrating batching, prompt compression, and the advanced reasoning of high-capacity LLMs.

GEMBA (GPT Estimation Metric Based Assessment) is a family of LLM-based metrics designed for the automated evaluation of machine translation (MT) quality. Introduced as the first MT assessment approach leveraging zero-shot prompting of generative LLMs, GEMBA enables both reference-based and reference-less evaluation by directly eliciting continuous or discrete quality judgments from models such as GPT-3.5, ChatGPT, and GPT-4. GEMBA and its extensions have driven state-of-the-art alignment with human evaluation, catalyzing a shift from traditional MT metrics toward prompt-based, explainable assessment paradigms.

1. Metric Formulation and Prompting Paradigms

GEMBA’s primary innovation lies in using zero-shot prompt templates to instruct an LLM to rate translation quality at the segment level, then aggregate results system-wide. Prompts encode the source sentence, candidate translation, and (optionally) a human reference translation. Four template variants are employed:

  • GEMBA-DA (Direct Assessment): Asks for a continuous score in [0, 100].
  • GEMBA-SQM (Scalar Quality Metrics): Also elicits a [0, 100] score.
  • GEMBA-stars: Requests a rating of 1 to 5 stars.
  • GEMBA-classes: Requests a discrete assignment to one of five predefined quality categories.

Each template can be instantiated in a reference-based mode (using the human reference) or a reference-less variant. For every segment, the LLM is prompted accordingly; outputs outside the expected range trigger a reprompt with increased randomness (temperature). The final system-level score is the arithmetic mean of segment-level outputs:

$$\text{Score}_{\text{system}} = \frac{1}{N} \sum_{i=1}^{N} \text{Score}_i$$

where $N$ is the number of segments in the evaluated set.
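
The procedure can be summarized in a short sketch. The following is a minimal illustration, not the paper's reference implementation: the prompt wording paraphrases the GEMBA-DA template rather than reproducing it, and query_llm is an assumed helper that sends a prompt to whichever LLM backend is used and returns its text reply.

```python
import re
from statistics import mean

# Illustrative GEMBA-DA-style template (paraphrased, not the verbatim prompt
# from the paper); drop the reference line for the reference-less variant.
DA_TEMPLATE = (
    "Score the following translation from {src_lang} to {tgt_lang} with "
    "respect to the human reference on a continuous scale from 0 to 100, "
    'where 0 means "no meaning preserved" and 100 means '
    '"perfect meaning and grammar".\n\n'
    "{src_lang} source: {source}\n"
    "{tgt_lang} human reference: {reference}\n"
    "{tgt_lang} translation: {candidate}\n"
    "Score:"
)

def gemba_da_segment(query_llm, src_lang, tgt_lang, source, candidate,
                     reference, max_retries=3):
    """Score one segment; reprompt with higher temperature if the reply is
    not a number in [0, 100]."""
    prompt = DA_TEMPLATE.format(src_lang=src_lang, tgt_lang=tgt_lang,
                                source=source, reference=reference,
                                candidate=candidate)
    temperature = 0.0
    for _ in range(max_retries):
        reply = query_llm(prompt, temperature=temperature)
        match = re.search(r"\d+(?:\.\d+)?", reply)
        if match and 0.0 <= float(match.group()) <= 100.0:
            return float(match.group())
        temperature += 0.5  # increase randomness before reprompting
    return None  # segment could not be scored after max_retries attempts

def gemba_da_system(query_llm, segments, src_lang, tgt_lang):
    """System-level score = arithmetic mean of the segment-level scores."""
    scores = [gemba_da_segment(query_llm, src_lang, tgt_lang, src, cand, ref)
              for src, cand, ref in segments]
    return mean(s for s in scores if s is not None)
```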

2. Empirical Performance and Language Coverage

GEMBA was benchmarked on the MQM 2022 test set from the WMT22 Metrics shared task, evaluating English→German, English→Russian, and Chinese→English. With GPT-4 and direct assessment (GEMBA-GPT4-DA), state-of-the-art system-level pairwise accuracies were achieved:

  • Reference-based: 89.8% pairwise accuracy (surpassing MetricX XXL and BLEURT-20)
  • Reference-less: 87.6% pairwise accuracy

These scores represent the fraction of system pairs for which GEMBA’s ranking matches that of human MQM labels, using the standard definition:

$$\text{Accuracy} = \frac{\left|\{ (i,j) : \operatorname{sign}(m_{ij}) = \operatorname{sign}(h_{ij}) \}\right|}{\left|\{ (i,j) \}\right|}$$

Here $m_{ij}$ is the difference between GEMBA scores for systems $i$ and $j$, and $h_{ij}$ is the corresponding difference in human labels.
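
As a concrete illustration, the following snippet computes this pairwise accuracy from system-level scores. The tie-handling choice (skipping pairs with identical scores) and the toy numbers at the bottom are assumptions made for the example, not values from the paper.

```python
from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of system pairs whose metric ranking agrees in sign with the
    human (e.g. MQM-derived) ranking. Both arguments map system -> score."""
    agree, total = 0, 0
    for i, j in combinations(sorted(metric_scores), 2):
        m_ij = metric_scores[i] - metric_scores[j]
        h_ij = human_scores[i] - human_scores[j]
        if m_ij == 0 or h_ij == 0:  # simplification: skip tied pairs
            continue
        total += 1
        agree += (m_ij > 0) == (h_ij > 0)
    return agree / total

# Toy example with invented numbers:
metric = {"sysA": 86.2, "sysB": 81.5, "sysC": 84.0}
human = {"sysA": -2.1, "sysB": -4.5, "sysC": -3.0}  # e.g. negated MQM scores
print(pairwise_accuracy(metric, human))  # -> 1.0
```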

At the segment level, Kendall’s Tau correlations lagged slightly behind regression-based metrics—largely due to the discrete and tied output distributions. Nevertheless, GEMBA’s system rankings remained robust, exhibiting consistently high correlation with human assessments across all tested language pairs.

3. Model Dependency and Generalization

GEMBA’s effectiveness depends critically on the reasoning capabilities of the underlying LLM. Among nine models tested (GPT-2, Ada, Babbage, Curie, Davinci-002, ChatGPT, Davinci-003, Turbo, GPT-4), only GPT-3.5-level models and above produced valid, high-quality scores. Early GPT-2 and smaller models failed to output responses in the correct scale or produced random guesses. In contrast, GPT-4 demonstrated strong handling of diverse prompting modes and achieved best-in-class results throughout.

This dependency on capable LLMs has practical consequences; deployment quality may vary for under-resourced languages or linguistic phenomena poorly represented in the model’s pretraining data.

4. Relation to Traditional Metrics and Extensions

Traditional MT metrics such as BLEU, COMET, and BLEURT rely on n-gram overlap, learned regression models, or reference-based embeddings. GEMBA diverges by leveraging zero-shot natural language understanding, allowing both reference-based and reference-less operation.

Recent extensions include the Knowledge-Prompted Estimator (KPE), which improves segment-level discrimination and explainability by:

  • Decomposing the quality assessment into fluency (perplexity), token-level similarity, and sentence-level similarity via dedicated prompts;
  • Combining these facets with chain-of-thought reasoning prompts for interpretability and higher segment-level Kendall correlation (e.g., 29.1% for CoT1 vs. 28.8% for GEMBA one-step) (Yang et al., 2023).

Another notable extension, GEMBA-MQM, adapts GEMBA’s principles for fine-grained error span annotation according to MQM, using a fixed three-shot GPT-4 prompt. GEMBA-MQM’s outputs are severity-weighted error counts, and its pairwise system accuracy again matches or surpasses prior state-of-the-art metrics in reference-less settings (Kocmi et al., 2023).
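
To make the severity-weighted scoring concrete, here is a minimal sketch of how parsed error spans could be turned into a segment score. The weights (minor = 1, major = 5, critical = 25) follow a common MQM convention and the error records are invented for illustration; consult the GEMBA-MQM paper for the exact weighting and output format it uses.

```python
# Common MQM-style severity weights (an assumption for this sketch, not
# necessarily the exact values used by GEMBA-MQM).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}

def mqm_segment_score(errors):
    """Severity-weighted error count for one segment, negated so that
    higher values mean better quality (0 = no errors found)."""
    return -sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)

# Hypothetical error spans as they might look after parsing the LLM reply:
errors = [
    {"span": "bank", "category": "mistranslation", "severity": "major"},
    {"span": ",", "category": "punctuation", "severity": "minor"},
]
print(mqm_segment_score(errors))  # -> -6
```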

5. Resource Efficiency and Scalability

Prompt-based evaluation with large LLMs incurs substantial computational cost proportional to the total token count processed—especially in settings with thousands of examples and long, complex prompts. This challenge has motivated the development of compression frameworks:

  • PromptOptMe introduces a two-stage fine-tuning process (supervised fine-tuning followed by ORPO preference optimization) for compressing input/prompt pairs prior to LLM evaluation, achieving up to 2.37× token reduction without loss in system-level accuracy (Larionov et al., 20 Dec 2024).
  • BatchGEMBA-MQM integrates batched prompting (evaluating several translations in a single prompt that shares instructions and demonstrations; see the sketch after this list), yielding a further 2–4× efficiency gain from batching and another 13–15% from batching-aware prompt compression. Token savings are substantial, while the top-performing LLMs (notably GPT-4o) retain >90% of baseline accuracy at batch size 4, compared to a 44.6% quality drop without compression (Larionov et al., 4 Mar 2025).
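
A minimal sketch of the batching idea follows; the prompt layout and the "<id>: <score>" answer convention are assumptions made for illustration and do not reproduce the BatchGEMBA-MQM template.

```python
def build_batched_prompt(items, src_lang, tgt_lang):
    """Pack several source/translation pairs into one prompt so the shared
    instructions are paid for once per batch instead of once per segment."""
    header = (
        f"Evaluate each {src_lang}->{tgt_lang} translation below and assign "
        "a quality score from 0 to 100. Answer with one line per item in "
        "the form '<id>: <score>'.\n"
    )
    body = "\n".join(
        f"[{i}] Source: {src}\n[{i}] Translation: {hyp}"
        for i, (src, hyp) in enumerate(items, start=1)
    )
    return header + "\n" + body

def parse_batched_reply(reply, n_items):
    """Map '<id>: <score>' lines in the reply back to batch positions;
    missing or malformed items come back as None."""
    scores = {}
    for line in reply.splitlines():
        left, _, right = line.partition(":")
        left = left.strip().lstrip("[").rstrip("]")
        try:
            scores[int(left)] = float(right.strip())
        except ValueError:
            continue
    return [scores.get(i) for i in range(1, n_items + 1)]
```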

The following table organizes key efficiency approaches addressing GEMBA’s scalability:

| Approach | Token Reduction | System Accuracy Impact |
| --- | --- | --- |
| PromptOptMe | up to 2.37× | Maintained or improved |
| BatchGEMBA-MQM | 2–4× (batching) plus 13–15% (compression) | Minimal loss with GPT-4o; variable for smaller LLMs |

A plausible implication is that careful prompt refinement and batching, when combined with high-capacity LLMs and error-aware compression, enable GEMBA metrics to scale to large datasets and practical deployment constraints while preserving human-level evaluation reliability.

6. Practical Considerations, Limitations, and Reproducibility

GEMBA offers major advantages in translation workflows:

  • Applicability in both reference-based and quality estimation scenarios;
  • Zero-shot operation without system-specific tuning;
  • Suitability for immediate integration into MT system ranking pipelines or as a QA filter.

Notable limitations include:

  • Diminished segment-level discrimination due to discretized outputs and ties;
  • Reduced performance on low-resource languages with limited relevant pretraining data;
  • Reliance on closed/proprietary LLM APIs (notably GPT-4), introducing reproducibility and future availability concerns—model drift or retraining can alter metric properties between experiments (Kocmi et al., 2023).

A critical strength is the open-source release of code, prompt templates, and experimental outputs for GEMBA, augmenting credibility and enabling independent validation.

7. Impact and Evolution in Automated Translation Evaluation

GEMBA and its derivatives have redefined MT evaluation by shifting the paradigm toward explainable, LLM-driven frameworks. Their performance on system-level tasks has set new state-of-the-art benchmarks against MQM-based human annotation, and subsequent work has enhanced interpretability, efficiency, and scalability. A plausible next step is the integration of open-weight LLMs and further prompt engineering to overcome current limitations around cost, proprietary models, and low-resource languages.

The GEMBA metric family thus represents both an apex and a foundation in LLM-driven quality estimation, merging prompt-based assessment with the advanced reasoning and multilingual capacities unique to contemporary generative LLMs.