Knowledge-Prompted Estimator (KPE) in MT Quality
- The paper introduces KPE, a chain-of-thought prompting method that decomposes segment-level quality estimation into fluency, token adequacy, and semantic similarity.
- Its best variant (CoT1, which combines perplexity-based fluency with token-level adequacy) attains state-of-the-art segment-level Kendall τ, edging out specialist metrics such as TeacherSim-LM and one-step LLM prompting approaches such as GEMBA.
- KPE enhances explainability through precise token alignment visualizations while managing computational overhead from multiple inference calls.
The Knowledge-Prompted Estimator (KPE) is a Chain-of-Thought (CoT) prompting-based approach for explainable segment-level machine translation (MT) quality estimation. Unlike previous methods relying on traditional neural models or simple one-step LLM prompting, KPE systematically decomposes the problem into fluency, token-level adequacy, and sentence-level semantic similarity, explicitly eliciting intermediate reasoning from LLMs to improve both accuracy and interpretability in MT assessment (Yang et al., 2023).
1. Motivation and Context
Segment-level machine translation quality estimation (QE) requires assigning quality scores to individual sentence pairs, presenting unique challenges relative to system-level QE. Human segment-level judgments are intrinsically noisy and multi-dimensional—typically balancing fluency against adequacy—which have proven challenging for both traditional neural QE models (e.g., Predictor-Estimator pipelines) and recent LLM-based methods. Such traditional pipelines demand substantial supervised data, struggle with cross-lingual generalization, and lack inherent interpretability. One-step prompting approaches using LLMs—such as GEMBA—attain strong results at the system level, but tend to underperform at the sentence level; they collapse all reasoning into a single prompt, which constrains the LLM’s ability to explicitly address different quality facets. For instance, GEMBA’s segment-level Kendall τ (28.8%) is below that of specialist metrics such as TeacherSim-LM (29.0%) (Yang et al., 2023).
2. Chain-of-Thought Prompting Design
KPE adopts a structured Chain-of-Thought methodology to address segment-level QE. The pipeline comprises several stages:
- Generate three independent one-shot LLM responses for each segment:
  - Prompt1: Computes perplexity for fluency.
  - Prompt2: Aligns tokens and scores adequacy via token-level similarity.
  - Prompt3: Assesses sentence-level semantic similarity.
- CoT1: Feeds Prompt1 (fluency) and Prompt2 (token adequacy) outputs back to the LLM, requesting a stepwise joint reasoning process and a final quality verdict.
- CoT2: Similarly combines all three (Prompt1, Prompt2, Prompt3) in a three-step reasoning chain.
- The final segment-level score is parsed from the LLM’s response, typically in a 1–5 categorical format.
Explicitly structuring the prompt—e.g., “First, assess fluency by perplexity. Second, judge word-level adequacy. Finally, aggregate.”—induces stepwise, interpretable LLM reasoning. Empirically, CoT1 (using only fluency and token adequacy) provides higher Kendall τ than CoT2, indicating diminishing returns when adding more steps or dimensions (Yang et al., 2023).
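The overall control flow can be sketched as follows. This is a minimal illustration, not the paper's exact templates: `llm_complete` is a placeholder for whatever chat-completion client is used, and the prompt wording and the `Score:` parsing convention are assumptions made here for concreteness.

```python
# Minimal sketch of the KPE CoT1 control flow (illustrative; not the paper's exact templates).
# `llm_complete` is a placeholder for whatever chat-completion client is used.
import re

def llm_complete(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError

def kpe_cot1_score(source: str, translation: str) -> int:
    # One-shot Prompt1: fluency, framed in terms of perplexity.
    fluency = llm_complete(
        "Rate the fluency of the following sentence on a 1-5 scale, reasoning from how "
        f"predictable (low-perplexity) its wording is:\n{translation}"
    )
    # One-shot Prompt2: token-level adequacy via source-translation word alignment.
    adequacy = llm_complete(
        "Align the words of the source and the translation, then rate token-level adequacy "
        f"on a 1-5 scale.\nSource: {source}\nTranslation: {translation}"
    )
    # CoT1: feed both intermediate judgments back and request stepwise joint reasoning.
    verdict = llm_complete(
        "First, assess fluency by perplexity. Second, judge word-level adequacy. "
        "Finally, aggregate into a single 1-5 quality score.\n"
        f"Fluency assessment: {fluency}\nAdequacy assessment: {adequacy}\n"
        "End your answer with 'Score: <1-5>'."
    )
    # Parse the 1-5 categorical score from the final response.
    match = re.search(r"Score:\s*([1-5])", verdict)
    return int(match.group(1)) if match else 3  # midpoint fallback if parsing fails
```

CoT2 would extend the same pattern with a third one-shot call for sentence-level semantic similarity and a three-step aggregation prompt.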
3. Constituent Scoring Dimensions
KPE’s one-shot prompts each target a specific quality dimension:
| Prompt | Targeted Aspect | Computation Method |
|---|---|---|
| Prompt1 | Fluency (perplexity) | Normalized perplexity of the translation |
| Prompt2 | Token-level adequacy | Cosine similarity between aligned source and translation token embeddings |
| Prompt3 | Sentence-level similarity | Cosine similarity between source and translation sentence embeddings |
- Prompt1: Computes and normalizes perplexity, serving as a fluency-based indicator.
- Prompt2: Uses cosine similarity between source and translation token embeddings, yielding a summary adequacy score.
- Prompt3: Employs sentence-level embeddings to calculate global semantic similarity between source and target.
Each prompt is independently executed, returning a scalar judgment, which is subsequently aggregated in the Chain-of-Thought stage.
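The paper elicits these quantities through prompting rather than computing them directly; the following is only a sketch of how the three signals could be approximated locally with Hugging Face transformers. The model names (`gpt2`, `bert-base-multilingual-cased`) and the BERTScore-style greedy matching are illustrative choices made here, not the paper's.

```python
# Sketch of the three constituent signals, computed locally for illustration.
# The paper elicits these quantities via prompting; model names here are illustrative.
import math
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

def fluency_perplexity(translation: str, model_name: str = "gpt2") -> float:
    """Prompt1 analogue: perplexity of the translation under a causal LM (lower = more fluent)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(translation, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token cross-entropy
    return math.exp(loss.item())

def token_adequacy(source: str, translation: str,
                   model_name: str = "bert-base-multilingual-cased") -> float:
    """Prompt2 analogue: greedy max cosine matching of token embeddings (BERTScore-style)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    enc = AutoModel.from_pretrained(model_name)
    with torch.no_grad():
        src = enc(**tok(source, return_tensors="pt")).last_hidden_state[0]
        hyp = enc(**tok(translation, return_tensors="pt")).last_hidden_state[0]
    src = torch.nn.functional.normalize(src, dim=-1)
    hyp = torch.nn.functional.normalize(hyp, dim=-1)
    sim = hyp @ src.T                           # token-to-token cosine similarity matrix
    return sim.max(dim=1).values.mean().item()  # average best match per translation token

def sentence_similarity(source: str, translation: str,
                        model_name: str = "bert-base-multilingual-cased") -> float:
    """Prompt3 analogue: cosine similarity of mean-pooled sentence embeddings."""
    tok = AutoTokenizer.from_pretrained(model_name)
    enc = AutoModel.from_pretrained(model_name)
    with torch.no_grad():
        s = enc(**tok(source, return_tensors="pt")).last_hidden_state[0].mean(dim=0)
        t = enc(**tok(translation, return_tensors="pt")).last_hidden_state[0].mean(dim=0)
    return torch.nn.functional.cosine_similarity(s, t, dim=0).item()
```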
4. Aggregation and Final Scoring
Although KPE employs implicit aggregation via LLM reasoning, the final segment score may be conceptualized as a weighted sum of the three constituent judgments:

score = w1 · s_fluency + w2 · s_token-adequacy + w3 · s_sentence-similarity
In the pipeline, weights are not explicitly trained; rather, their effects are embedded in the stepwise prompts and LLM completions. The authors highlight that using only fluency and token adequacy in CoT1 yields the best segment-level correlation; adding sentence-level similarity (CoT2) does not yield further improvement at this granularity. The paper does not detail any explicit weight-tuning procedure, though such optimization could be performed on held-out DA-judged sets (Yang et al., 2023).
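The paper performs no such tuning; the snippet below is only a sketch of what explicit weight optimization on a held-out DA-scored set could look like, using a simple grid search over the weight simplex and scipy's Kendall τ. All names here (`tune_weights`, the inputs) are hypothetical.

```python
# Hypothetical weight tuning on a held-out DA-scored set (not part of the KPE pipeline itself).
import itertools
import numpy as np
from scipy.stats import kendalltau

def tune_weights(fluency, adequacy, similarity, human_da, step=0.1):
    """Grid-search weights w1 + w2 + w3 = 1 that maximize Kendall tau against human DA scores."""
    fluency, adequacy, similarity = map(np.asarray, (fluency, adequacy, similarity))
    best_tau, best_w = -1.0, (1.0, 0.0, 0.0)
    for w1, w2 in itertools.product(np.arange(0.0, 1.0 + 1e-9, step), repeat=2):
        if w1 + w2 > 1.0:
            continue  # keep the weights on the simplex
        w3 = 1.0 - w1 - w2
        combined = w1 * fluency + w2 * adequacy + w3 * similarity
        tau, _ = kendalltau(combined, human_da)
        if not np.isnan(tau) and tau > best_tau:
            best_tau, best_w = tau, (w1, w2, w3)
    return best_w, best_tau
```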
5. Experimental Evaluation
KPE is evaluated on the WMT18 News segment-level “to-en” Relative Ranking (RR) dataset, comprising approximately 77,000 DE→EN segments, alongside six other language directions (CS, ET, FI, RU, TR, ZH→EN). The primary evaluation metric is Kendall τ correlation against human segment-level pairwise rankings.
| Method | Segment-level Kendall τ (%) |
|---|---|
| Prompt1 (PPL) | 25.7 |
| Prompt2 (Token) | 18.8 |
| Prompt3 (Sent) | 17.1 |
| CoT1 (PPL+Token) | 29.1 (SOTA) |
| CoT2 (All) | 28.9 |
| TeacherSim-LM | 29.0 |
| GEMBA | 28.8 |
All three KPE one-shot prompts outperform traditional deep-learning models (e.g., PLM, LASER, XMoverScore). CoT1 achieves the highest observed Kendall τ for any LLM-based QE system on this suite.
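For reference, the segment-level statistic in WMT relative-ranking evaluations is a Kendall-τ-like quantity computed over human better/worse pairs, roughly (concordant − discordant) / (concordant + discordant). The sketch below assumes that framing; exact tie handling varies across WMT editions.

```python
# Sketch of a WMT RR-style Kendall tau: (concordant - discordant) / (concordant + discordant).
def rr_kendall_tau(pairs):
    """`pairs` contains (metric_score_of_better, metric_score_of_worse) per human RR judgment."""
    concordant = discordant = 0
    for score_better, score_worse in pairs:
        if score_better > score_worse:
            concordant += 1
        elif score_better < score_worse:
            discordant += 1
        # ties are ignored here; WMT editions differ in how ties are treated
    return (concordant - discordant) / (concordant + discordant)

# Example: a metric that ranks the human-preferred translation higher in 3 of 4 pairs.
print(rr_kendall_tau([(4.1, 3.0), (2.5, 2.0), (3.3, 3.9), (4.8, 1.2)]))  # -> 0.5
```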
6. Alignment Visualization and Explainability
To address interpretability, KPE provides visualizations of the token-to-token alignment scores computed during Prompt2. Comparative analysis with M-BERT-based BERTScore alignments (typically diffuse, with >90% of tokens matching across the board) and TeacherSim (improved but still "leaky", with spurious matches) demonstrates that KPE produces sharper, more semantically accurate alignments. For instance, punctuation aligns nearly exclusively with itself, and high-probability matches correspond to semantically equivalent token pairs. Although detailed alignment error metrics are not reported, qualitative assessments indicate substantially lower error rates. This enables users to inspect the contribution of specific source–target token correspondences to the overall adequacy judgment (Yang et al., 2023).
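A heat map of this kind can be rendered directly from a token-to-token similarity matrix, for example the one produced in the token-level adequacy sketch in Section 3; the plotting code below is an illustrative matplotlib example, not the paper's visualization tooling.

```python
# Sketch: render a token-to-token alignment (cosine similarity) matrix as a heat map.
import matplotlib.pyplot as plt

def plot_alignment(sim_matrix, src_tokens, hyp_tokens):
    """`sim_matrix` has shape (len(hyp_tokens), len(src_tokens)) with values in [0, 1]."""
    fig, ax = plt.subplots(figsize=(0.6 * len(src_tokens) + 2, 0.6 * len(hyp_tokens) + 2))
    im = ax.imshow(sim_matrix, cmap="viridis", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=90)
    ax.set_yticks(range(len(hyp_tokens)))
    ax.set_yticklabels(hyp_tokens)
    ax.set_xlabel("source tokens")
    ax.set_ylabel("translation tokens")
    fig.colorbar(im, ax=ax, label="cosine similarity")
    fig.tight_layout()
    plt.show()
```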
7. Computational Considerations and Reproducibility
KPE incurs a computational cost proportional to the number of LLM inferences per segment: one call per constituent one-shot prompt plus one for the CoT aggregation, i.e., three calls per segment for CoT1 and four for CoT2, versus GEMBA's single-call setup. This overhead is considered tractable at segment-level batch sizes. The paper does not present explicit GPU-hour or latency statistics. According to the authors, code is intended for public release upon publication; however, no repository link is included in the version described (Yang et al., 2023).
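As a rough back-of-the-envelope figure (an estimate, not reported in the paper): for the roughly 77,000 DE→EN segments alone, CoT1's three calls per segment imply on the order of 230,000 LLM inferences, versus about 77,000 for a single-call metric such as GEMBA.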