AI-Assisted Human Evaluation of Machine Translation (2406.12419v3)

Published 18 Jun 2024 in cs.CL

Abstract: Annually, research teams spend large amounts of money to evaluate the quality of machine translation systems (WMT, inter alia). This is expensive because it requires a lot of expert human labor. In the recently adopted annotation protocol, Error Span Annotation (ESA), annotators mark erroneous parts of the translation and then assign a final score. A lot of the annotator time is spent on scanning the translation for possible errors. In our work, we help the annotators by pre-filling the error annotations with recall-oriented automatic quality estimation. With this AI assistance, we obtain annotations at the same quality level while cutting down the time per span annotation by half (71s/error span $\rightarrow$ 31s/error span). The biggest advantage of the ESA$^\mathrm{AI}$ protocol is an accurate priming of annotators (pre-filled error spans) before they assign the final score. This alleviates a potential automation bias, which we confirm to be low. In our experiments, we find that the annotation budget can be further reduced by almost 25% with filtering of examples that the AI deems to be likely to be correct.


Summary

  • The paper introduces an AI-assisted ESA protocol that reduces annotation time from 71 to 31 seconds per error span and enhances inter-annotator agreement.
  • It employs a GPT-based GEMBA system to pre-fill error spans, increasing detailed annotations from 0.5 to 1.63 per translation segment.
  • The approach cuts annotation costs by up to 24% while maintaining high quality, making MT evaluation more scalable and reliable.

AI-Assisted Human Evaluation of Machine Translation

The paper "AI-Assisted Human Evaluation of Machine Translation" investigates the integration of AI into the human evaluation process of machine translation (MT) systems, introducing a novel Error Span Annotation (ESA) protocol that is complemented by automatic quality estimation. This collaborative approach between human annotators and AI aims to enhance both the efficiency and quality of annotations in MT evaluation.

The crux of this research is using AI to pre-fill error annotations, priming human evaluators before they assign a score. A recall-oriented quality estimation system, the GPT-based GEMBA, proposes initial error spans, which annotators then confirm, edit, or discard. One significant outcome is that this AI assistance cuts the time per error span annotation from 71 seconds to 31 seconds, a marked efficiency gain. In addition, pre-filtering examples that the AI predicts as likely correct reduces annotation costs by up to 24% without compromising evaluation quality.
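
To make the workflow concrete, the following is a minimal Python sketch of the pre-filling and budget-reduction filtering steps described above, not the authors' implementation. The `ErrorSpan` and `Segment` structures, the `quality_estimate` callable (standing in for a GEMBA-like QE system), and the skip threshold are all illustrative assumptions.

```python
# Hypothetical sketch of the ESA^AI pre-filling and filtering steps.
# ErrorSpan, Segment, quality_estimate, and skip_threshold are illustrative
# assumptions, not names from the paper.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class ErrorSpan:
    start: int       # character offset of the error within the translation
    end: int
    severity: str    # e.g. "minor" or "major", as suggested by the QE system

@dataclass
class Segment:
    source: str
    translation: str
    prefilled_spans: List[ErrorSpan] = field(default_factory=list)

def prefill_and_filter(
    segments: List[Segment],
    quality_estimate: Callable[[str, str], Tuple[List[ErrorSpan], float]],
    skip_threshold: float = 0.9,
) -> List[Segment]:
    """Pre-fill error spans from an automatic QE system and drop segments the
    system considers very likely correct (the budget-reduction step)."""
    kept = []
    for seg in segments:
        spans, prob_correct = quality_estimate(seg.source, seg.translation)
        if prob_correct >= skip_threshold:
            continue  # AI deems the translation likely correct; skip human annotation
        seg.prefilled_spans = spans  # annotators later confirm, edit, or delete these
        kept.append(seg)
    return kept
```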

A noteworthy finding is the improvement in inter-annotator agreement and consistency attributable to the AI-driven pre-annotation. Agreement scores increase compared to ESA without AI assistance, suggesting that the pre-filled spans give annotators a shared starting point that traditional methods lack. Furthermore, the AI-assisted pipeline yields substantially more detailed annotations, an average of 1.63 error spans per translation segment versus 0.5 in the human-only ESA setup. This richer annotation provides a more nuanced picture of translation quality, which is crucial for system comparison and refinement.
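
As a rough illustration of the two statistics mentioned above, the sketch below computes the average number of error spans per segment and a simple pairwise agreement between annotators' segment scores. Using Kendall's tau as the agreement measure is an assumption for illustration; the paper's exact measure may differ.

```python
# Illustrative statistics: spans per segment and pairwise agreement.
# Kendall's tau is an assumed agreement measure, not necessarily the paper's.
from itertools import combinations
from scipy.stats import kendalltau

def avg_spans_per_segment(annotations):
    """annotations: list of segments, each a list of error spans."""
    return sum(len(spans) for spans in annotations) / len(annotations)

def pairwise_agreement(scores_by_annotator):
    """scores_by_annotator: dict mapping annotator id -> list of segment scores
    (same segment order for every annotator). Returns mean pairwise Kendall's tau."""
    taus = []
    for a, b in combinations(scores_by_annotator, 2):
        tau, _ = kendalltau(scores_by_annotator[a], scores_by_annotator[b])
        taus.append(tau)
    return sum(taus) / len(taus)
```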

A critical concern with pre-filled annotations is automation bias, i.e., annotators over-relying on the AI suggestions. The findings indicate this bias is low: annotators remain engaged and accurate under AI assistance, as evidenced by consistent annotation quality throughout the task and the pass rate on attention checks.

The paper also examines whether AI-assisted annotations suffice for system comparison, finding results comparable to or better than those of existing human-only methods. System-level scores correlate strongly with established WMT human evaluation results, reinforcing the protocol's reliability.
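
A minimal sketch of such a system-level check is shown below: average the ESA$^\mathrm{AI}$ segment scores per MT system and correlate them with reference human scores (e.g., from a WMT campaign). The choice of Pearson correlation and the function names here are illustrative assumptions.

```python
# Sketch of a system-level correlation check (illustrative, not the paper's code).
from statistics import mean
from scipy.stats import pearsonr

def system_scores(segment_scores):
    """segment_scores: dict mapping system name -> list of segment-level scores."""
    return {system: mean(scores) for system, scores in segment_scores.items()}

def system_level_correlation(esa_ai_scores, reference_scores):
    """Correlate ESA^AI system scores with reference human evaluation scores."""
    systems = sorted(set(esa_ai_scores) & set(reference_scores))
    x = [esa_ai_scores[s] for s in systems]
    y = [reference_scores[s] for s in systems]
    r, _ = pearsonr(x, y)
    return r
```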

In terms of implications, these findings suggest that incorporating AI can make the traditionally labor-intensive process of MT evaluation more scalable and affordable. Because annotation quality stays high and automation bias remains low, AI-augmented evaluation could extend to other areas of natural language processing where meticulous human assessment is critical. Future research could focus on refining quality estimation models to minimize potential biases towards any specific MT system architecture.

Overall, the integration of AI into human evaluation processes presents a promising trajectory for enhancing efficiency and consistency, with significant implications for optimizing resource allocation in MT system development and assessment.
