Analysis of Machine Translation Evaluation through Zero-Shot Paraphrasing
This paper presents a novel approach to evaluating machine translation (MT) systems with a sequence-to-sequence paraphraser trained as a multilingual neural machine translation (NMT) system. By reframing paraphrasing as a zero-shot translation task, the authors aim to provide a robust metric for assessing translation quality without relying on human judgments during training. Their system, trained on 39 languages, outperforms or statistically ties with existing MT evaluation metrics on most language pairs in the WMT 2019 segment-level metrics shared task.
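To make the zero-shot framing concrete, the sketch below treats paraphrasing as "translating" English into English with an off-the-shelf many-to-many NMT model. This uses facebook/m2m100_418M purely as a stand-in, not the authors' own 39-language model, and it generates a paraphrase only for illustration; the metric itself never generates output but force-decodes the MT hypothesis to read off its probability, as sketched in the next section.

```python
# Zero-shot paraphrasing via multilingual NMT: request English output for English
# input, even though no en->en training pairs were seen. Illustrative stand-in model
# (facebook/m2m100_418M), not the 39-language model from the paper.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"
inputs = tokenizer("The weather is very nice today.", return_tensors="pt")

# Forcing the target-language tag to English turns translation into paraphrasing.
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("en"),
    num_beams=5,
    max_new_tokens=40,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```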
Key Findings and Methodology
The primary innovation lies in treating sentential paraphrasing as zero-shot translation within a multilingual NMT architecture. Because such a paraphraser assigns its highest probability to output that matches the reference, it implicitly rewards MT output that aligns lexically and syntactically with the human reference; the authors call this a "lexically/syntactically unbiased" paraphraser, in contrast to paraphrasers trained on paraphrase corpora, which are biased toward rewording and thus penalize close matches. Notably, the approach requires no human judgment data for training, relying instead on parallel bitext in many languages to produce a single, language-agnostic metric.
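As I read the paper, the reference-based score combines the paraphraser's length-normalized log-probabilities in both directions (hypothesis given reference, and reference given hypothesis), while the reference-free variant scores the hypothesis conditioned on the source alone. A minimal sketch of that combination, assuming per-token log-probabilities have already been obtained by force-decoding with the multilingual model (the exact normalization in the released implementation may differ):

```python
from statistics import mean
from typing import List

def prism_ref_style(logp_hyp_given_ref: List[float],
                    logp_ref_given_hyp: List[float]) -> float:
    """Average of length-normalized log-likelihoods in both directions.

    Each argument holds one log-probability per target token, produced by
    force-decoding with the multilingual NMT paraphraser (not shown here).
    """
    return 0.5 * (mean(logp_hyp_given_ref) + mean(logp_ref_given_hyp))

def prism_src_style(logp_hyp_given_src: List[float]) -> float:
    """Reference-free (QE) variant: score the hypothesis given the source only."""
    return mean(logp_hyp_given_src)

# Toy numbers only; real values come from the model's forced decoding.
print(prism_ref_style([-0.4, -1.1, -0.3], [-0.5, -0.9, -0.6, -0.2]))
```

Higher (less negative) scores indicate that the model finds the hypothesis a more probable paraphrase of the reference, and vice versa.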
When tested on the WMT 2019 data, the multilingual model consistently surpasses traditional metrics such as BLEU, whose correlation with human judgment has been shown to diminish as MT systems improve. Prism-ref, the reference-based variant of the metric, achieves superior correlation on most language pairs compared with existing methods, including BERTScore and BLEURT. Furthermore, Prism-src, the source-conditioned variant for quality estimation (QE), outperforms the submissions to the WMT 2019 QE-as-a-metric shared task without using any reference translations.
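For context, segment-level metric quality in WMT 2019 was judged by agreement with human assessments (a daRR variant of Kendall's tau over relative-ranking pairs derived from direct assessment scores). A simplified sketch with plain Kendall's tau and made-up numbers, just to show the shape of the comparison:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical per-segment scores; real evaluations use thousands of segments
# and the WMT daRR formulation rather than plain Kendall's tau.
metric_scores = np.array([-1.20, -0.35, -0.90, -0.10, -0.70, -0.55])
human_da      = np.array([ 48.0,  81.0,  60.0,  92.0,  66.0,  74.0])

tau, p_value = kendalltau(metric_scores, human_da)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```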
Implications and Future Directions
The implications of this research are significant, offering a scalable approach to MT evaluation as systems grow stronger and cover more languages. The model's ability to reliably assess strong MT systems suggests applications in real-time translation quality assessment, enabling rapid iteration and improvement without extensive human oversight.
Several directions for refinement and expansion remain. Extending the methodology to document-level evaluation aligns with the growing emphasis on broader context in translation, and as stronger multilingual models emerge, further gains in evaluation accuracy and efficiency can be expected.
Conclusion
This paper represents a significant shift towards leveraging multilingual NMT models for automatic evaluation, offering a more resilient and versatile framework compared to legacy metrics. As multilingual training methods continue to evolve, their utility in creating robust, language-agnostic evaluation tools will likely catalyze advancements in both MT systems and broader NLP applications. The release of the model and toolkit paves the way for further exploration and collaboration within the MT research community.