There's no Data Like Better Data: Using QE Metrics for MT Data Filtering (2311.05350v1)

Published 9 Nov 2023 in cs.CL

Abstract: Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen major improvements in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics for filtering out bad-quality sentence pairs in the training data of neural machine translation (NMT) systems. While most corpus filtering methods focus on detecting noisy examples in collections of texts, usually huge amounts of web-crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between the two approaches.
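The filtering recipe described in the abstract reduces to scoring every (source, target) training pair with a reference-free QE metric and keeping only the top-scoring fraction. The sketch below illustrates that selection step under stated assumptions: `dummy_qe_score` is a hypothetical stand-in (a simple length-ratio heuristic so the script runs), not the QE metric used in the paper, and the 50% retention ratio mirrors the reported halving of the training set.

```python
"""Minimal sketch of QE-based training-data filtering.

Assumption: `dummy_qe_score` is a toy placeholder, NOT a real QE metric.
In practice it would be replaced by a neural reference-free QE model.
"""
from typing import Callable, List, Tuple


def dummy_qe_score(src: str, tgt: str) -> float:
    # Toy stand-in so the sketch runs end to end: penalizes extreme
    # length mismatches between source and target. A real QE model
    # would estimate translation quality without a reference here.
    return min(len(src), len(tgt)) / max(len(src), len(tgt), 1)


def filter_by_qe(
    pairs: List[Tuple[str, str]],
    scorer: Callable[[str, str], float],
    keep_ratio: float = 0.5,  # keep the best-scoring half, as in the paper
) -> List[Tuple[str, str]]:
    """Keep the top `keep_ratio` fraction of sentence pairs by QE score."""
    scored = sorted(pairs, key=lambda p: scorer(*p), reverse=True)
    n_keep = int(len(scored) * keep_ratio)
    return scored[:n_keep]


if __name__ == "__main__":
    corpus = [
        ("Das ist ein Test.", "This is a test."),
        ("Guten Morgen!", "Good cheese, everyone, thank you very much!"),
    ]
    filtered = filter_by_qe(corpus, dummy_qe_score, keep_ratio=0.5)
    print(filtered)  # retains the better-scored pair
```

For corpora too large to sort in memory, the same idea can be applied by scoring in a streaming pass and keeping pairs above a precomputed score threshold instead of ranking the full dataset.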

Authors (6)
  1. Jan-Thorsten Peter (5 papers)
  2. David Vilar (12 papers)
  3. Daniel Deutsch (28 papers)
  4. Mara Finkelstein (13 papers)
  5. Juraj Juraska (17 papers)
  6. Markus Freitag (49 papers)
Citations (13)
