- The paper demonstrates that automatic input rewriting using Large Language Models, especially simplification and translatability-aware methods, significantly improves machine translation quality.
- Key findings indicate that simplification and inference-time selection based on quality estimation metrics such as xCOMET yield substantial translation quality gains across multiple languages, with a moderate trade-off against meaning preservation.
- Simplification-based rewriting improves the readability of both source and translated text, offers an advantage over post-editing outputs, and yields translations that human evaluators rate as more fluent and better at preserving meaning than translations of the original inputs.
This paper explores the impact of automatic input rewriting using LLMs on Machine Translation (MT) quality. It investigates whether rewriting inputs can improve translation quality from English into six target languages (German, Russian, Chinese, Czech, Hebrew, and Japanese) using open-weight LLMs. The research focuses on identifying effective rewriting strategies and leveraging quality estimation metrics to enhance translatability.
The paper categorizes input rewriting methods into three types:
- MT-Agnostic: These methods rewrite inputs without using any translation-related knowledge. Techniques include:
  - Simplification: simplifying complex words, rephrasing syntactic structures, and shortening sentences.
  - Paraphrasing: rephrasing inputs to normalize them toward language patterns seen in LLM training data.
  - Stylistic: using a text editing tool (CoEdIT-XL) to rewrite inputs according to style specifications such as grammar, coherence, understandability, and formality.
- Task-Aware: These methods incorporate information about the MT task into the rewriting process. Techniques include:
  - Easy Translation: prompting LLMs to rewrite inputs so that they are easier to translate into the target language.
  - Chain of Thought Rewrite+Translate: prompting LLMs to handle rewriting and translation in one sequence of instructions.
- Translatability-Aware: These methods use quality estimation scores to assess the translatability of inputs at the segment level. Techniques include:
  - Inference-Time Selection: translating both the original input and its rewrite, scoring each translation with xCOMET, and keeping the rewrite only if it yields the higher score.
  - Supervised Fine-tuning: fine-tuning the base LLM to rewrite inputs for improved translation, using a dataset of positive rewrite examples selected by xCOMET score.
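The inference-time selection step above can be sketched in a few lines. In this sketch, `rewrite`, `translate`, and `xcomet_qe` are hypothetical stand-ins for the LLM rewriter, the MT system (e.g. Tower-Instruct), and the xCOMET scorer; this is an illustration, not the paper's implementation:

```python
def select_input(source, rewrite, translate, xcomet_qe):
    """Return the rewrite only if its translation scores higher under xCOMET QE.

    rewrite, translate, xcomet_qe are hypothetical callables standing in for
    an LLM rewriter, an MT system, and a reference-free xCOMET scorer.
    """
    candidate = rewrite(source)
    # Translate each input variant and score it with reference-free QE.
    score_original = xcomet_qe(source, translate(source))
    score_rewrite = xcomet_qe(candidate, translate(candidate))
    # Fall back to the original input when the rewrite does not help.
    return candidate if score_rewrite > score_original else source
```

Because both variants must be translated and scored, this strategy roughly doubles inference cost per segment, which is the usual price of such selection schemes.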
The experimental setup involves using Tower-Instruct 7B as the MT system and evaluating the rewriting methods using metrics such as xCOMET and MetricX. The evaluation metrics consider translatability, meaning preservation, and overall translation quality.
The key findings of the paper are:
- Simplification is the most effective MT-agnostic rewrite strategy.
- Using quality estimation signals to assess translatability and select rewrites further improves MT quality.
- Simplified rewrites and their MT outputs preserve the original meaning of the source and MT.
Specifically, the paper presents the following results:
- MT-Agnostic rewrites, especially simplification, improve translatability. Simplifying with Tower-Instruct improved translation quality as measured by xCOMET and maintained quality according to MetricX.
- Inference-time selection based on translatability scores improves translation quality, with average xCOMET gains of 0.024 for English-German (en-de), 0.031 for English-Russian (en-ru), and 0.025 for English-Chinese (en-zh).
- A moderate negative correlation between translatability and meaning preservation scores was found, indicating a trade-off between the two metrics.
- Simplification and translatability-based selection lead to progressive improvements in translation quality for held-out test sets (English-Czech, English-Hebrew, and English-Japanese), with the selection strategy excelling in language pairs with lower-resource target languages.
- Simplification as an input rewriting strategy enhances the readability of both inputs and translation outputs, as measured by the Flesch Reading Ease score and Gunning Fog index.
- Rewriting inputs offers an advantage over post-editing outputs. Combining input rewriting and post-editing yields the highest translation quality.
- Human evaluation confirms that translations from simplified inputs are rated as more fluent, understandable, and better at preserving the meaning of the reference translation.
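The readability result above relies on the standard Flesch Reading Ease and Gunning Fog formulas. A minimal sketch with a heuristic syllable counter follows; this is an assumption-laden illustration of the metrics themselves, not the paper's implementation:

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ue")):
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text):
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words); higher = easier."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

def gunning_fog(text):
    """0.4*((words/sentences) + 100*(complex words/words)); lower = easier."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * ((len(words) / sentences) + 100 * (complex_words / len(words)))
```

Both metrics reward shorter sentences and shorter words, which is exactly what simplification-based rewriting produces.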
The paper concludes that LLM-assisted input rewriting is a promising direction for improving translations. Future work is needed to discover optimal rewriting strategies for a broader range of models and to design richer interactive approaches to translation with LLMs.
The following equations define the source rewriting process:
- MT-Agnostic Rewriting:
  $s' = M_\theta(s)$
  - $s$: original source sentence
  - $s'$: rewritten source sentence
  - $M_\theta$: rewrite model with parameters $\theta$
- Task-Aware Rewriting:
  $s' = M_\theta(s, \text{MT task})$
  - $\text{MT task}$: information about the machine translation task (e.g., the target language)
- Translatability-Aware Rewriting:
  $s' = M_\theta(s, \mathrm{xCOMET}(s, \mathrm{MT}(s)))$
  - $\mathrm{xCOMET}(s, \mathrm{MT}(s))$: xCOMET quality estimation score between the source $s$ and its translation $\mathrm{MT}(s)$ produced by a specific MT system
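The translatability-aware formulation can also drive dataset construction for supervised fine-tuning: keep only rewrites whose translations score higher than those of the originals. A minimal sketch, again with hypothetical `rewrite`, `translate`, and `xcomet_qe` stand-ins rather than the paper's actual pipeline:

```python
def build_sft_pairs(sources, rewrite, translate, xcomet_qe, margin=0.0):
    """Collect (original -> rewrite) pairs where the rewrite translates better.

    rewrite, translate, xcomet_qe are hypothetical callables standing in for
    the LLM rewriter, the MT system, and the xCOMET scorer.
    """
    pairs = []
    for s in sources:
        r = rewrite(s)
        # Translatability gain of the rewrite over the original input.
        gain = xcomet_qe(r, translate(r)) - xcomet_qe(s, translate(s))
        if gain > margin:  # keep only "positive" rewrite examples
            pairs.append({"input": s, "target": r})
    return pairs
```

A nonzero `margin` would trade dataset size for cleaner positive examples, at the cost of discarding borderline rewrites.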