
Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

Published 6 Jun 2026 in cs.CL and cs.AI | (2606.08011v2)

Abstract: Rewriting source text with LLMs before translation has been shown to improve machine translation (MT) quality. However, we find that prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs, such as 4B-parameter models, are used. We argue that this limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone: a rewrite is useful only if it improves downstream translation, yet existing prompt-based methods do not explicitly optimize for this signal. To address this issue, we propose RLSR (Reinforcement Learning for Source Rewriting), a reinforcement learning framework that trains the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite. Experiments across six MT systems and 16 language pairs show that our 4B RLSR-trained rewriting models significantly outperform both the no-rewriting baseline and prompt-based rewriting baselines at the same model scale, while remaining competitive with baselines that use a 235B LLM.

Summary

  • The paper introduces RLSR, a reinforcement learning-based approach that directly optimizes translation quality improvements via reward-driven source rewriting.
  • It employs on-policy RL with DAPO to overcome non-differentiable reward signals, yielding significant gains over prompt-based rewriting in compact models.
  • Empirical results show that RLSR generalizes across multiple MT systems and achieves competitive performance with larger, more resource-intensive models.

Reinforcement Learning for Source Rewriting in Machine Translation

Motivation and Problem Formulation

The paper "Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation" (2606.08011) addresses source-side rewriting: modifying the input to a fixed machine translation (MT) system to improve translation quality, particularly when the MT model is a black box whose internals are inaccessible. Prior approaches to source rewriting used LLMs with prompt-based instructions (e.g., for simplification or paraphrasing). However, the paper's experiments show that prompt-based rewriting with smaller LLMs (e.g., 4B parameters) can degrade translation quality: without direct reward-driven optimization, the source modifications are unreliable and often counterproductive.

To overcome this, the authors propose RLSR: Reinforcement Learning for Source Rewriting. RLSR explicitly optimizes the rewriting model to maximize translation quality improvements, measured by automatic evaluation metrics applied to downstream translations produced from rewritten source inputs.

Figure 1: Overview of RLSR. The rewriting model generates a rewritten source from the original source. A fixed downstream MT model translates the rewritten source, and an MT metric evaluates the translation. The improvement over translating the original source is used as the reward for optimizing the rewriting model.

Methodology

RL Objective and Reward Definition

The core of RLSR is the reward structure: for a given source sentence $s$ and reference translation $r$, the rewriting model $R_\theta$ produces a rewritten source $\tilde{s} = R_\theta(s)$. The downstream MT model $M$ outputs translations for both $s$ and $\tilde{s}$, evaluated via an automatic metric $Q$. The reward for each rewrite is defined as

$$\mathcal{R}(s, \tilde{s}, r) = Q(s, M(\tilde{s}), r) - Q(s, M(s), r)$$

thus quantifying the marginal translation quality improvement attributable to rewriting.
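As an illustration of the reward definition, the following minimal sketch computes the improvement of a rewrite over the original source. The lexicon-based `toy_translate` and the overlap-based `toy_metric` are hypothetical stand-ins invented for this example; a real setup would call a black-box MT system and a learned metric such as xCOMET.

```python
from typing import Callable

# Hypothetical toy MT system: translates known words, copies unknown ones.
LEXICON = {"it": "il", "rains": "pleut", "heavily": "fortement"}


def toy_translate(source: str) -> str:
    return " ".join(LEXICON.get(tok, tok) for tok in source.lower().split())


def toy_metric(source: str, hypothesis: str, reference: str) -> float:
    """Hypothetical quality metric Q(s, hypothesis, r): fraction of reference tokens recovered."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    return len(hyp & ref) / len(ref) if ref else 0.0


def rewrite_reward(
    source: str,
    rewrite: str,
    reference: str,
    translate: Callable[[str], str],
    metric: Callable[[str, str, str], float],
) -> float:
    """R(s, s~, r) = Q(s, M(s~), r) - Q(s, M(s), r)."""
    baseline = metric(source, translate(source), reference)
    rewritten = metric(source, translate(rewrite), reference)
    return rewritten - baseline


source = "it rains cats and dogs"
rewrite = "it rains heavily"
reference = "il pleut fortement"
reward = rewrite_reward(source, rewrite, reference, toy_translate, toy_metric)
print(round(reward, 3))  # prints 0.333
```

Note that the metric always receives the original source $s$, not the rewrite, so a rewrite that leaves the source unchanged receives a reward of exactly zero.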

Training with DAPO

The reward is non-differentiable because it depends on discrete text generation and on metric evaluation, which rules out direct gradient-based optimization. The authors therefore employ on-policy RL, specifically Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), to train the rewriting model. To regularize the policy and prevent reward hacking, a KL penalty anchors the policy to a reference (pretrained) LLM conditioned on the same prompt:

$$\mathcal{R}_{\text{total}}(s, \tilde{s}, r) = \mathcal{R}(s, \tilde{s}, r) - \beta \log \frac{R_{\theta}(\tilde{s} \mid s)}{R_{\text{ref}}(\tilde{s} \mid s)}$$

with $\beta$ controlling penalty strength. The objective is to maximize the expected adjusted reward.
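A rough sketch of how the penalized reward could feed a group-based policy update is shown below. It assumes sequence-level log-probabilities for the policy and the reference model, standardizes rewards within the group of rewrites sampled for one source, and drops groups with no reward variation. The function names, the sequence-level treatment of the log-ratio, and the equal-reward filter are assumptions made for illustration, not details confirmed by the summary.

```python
from statistics import mean, pstdev
from typing import List


def kl_penalized_reward(task_reward: float, logp_policy: float,
                        logp_ref: float, beta: float) -> float:
    """R_total = R - beta * log(R_theta(s~|s) / R_ref(s~|s)), using sequence log-probs."""
    return task_reward - beta * (logp_policy - logp_ref)


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize the rewards of all rewrites sampled for the same source."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards: List[float], tol: float = 1e-12) -> bool:
    """Dynamic-sampling style filter: a group with identical rewards gives no learning signal."""
    return max(rewards) - min(rewards) > tol


# Three sampled rewrites of one source (all numbers are made up).
task_rewards = [0.2, 0.0, -0.1]
logp_policy = [-10.0, -12.0, -11.0]
logp_ref = [-11.0, -12.0, -10.0]
beta = 0.05

shaped = [
    kl_penalized_reward(r, lp, lr, beta)
    for r, lp, lr in zip(task_rewards, logp_policy, logp_ref)
]
if keep_group(shaped):
    advantages = group_advantages(shaped)
    print([round(a, 2) for a in advantages])
```

In this toy group, the first rewrite loses some reward because the policy assigns it a higher probability than the reference does, while the third gains a little for the opposite reason.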

Empirical Analysis

Baselines and Setup

Evaluation spans six MT systems and 16 language pairs corresponding to the WMT2025 General MT Shared Task, using Qwen3 4B as the rewriting model. Baselines include no-rewriting, prompt-based rewriting (simplification, paraphrase, "easy translate" prompts), and larger LLM variants (up to Qwen3 235B). Metrics are xCOMET, MetricX, and GEMBA-MQM. xCOMET is a learned metric that RLSR explicitly optimizes, while GEMBA-MQM provides an independent LLM-based evaluation.

Numerical Results and Claims

The main results highlight several key findings:

  • Prompt-based rewriting with 4B LLMs consistently degrades translation quality across all MT models, confirming that indirect instruction-following is unreliable for compact models.
  • RLSR-trained Qwen3 4B models significantly outperform both prompt-based baselines and the no-rewriting baseline at the same model scale, with improvements robust and statistically significant across metrics and MT models.
  • RLSR-trained 4B models are competitive with prompt-based approaches using 235B LLMs, demonstrating strong parameter efficiency.
  • Evaluation with the reference-free GEMBA-MQM metric indicates that RLSR improves actual translation quality and does not merely overfit the reward metric.

Comparison to Supervised Fine-Tuning and DPO

Direct supervised fine-tuning (SFT) on the best-rewarded rewrites from an offline dataset yields degenerate models that predominantly copy the source and fail to produce meaningful rewrites that improve translation quality. Filtering out unchanged sources worsens performance, because token-level NLL fails to isolate the crucial edit operations. Direct Preference Optimization (DPO) improves over SFT but is less stable and consistently inferior to RL, as it lacks RLSR's on-policy exploration and adaptation.
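To make the offline alternatives concrete, here is one hypothetical way such training data could be assembled from scored candidate rewrites. The summary does not describe the paper's actual data construction, so the `Candidate` type, the fallback to the unchanged source, and the best-versus-worst pairing are all assumptions for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    text: str
    reward: float  # improvement over translating the original source


def best_of_n_target(source: str, candidates: List[Candidate]) -> str:
    """SFT target: the highest-reward rewrite, or the source itself if no rewrite helps."""
    best = max(candidates, key=lambda c: c.reward, default=None)
    if best is None or best.reward <= 0.0:
        return source
    return best.text


def preference_pair(candidates: List[Candidate],
                    min_gap: float = 0.0) -> Optional[Tuple[str, str]]:
    """DPO pair (chosen, rejected): best vs. worst candidate if their rewards differ enough."""
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.reward)
    worst, best = ranked[0], ranked[-1]
    if best.reward - worst.reward <= min_gap:
        return None
    return best.text, worst.text


helpful = [Candidate("it rains heavily", 0.3), Candidate("rain falls", -0.2)]
unhelpful = [Candidate("water descends", -0.1), Candidate("rain falls", -0.02)]

print(best_of_n_target("it rains cats and dogs", helpful))
print(best_of_n_target("the sky is blue", unhelpful))
print(preference_pair(helpful))
```

Under this construction, every source for which no sampled rewrite helps becomes a copy target, which offers one plausible reading of why an SFT model trained on such data would lean toward copying.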

Cross-MT Model Generalization

The paper demonstrates that a rewriting model trained with RLSR for one MT system transfers well to other MT architectures, requiring little or no retraining. Jointly training a rewriting model to optimize a collective reward across multiple MT models yields nearly identical performance to individually optimized models, suggesting that RLSR learns general strategies targeting intrinsic translation obstacles rather than idiosyncratic system preferences.
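One simple way to form a collective reward over several MT systems is to average the per-system improvements, as in the sketch below. Averaging is an assumption here (the summary does not state how the rewards are combined), and the two lexicon-based MT systems and the overlap metric are hypothetical toys.

```python
from typing import Callable, Dict, Sequence

Translator = Callable[[str], str]
Metric = Callable[[str, str, str], float]


def make_lexicon_mt(lexicon: Dict[str, str]) -> Translator:
    """Hypothetical toy MT system: word-by-word lookup, unknown words are copied."""
    def translate(source: str) -> str:
        return " ".join(lexicon.get(tok, tok) for tok in source.lower().split())
    return translate


def overlap_metric(source: str, hypothesis: str, reference: str) -> float:
    """Hypothetical metric: fraction of reference tokens present in the hypothesis."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    return len(hyp & ref) / len(ref) if ref else 0.0


def joint_reward(source: str, rewrite: str, reference: str,
                 translators: Sequence[Translator], metric: Metric) -> float:
    """Mean improvement of the rewrite over the original source across MT systems."""
    if not translators:
        raise ValueError("at least one MT system is required")
    gains = [
        metric(source, mt(rewrite), reference) - metric(source, mt(source), reference)
        for mt in translators
    ]
    return sum(gains) / len(gains)


mt_a = make_lexicon_mt({"hello": "bonjour", "friend": "ami"})
mt_b = make_lexicon_mt({"hello": "salut", "friend": "ami", "buddy": "ami"})

value = joint_reward("hello buddy", "hello friend", "bonjour ami",
                     [mt_a, mt_b], overlap_metric)
print(value)  # prints 0.25
```

Here the rewrite helps the first system, which does not know "buddy", and is neutral for the second, so the averaged reward still favors the edit.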

Behavioral Analysis and Edit Locality

Detailed analysis shows that RLSR performs highly localized edits focused on disfluencies, ambiguities, and non-literal expressions, preserving length and structural fidelity. By contrast, prompt-based baselines tend toward wholesale rephrasing, often flattening style or hallucinating content, which indicates a lack of precision. Case studies confirm that RLSR intervenes selectively, only where translation quality is impeded, and otherwise leaves fluent inputs nearly untouched.
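Edit locality of this kind can be quantified with a token-level diff. The sketch below uses Python's standard `difflib.SequenceMatcher`; the three statistics it reports are illustrative choices and are not necessarily the measures used in the paper.

```python
import difflib
from typing import Dict


def edit_stats(source: str, rewrite: str) -> Dict[str, float]:
    """Token-level locality statistics for a rewrite relative to its source."""
    src, rew = source.split(), rewrite.split()
    matcher = difflib.SequenceMatcher(None, src, rew, autojunk=False)
    # Keep only the non-matching operations (replace, delete, insert).
    edits = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    # Size of each edit: the longer of the source span and the rewrite span.
    changed = sum(max(i2 - i1, j2 - j1) for _, i1, i2, j1, j2 in edits)
    n_src = max(len(src), 1)
    return {
        "changed_fraction": changed / n_src,  # share of tokens touched
        "edit_spans": len(edits),             # number of contiguous edits
        "length_ratio": len(rew) / n_src,     # rewrite length / source length
    }


local = edit_stats(
    "the meeting went south pretty quickly yesterday",
    "the meeting went badly pretty quickly yesterday",
)
wholesale = edit_stats(
    "the meeting went south pretty quickly yesterday",
    "things deteriorated fast during our discussion",
)
print(local)
print(wholesale)
```

A localized rewrite shows a small `changed_fraction`, few `edit_spans`, and a `length_ratio` near 1, whereas a wholesale rephrasing touches most or all tokens.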

Practical and Theoretical Implications

Practically, RLSR offers a scalable method to enhance translation quality via black-box MT systems while avoiding heavy reliance on extremely large LLMs. The framework allows for efficient deployment, needing only a single pass through rewriting and translation models at inference time. Theoretically, this work substantiates RL's superiority for discrete text-editing tasks where reward signals are sparse and impact is highly localized, and it motivates further investigation of reward-driven sampling, credit assignment, and optimization objectives in text generation.

Future Directions

Open questions remain regarding training cost (RLSR is expensive relative to SFT), automated identification of crucial edit tokens for cost-effective supervised objectives, scaling joint optimization to many MT models, and evaluation robustness via human assessment. Research into advanced supervised objectives (e.g., span-weighted token loss) and broader cross-domain adaptation could further strengthen the application of RL for source rewriting in MT.

Conclusion

By directly optimizing for translation quality improvements via reward-driven reinforcement learning, RLSR advances the state of source rewriting, reliably outperforms prompt-based baselines at the same parameter scale, and approaches performance of much larger models. The method demonstrates precise and effective editing policies, robust MT-system generalization, and strong empirical evidence for RL's fitness in text pre-editing tasks. The findings have important implications for the practical enhancement of both commercial and research MT systems, and suggest promising future directions in reward-driven text generation and editing.
