- The paper demonstrates that Error Norm Truncation robustly filters noisy training data, yielding improved performance over standard MLE in text generation tasks.
- It leverages L2 error norms over full token distributions to distinguish clean from noisy data while preserving high-entropy examples.
- Empirical results validate the method: BLEU improvements of over two points in machine translation and consistent gains in summarization fine-tuning, even when 50% of the training data is noised.
Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models
The paper "Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models" by Tianjian Li et al. from Johns Hopkins University addresses an important challenge in neural text generation models—robust training amidst noisy data. The authors propose an innovative method termed "Error Norm Truncation" (ENT), which is a modification of the standard maximum likelihood estimation (MLE) training objective aimed at enhancing model robustness against data noise.
Problem Context and Limitations of Existing Approaches
Text generation models, such as those used in machine translation and text summarization, frequently encounter training data containing both natural and adversarial noise. Traditional MLE-based training requires the model to maximize the probability of the observed data, which is suboptimal when the data itself contains errors: the model learns to reproduce these inaccuracies, degrading generation quality.
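In standard notation, MLE minimizes the token-level negative log-likelihood of the reference sequence $y$ given a source $x$:

$$\mathcal{L}_{\text{MLE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, x\right)$$

When a reference token $y_t$ is itself erroneous, this objective still pushes $p_\theta$ toward it with full weight, which is precisely the failure mode ENT targets.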
Prior attempts to rectify this issue have either altered the autoregressive structure of MLE or adjusted the loss function to down-weight or exclude examples labeled as noisy based on the predicted probability of the target token. However, because these strategies rely on the target-token probability alone, they neglect the distributional information available across all non-target tokens. This narrow focus can mischaracterize high-entropy contexts, or contexts where the model has not yet converged, leading to the exclusion of valuable data or the retention of noisy examples. The toy comparison below makes this failure concrete.
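Here is a small illustration with invented numbers: two contexts in which the target token has the same predicted probability (0.1), so probability-based truncation cannot tell them apart, yet the L2 error norm separates them cleanly.

```python
# Toy comparison (invented numbers): same target probability, different error norm.
import torch

def error_norm(probs: torch.Tensor, target: int) -> float:
    """L2 distance between a predicted distribution and a one-hot target."""
    one_hot = torch.zeros_like(probs)
    one_hot[target] = 1.0
    return torch.linalg.vector_norm(probs - one_hot).item()

# Clean but high-entropy context: ten continuations are all plausible.
high_entropy = torch.full((10,), 0.1)
# Likely noisy label: the model confidently prefers a different token.
skewed = torch.tensor([0.10, 0.88] + [0.0025] * 8)

print(error_norm(high_entropy, target=0))  # ~0.95 -> keep
print(error_norm(skewed, target=0))        # ~1.26 -> truncate
```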
Introduction to Error Norm Truncation
To address these challenges, the authors introduce a more nuanced data truncation method: Error Norm Truncation. ENT uses the L2 norm (the "error norm") between the model's predicted probability distribution and the one-hot vector encoding the ground-truth token. Because this measure considers the entire predicted distribution rather than the target probability alone, it provides a richer estimate of data quality: a token is flagged only when the model places confident mass against the ground truth. ENT truncates tokens with high error norms, filtering out noisy instances while retaining useful high-entropy examples.
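A minimal PyTorch sketch of the idea, assuming a fixed, hypothetical threshold (the paper may determine the cutoff differently, e.g., per batch or as a truncation fraction); this is an illustration rather than the authors' reference implementation:

```python
# Sketch of Error Norm Truncation for one batch (hypothetical threshold).
import torch
import torch.nn.functional as F

def ent_loss(logits: torch.Tensor, targets: torch.Tensor,
             threshold: float = 1.2) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); targets: (batch, seq_len)."""
    probs = F.softmax(logits, dim=-1)                      # predicted distribution p
    one_hot = F.one_hot(targets, probs.size(-1)).float()   # ground-truth indicator e_y
    # Per-token L2 error norm ||p - e_y||_2 over the full vocabulary.
    error_norm = torch.linalg.vector_norm(probs - one_hot, ord=2, dim=-1)
    # Truncate (zero out) tokens whose error norm exceeds the threshold.
    keep = (error_norm <= threshold).float()
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (nll * keep).sum() / keep.sum().clamp(min=1.0)
```

Since $\lVert p - e_y \rVert_2 \le \sqrt{2}$ for any distribution $p$, the threshold lives in a small, bounded range, which makes it easier to reason about than a raw loss cutoff.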
Methodological Implementation and Advantages
ENT offers several advantages over prior truncation methods across noisy training environments:
- Cleaner separation of clean and noisy data without discarding valid high-entropy contexts.
- Reduced sensitivity to the training step at which truncation begins, yielding consistent performance improvements.
- A quantitative connection between the L2 error norm and total variation distance, giving a theoretically grounded refinement of MLE, which corresponds to minimizing a KL divergence (see the bound sketched below).
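One way to make this connection concrete (a standard norm inequality over a vocabulary $V$, not necessarily the paper's exact statement): total variation distance is half the L1 distance, and the L1 and L2 norms bound each other, so

$$\frac{1}{2}\,\lVert p_\theta - e_y \rVert_2 \;\le\; \mathrm{TV}(p_\theta, e_y) \;\le\; \frac{\sqrt{|V|}}{2}\,\lVert p_\theta - e_y \rVert_2,$$

where $e_y$ is the one-hot vector of the ground-truth token. A large error norm therefore certifies that the model's distribution is far from the reference in total variation, and vice versa up to a vocabulary-size factor.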
Empirical Validation
The proposed ENT methodology was rigorously tested across several domains:
- Machine Translation and Language Modeling: ENT consistently surpassed traditional MLE and alternative truncation methods, with BLEU improvements of more than two points on datasets in which 50% of the examples were noised.
- Robustness Against Synthetic Noise: In scenarios involving untranslated text and misordered words, ENT demonstrated greater resilience than existing methods, underscoring its applicability across diverse linguistic tasks.
- Fine-Tuning for Summarization: When fine-tuning models such as T5-small and BART-base, ENT outperformed baseline objectives, reinforcing its utility across different model architectures (a fine-tuning sketch follows this list).
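As an illustration of how ENT drops into an existing fine-tuning loop, here is a hedged sketch with Hugging Face Transformers, reusing the `ent_loss` function above; the model name, learning rate, and omission of padding masking are simplifications for brevity, not details from the paper.

```python
# Hypothetical fine-tuning step for t5-small with the ENT loss sketched above.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def train_step(src_texts: list[str], tgt_texts: list[str]) -> float:
    batch = tokenizer(src_texts, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(tgt_texts, return_tensors="pt", padding=True, truncation=True).input_ids
    logits = model(**batch, labels=labels).logits   # (batch, seq_len, vocab)
    loss = ent_loss(logits, labels)                 # ENT replaces the default MLE loss
    # NOTE: a real run should also mask padding tokens out of the loss.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```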
Theoretical and Practical Implications
Practically, ENT introduces a scalable, computationally efficient method for enhancing model robustness, which matters for large-scale NLP systems trained on vast, diverse datasets. Theoretically, it sets a precedent for incorporating the full predicted probability distribution into data-quality metrics, a direction potentially beneficial for future research into other forms of model uncertainty and robustness.
Concluding Remarks
The Error Norm Truncation approach represents a notable advance in training text generation models under noisy conditions. Its ability to filter noise more accurately than previous methods suggests promising avenues for both applied and theoretical work. Researchers should explore integrating such distribution-aware error measures with other learning paradigms to bolster model performance in dynamic, real-world applications.