Sequence-to-Sequence Learning as Beam-Search Optimization
The paper "Sequence-to-Sequence Learning as Beam-Search Optimization" tackles the entrenched issues within seq2seq models by reframing their training process. Sequence-to-sequence (seq2seq) models have been pivotal in numerous NLP tasks like machine translation and text generation. The existing training paradigm for seq2seq models primarily involves optimizing a word-level loss function. This practice leads to well-documented challenges such as exposure bias and loss-evaluation mismatch. The authors introduce an innovative approach called beam-search optimization (BSO) to address these limitations and propose a non-probabilistic scoring mechanism to train seq2seq models using sequence-level objectives.
Contributions
The key contributions of this paper are as follows:
- Reformulation of Seq2seq Modeling: By replacing probability-based next-word prediction with a non-probabilistic scoring function, the authors can assign global scores to entire sequences. This shift mitigates exposure bias (under word-level training the model only ever conditions on gold histories, yet at test time it must condition on its own predictions) because training now incorporates histories generated by beam search.
- Integration of Structured Prediction Techniques: Drawing on learning as search optimization (LaSO), the paper runs beam search during training and defines a search-based loss that penalizes the model whenever the gold prefix falls off the beam, thereby exposing the model to non-gold histories (see the sketch after this list).
- Incorporation of Hard Constraints: Because hypotheses are expanded through a successor function, $\mathrm{succ}$, the method can enforce domain-specific hard constraints during both training and decoding, improving performance on structured prediction tasks such as parsing and word ordering.
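To make the search-based loss concrete, below is a minimal Python sketch of LaSO-style beam-search training. The names `bso_loss`, `score_step` (standing in for the learned, non-probabilistic incremental scorer), and `allowed_successors` (playing the role of the successor function) are hypothetical, and the sketch is not the authors' exact implementation: it only illustrates the core idea of incurring a margin violation whenever the gold prefix falls off the beam and restarting the search from the gold prefix.

```python
# Minimal sketch of LaSO-style beam-search training (not the paper's exact
# implementation). `score_step` and `allowed_successors` are hypothetical
# stand-ins for the learned scorer and the successor (succ) function.

def bso_loss(gold, vocab, score_step, allowed_successors, beam_size=5):
    """Sum of margin violations incurred whenever the gold prefix falls off the beam."""
    loss = 0.0
    beam = [((), 0.0)]          # hypotheses stored as (prefix, cumulative score)
    gold_score = 0.0
    for t, gold_word in enumerate(gold):
        # Expand each hypothesis with its allowed successors (hard constraints).
        candidates = []
        for prefix, score in beam:
            for w in allowed_successors(prefix, vocab):
                candidates.append((prefix + (w,), score + score_step(prefix, w)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]

        gold_prefix = tuple(gold[:t + 1])
        gold_score += score_step(tuple(gold[:t]), gold_word)
        on_beam = any(prefix == gold_prefix for prefix, _ in beam)
        last_step = t == len(gold) - 1

        if not on_beam or (last_step and beam[0][0] != gold_prefix):
            # Margin violation: the gold prefix should outscore the K-th ranked
            # hypothesis (or the best hypothesis at the final step) by a margin of 1.
            rival_score = beam[0][1] if last_step else beam[-1][1]
            loss += max(0.0, 1.0 - gold_score + rival_score)
            # LaSO-style restart: continue the search from the gold prefix.
            beam = [(gold_prefix, gold_score)]
    return loss


# Toy usage with a random scorer and a word-ordering style constraint:
# each input word may be used exactly once (outputs are permutations of the input).
if __name__ == "__main__":
    import random
    rng = random.Random(0)
    cache = {}
    def score_step(prefix, w):
        return cache.setdefault((prefix, w), rng.uniform(-1.0, 1.0))
    def allowed_successors(prefix, vocab):
        remaining = list(vocab)
        for w in prefix:
            remaining.remove(w)
        return remaining
    print(bso_loss(("the", "cat", "sat"), ["the", "cat", "sat"],
                   score_step, allowed_successors, beam_size=2))
```

In the paper itself, violations are additionally scaled by a cost term and training begins from a model pre-trained with the standard word-level loss, with the beam size increased gradually; those details are omitted from the sketch.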
Experimental Validation
The authors validate their approach on three diverse NLP tasks: word ordering, dependency parsing, and machine translation. On each task, they demonstrate significant improvements over a highly optimized attention-based seq2seq baseline:
- Word Ordering: The constrained BSO variant restricts the output to permutations of the input words, and this constraint yields higher BLEU scores than the baseline.
- Dependency Parsing: By applying the structured beam-search training strategy, the model achieves improved UAS and LAS scores, outperforming traditional seq2seq methods.
- Machine Translation: Training with a sequence-level, BLEU-based cost yields substantial BLEU gains, highlighting the importance of matching the training objective to the evaluation metric (a cost-weighted margin sketch follows this list).
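To illustrate how a sequence-level metric can enter the training objective, here is a small, hypothetical sketch of a cost-weighted margin: the penalty for a violation is scaled by $\Delta = 1 - \text{BLEU}$ of the offending hypothesis, so hypotheses that are worse under the evaluation metric are pushed down harder. The helper `sentence_bleu` is assumed to be supplied by the user (e.g., a smoothed sentence-level BLEU); the exact cost used in the paper may differ.

```python
# Hypothetical cost-weighted margin: scale the violation penalty by how bad
# the rival hypothesis is under the evaluation metric (here, sentence BLEU).

def cost_weighted_margin(gold_score, rival_score, rival_tokens, ref_tokens,
                         sentence_bleu, margin=1.0):
    delta = 1.0 - sentence_bleu(rival_tokens, ref_tokens)   # assumed to lie in [0, 1]
    return delta * max(0.0, margin - gold_score + rival_score)
```

Substituted for the unit penalty in the earlier sketch, this ties the size of each training update to the metric reported at evaluation time.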
Implications and Future Directions
The restructured training methodology proposed in this paper has important implications for seq2seq applications. By alleviating exposure bias and allowing sequence-level costs to enter the training objective, the method yields more coherent and contextually appropriate output sequences.
Extending this approach to larger datasets and more complex architectures, such as transformers, is a promising avenue for future research. Moreover, integrating reinforcement learning techniques or refining the cost functions could further improve sequence-level prediction. The paper takes a foundational step toward more robust seq2seq training paradigms by aligning the training procedure with sequence-level evaluation metrics.
In conclusion, the paper offers a substantial refinement in seq2seq training by integrating beam search optimization, potentially broadening the applicability and effectiveness of these models in diverse NLP tasks.