Classical Structured Prediction Losses for Sequence to Sequence Learning
The paper applies classical structured prediction objective functions to neural sequence-to-sequence (seq2seq) models and evaluates them against recent sequence-level learning strategies. The authors, Edunov et al., take a range of losses traditionally used with linear models in NLP and adapt them to neural models, focusing on sequence-level training for machine translation and abstractive summarization.
The paper begins by outlining the motivation for sequence-level training in seq2seq models, noting the inconsistency between token-level training and sequence-level inference. Recent sequence-level training methods, including reinforcement learning techniques such as REINFORCE and actor-critic as well as beam search optimization, are contrasted with classical structured prediction approaches.
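To make the mismatch concrete (a sketch, using standard notation rather than the paper's exact formulation): token-level training maximizes the per-token likelihood of the reference t given the source x, while decoding searches, approximately via beam search, for the highest-probability full sequence, which is then judged by a sequence-level metric such as BLEU that the token-level loss never sees:

    \mathcal{L}_{TokNLL} = -\sum_{i=1}^{|t|} \log p(t_i \mid t_1, \dots, t_{i-1}, x),
    \qquad \hat{u} = \arg\max_{u} p(u \mid x)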
The authors revisit several well-established structured prediction objectives, including sequence-level negative log-likelihood (SeqNLL), expected risk minimization (Risk), max-margin, multi-margin, and softmax-margin losses, and analyze their efficacy when applied to neural seq2seq models; two of them are sketched below.
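In notation close to, but not identical with, the paper's: let U(x) be a set of candidate hypotheses generated by beam search, s(u) the model's sequence score (the sum of token log-probabilities), u* the candidate with the lowest cost against the reference t, and cost(t, u) a task cost such as 1 - BLEU. Then SeqNLL and Risk take roughly the form

    \mathcal{L}_{SeqNLL} = -s(u^*) + \log \sum_{u \in \mathcal{U}(x)} \exp s(u)

    \mathcal{L}_{Risk} = \sum_{u \in \mathcal{U}(x)} \mathrm{cost}(t, u)\,
        \frac{\exp s(u)}{\sum_{u' \in \mathcal{U}(x)} \exp s(u')}

The margin-based variants instead compare the score of u* against competing candidates, with the softmax-margin loss additionally adding the cost term inside the log-sum-exp.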
The experimental component of the paper is extensive, covering multiple NLP tasks: IWSLT'14 German-English translation, Gigaword abstractive summarization, and the large-scale WMT'14 English-French translation task. The results show that classical sequence-level losses are competitive with recent innovations in sequence-level optimization such as beam search optimization (BSO). In particular, the Risk loss performs best among the losses studied, yielding state-of-the-art results on the IWSLT'14 and Gigaword tasks.
Concretely, the models reach a test BLEU of 32.84 on IWSLT'14 German-English translation and strong ROUGE scores on Gigaword summarization. On the WMT'14 English-French task, the authors' models achieve 41.5 BLEU, on par with the state of the art.
Practically, the paper shows that classical structured prediction losses remain viable and competitive for seq2seq training, often rivaling newer methods based on reinforcement learning. The findings argue for revisiting classical techniques and integrating them into the optimization of neural models wherever sequence-level objectives matter.
The paper also identifies candidate generation as the main bottleneck of sequence-level training and a natural target for future work: because candidate sequences must be regenerated by beam search as the model changes, training is considerably slower than token-level training, and making this step more efficient is a rich area for further development. A rough sketch of such a training step follows.
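The sketch below illustrates, under stated assumptions, how one training step with the Risk loss might look when candidates are regenerated online; beam_search, sequence_log_prob, and sentence_bleu are hypothetical helpers, not functions from the authors' code, and the whole snippet is a minimal illustration rather than the paper's implementation.

    import torch

    def expected_risk_loss(model, src, ref, beam_size=16):
        """Minimal sketch of the Risk objective over freshly generated candidates."""
        # Regenerate candidates with beam search for the current model state
        # (this per-batch decoding is what makes sequence-level training slow).
        candidates = beam_search(model, src, beam_size)          # hypothetical helper

        # Model score of each candidate: sum of its token log-probabilities.
        scores = torch.stack([sequence_log_prob(model, src, u)   # hypothetical helper
                              for u in candidates])

        # Task cost of each candidate, e.g. 1 - sentence-level BLEU vs. the reference.
        costs = torch.tensor([1.0 - sentence_bleu(ref, u)        # hypothetical helper
                              for u in candidates])

        # Renormalize the model distribution over the candidate set only.
        probs = torch.softmax(scores, dim=0)

        # Expected cost under that distribution; differentiable w.r.t. the scores.
        return (probs * costs).sum()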
Overall, the paper is a useful reminder of the enduring relevance of classical structured prediction methods within deep learning, and its thorough empirical comparison should inform future work on sequence-level optimization.