
Sparse Sequence-to-Sequence Models (1905.05702v2)

Published 14 May 2019 in cs.CL and cs.LG

Abstract: Sequence-to-sequence models are a powerful workhorse of NLP. Most variants employ a softmax transformation in both their attention mechanism and output layer, leading to dense alignments and strictly positive output probabilities. This density is wasteful, making models less interpretable and assigning probability mass to many implausible outputs. In this paper, we propose sparse sequence-to-sequence models, rooted in a new family of $\alpha$-entmax transformations, which includes softmax and sparsemax as particular cases, and is sparse for any $\alpha > 1$. We provide fast algorithms to evaluate these transformations and their gradients, which scale well for large vocabulary sizes. Our models are able to produce sparse alignments and to assign nonzero probability to a short list of plausible outputs, sometimes rendering beam search exact. Experiments on morphological inflection and machine translation reveal consistent gains over dense models.

Authors (3)
  1. Ben Peters (8 papers)
  2. Vlad Niculae (39 papers)
  3. André F. T. Martins (113 papers)
Citations (195)

Summary

The paper "Sparse Sequence-to-Sequence Models" by Ben Peters, Vlad Niculae, and André F. T. Martins introduces an advancement in the methodology of sequence-to-sequence (seq2seq) models by incorporating sparsity into both attention mechanisms and output predictions. This work bases itself on an innovative application of the α\alpha-entmax transformation, which includes softmax and sparsemax as specific instances, permitting sparsity in predictions when α>1\alpha > 1.

Background and Motivation

Seq2seq models are integral to a variety of NLP tasks, such as machine translation and morphological inflection. Traditional implementations are dense: they use the softmax function in both the attention mechanism and the output layer, so every source position receives nonzero attention weight and every vocabulary item receives nonzero output probability, even when implausible. This density reduces interpretability and wastes probability mass on unlikely outcomes.
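
To make the contrast concrete, the following small NumPy sketch (not from the paper) compares softmax with sparsemax (Martins & Astudillo, 2016): softmax assigns strictly positive probability to every entry, while sparsemax zeroes out low-scoring ones.

```python
import numpy as np

def softmax(z):
    # Dense: every entry receives strictly positive probability.
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    # Euclidean projection onto the probability simplex
    # (Martins & Astudillo, 2016): low-scoring entries become exactly 0.
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cssv = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cssv
    k_max = k[support][-1]
    tau = (cssv[support][-1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

logits = np.array([1.2, 0.9, 0.3, -1.0, -2.5])
print(softmax(logits))    # all five entries > 0, even the implausible ones
print(sparsemax(logits))  # [0.65, 0.35, 0.0, 0.0, 0.0]
```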

Contribution and Methodology

This research introduces sparse seq2seq models through $\alpha$-entmax transformations, which yield sparse probability distributions. With these transformations, the models produce more focused and interpretable attention alignments, and output distributions that are nonzero only for a small set of plausible outputs. The $\alpha$-entmax function is parameterized by $\alpha$ and interpolates between softmax ($\alpha = 1$) and sparsemax ($\alpha = 2$), giving flexible control over the degree of sparsity, as sketched below.
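
As a concrete illustration, the sketch below computes $\alpha$-entmax by bisecting on the threshold $\tau$ in $\mathrm{entmax}_\alpha(z)_i = [(\alpha - 1) z_i - \tau]_+^{1/(\alpha - 1)}$. This is only meant to make the definition tangible; the paper additionally derives an exact, sort-based algorithm for $\alpha = 1.5$, which this sketch does not reproduce.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax via bisection on the threshold tau (sketch).

    entmax_alpha(z)_i = [(alpha - 1) * z_i - tau]_+ ** (1 / (alpha - 1)),
    with tau chosen so that the result sums to 1. As alpha -> 1 this
    approaches softmax; alpha = 2 recovers sparsemax.
    """
    assert alpha > 1.0
    z = (alpha - 1.0) * np.asarray(z, dtype=float)
    # tau must lie in [max(z) - 1, max(z)]: the top entry stays positive
    # and no single probability can exceed 1.
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        p = np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))
        if p.sum() < 1.0:
            hi = tau          # threshold too high -> too little mass
        else:
            lo = tau          # threshold too low -> too much mass
    p = np.maximum(z - lo, 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()        # absorb the remaining bisection error

logits = np.array([1.2, 0.9, 0.3, -1.0, -2.5])
for a in (1.1, 1.5, 2.0):
    print(a, entmax_bisect(logits, alpha=a))  # sparser as alpha grows
```

At $\alpha = 2$ the output matches the sparsemax result above; as $\alpha$ decreases toward 1, more entries enter the support.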

Experimental Results and Findings

The authors develop fast algorithms to compute $\alpha$-entmax and its gradients, making the transformation practical for large vocabulary sizes. Experiments on morphological inflection and machine translation show that sparse seq2seq models deliver consistent improvements over dense models. Notably, restricting attention and output probability to fewer positions and tokens makes the models more interpretable without sacrificing accuracy.
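
A key reason the gradients are cheap is that the backward pass depends only on the forward output. The sketch below assumes the closed-form Jacobian $J = \operatorname{diag}(s) - s s^\top / \|s\|_1$ with $s_i = p_i^{2-\alpha}$ on the support and $s_i = 0$ elsewhere, which reduces to the familiar softmax Jacobian at $\alpha = 1$ and the sparsemax Jacobian at $\alpha = 2$.

```python
import numpy as np

def entmax_backward(p, grad_output, alpha=1.5):
    """Backward pass for alpha-entmax, using only the forward output p.

    Assumes the Jacobian J = diag(s) - s s^T / sum(s), where
    s_i = p_i ** (2 - alpha) if p_i > 0 and s_i = 0 otherwise, so that
    J @ g = s * (g - <s, g> / sum(s)). Entries outside the support
    receive exactly zero gradient.
    """
    s = np.where(p > 0.0, p ** (2.0 - alpha), 0.0)
    correction = np.dot(s, grad_output) / s.sum()
    return s * (grad_output - correction)

p = np.array([0.65, 0.35, 0.0, 0.0, 0.0])   # e.g. a sparsemax output
print(entmax_backward(p, np.array([1.0, 0.0, 0.0, 0.0, 0.0]), alpha=2.0))
# -> [ 0.5 -0.5  0.   0.   0. ]: zero-probability tokens get zero gradient
```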

Notably, in low-ambiguity tasks such as morphological inflection, decoding can become exact: the sparse output distributions assign nonzero probability to only a handful of candidates, so beam search can enumerate every plausible hypothesis without ever pruning one (see the sketch below). For machine translation, sparsity shrinks the effective search space and thus improves efficiency.
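
The toy beam-search sketch below illustrates the exactness argument. The `step_fn` interface is hypothetical, not the paper's code: it stands in for a decoder whose $\alpha$-entmax output layer returns only the tokens with nonzero probability. If the beam never overflows, no nonzero-probability hypothesis is ever discarded, so the best finished sequence is the true argmax (up to the length limit).

```python
import numpy as np

def sparse_beam_search(step_fn, bos, eos, beam_size=5, max_len=20):
    """Beam search over sparse next-token distributions (toy sketch).

    step_fn(prefix) returns a dict {token: prob} containing only tokens
    with nonzero probability. If no live hypothesis is ever pruned, the
    returned flag `exact` is True and the best sequence is the true
    highest-probability output (up to max_len).
    """
    beams = [((bos,), 0.0)]                  # (prefix, log-probability)
    finished = []
    exact = True
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, prob in step_fn(prefix).items():
                cand = (prefix + (tok,), score + np.log(prob))
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break                            # every hypothesis has ended
        candidates.sort(key=lambda c: c[1], reverse=True)
        if len(candidates) > beam_size:
            exact = False                    # a nonzero-prob path was dropped
        beams = candidates[:beam_size]
    best = max(finished + beams, key=lambda c: c[1])
    return best, exact
```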

Implications and Future Work

The implications of this work are multifaceted. Theoretical advancements include enhanced interpretability and potentially better alignment with human attention patterns in NLP tasks. Practically, the reduction in computational resource commitment to implausible predictions increases efficiency in model deployment.

For future research, integrating sparse attention into self-attentional models, particularly Transformer architectures, is a promising direction. Extending sparse transformations to other NLP tasks and related domains could further reduce computational cost while preserving or improving output quality. This research lays a foundation for more focused, efficient, and interpretable seq2seq models and could catalyze further innovation in language understanding and generation tasks.