Adaptively Sparse Transformers
The paper "Adaptively Sparse Transformers" introduces a novel modification to the Transformer architecture, aimed at enhancing the sparsity of attention mechanisms. This work presents the adaptively sparse Transformer, which offers flexibility in attention head sparsity that is both context-dependent and learnable.
Introduction and Motivation
The Transformer model, prominent in NLP tasks and particularly in Neural Machine Translation (NMT), uses multi-head attention to derive context-aware word representations. Conventionally, the attention weights are computed with softmax, which assigns a non-zero weight to every word in the context. The paper argues that such dense attention can obscure interpretability and limit the model's flexibility.
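To make this density concrete, the following minimal NumPy sketch (an illustration, not code from the paper) computes scaled dot-product attention for a single query and shows that softmax assigns a strictly positive weight to every position in the context.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: every output entry is strictly positive."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4                         # head dimension (illustrative choice)
q = rng.normal(size=d)        # one query vector
K = rng.normal(size=(6, d))   # keys for six context positions

scores = K @ q / np.sqrt(d)   # scaled dot-product scores
weights = softmax(scores)

print(weights)                    # dense: every position gets some attention
print(bool((weights > 0).all()))  # True
```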
Methodology
The key innovation in this paper is the replacement of softmax with α-entmax, a differentiable generalization of softmax that permits sparse attention distributions by assigning exactly zero weight to some positions. Sparsity is controlled by the parameter α, which the authors propose to learn automatically for each attention head: α = 1 recovers softmax, while larger values (with α = 2 corresponding to sparsemax) yield increasingly sparse distributions. Learning α allows the model to switch between dense and sparse attention, adapting the sparsity pattern of each attention head to the context.
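For intuition, the self-contained NumPy sketch below contrasts softmax with sparsemax, the α = 2 special case of α-entmax. It is an illustration rather than the authors' implementation, which handles arbitrary, learnable values of α.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Sparsemax (the alpha = 2 case of alpha-entmax): projects the scores
    onto the probability simplex and can return exact zeros."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum     # positions kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z   # threshold shared by all scores
    return np.maximum(z - tau, 0.0)

scores = np.array([1.5, 1.2, 0.1, -1.0])
print(softmax(scores))    # dense: all four weights strictly positive
print(sparsemax(scores))  # sparse: [0.65, 0.35, 0.0, 0.0]
```

Softmax spreads probability mass over every position, whereas sparsemax truncates the low-scoring ones to exactly zero; α-entmax interpolates between these two behaviors as α varies.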
Numerical Results
The adaptively sparse Transformer was evaluated on several machine translation datasets, where it matches or slightly surpasses the performance of the standard Transformer. Notably, it does so without increasing the model's complexity, preserving accuracy while introducing sparsity that encourages diverse specialization across attention heads.
Analysis and Implications
An in-depth analysis demonstrates that different heads adopt varying sparsity patterns, thereby improving head diversity. This diversity is quantitatively measured using Jensen-Shannon Divergence, showing greater disagreement among heads compared to the softmax baseline. Moreover, certain heads exhibit clear specializations, such as positional awareness or BPE-merging capabilities, which enhance interpretability.
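As a rough sketch of how such disagreement can be quantified, the snippet below computes the Jensen-Shannon divergence between the attention distributions of two hypothetical heads over the same source positions; it illustrates the underlying quantity only, not the paper's full evaluation protocol.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two attention distributions."""
    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))
    p = p + eps                # avoid log(0) for exactly-zero weights
    q = q + eps
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy attention distributions over four source positions from two heads.
focused_head = np.array([0.70, 0.20, 0.10, 0.00])  # sparse, peaked head
diffuse_head = np.array([0.25, 0.25, 0.25, 0.25])  # near-uniform head

print(js_divergence(focused_head, diffuse_head))   # higher = more disagreement
print(js_divergence(focused_head, focused_head))   # ~0 for identical heads
```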
Theoretical and Practical Implications
From a theoretical standpoint, adaptively sparse attention has implications for the study of attention mechanisms in neural networks, suggesting that sparsity can be beneficial and learned dynamically. Practically, these findings suggest potential efficiency gains: since fewer attention weights are non-zero, less computation is needed in principle, offering possible speed improvements without loss of accuracy.
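The hypothetical sketch below illustrates where such savings could come from: once a head's attention weights contain exact zeros, the context vector can be assembled from the supported positions alone.

```python
import numpy as np

# Hypothetical illustration: with exact zeros in the attention weights,
# the context vector only needs the value vectors at supported positions.
weights = np.array([0.65, 0.35, 0.0, 0.0])              # sparse attention weights
values = np.random.default_rng(0).normal(size=(4, 8))   # one value vector per position

support = np.nonzero(weights)[0]                         # indices with non-zero weight
sparse_context = weights[support] @ values[support]      # skips zero-weight rows
dense_context = weights @ values                         # reference dense computation

print(np.allclose(sparse_context, dense_context))        # True: same result, less work
```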
Future Directions
The paper opens avenues for further exploration of static variants inspired by the dynamic behaviors identified in this model, such as fixed positional heads. Moreover, the methodology for adaptively learning sparsity parameters could be applied to architectures beyond Transformers to assess its general utility in deep learning.
In summary, this paper contributes a valuable perspective on managing attention sparsity in Transformers, providing insights into both improving model interpretability and maintaining performance through adaptive approaches.