Emergent Mind

Attention Is All You Need

Published Jun 12, 2017 in cs.CL and cs.LG


The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.


  • The paper introduces the Transformer, a sequence transduction model that relies solely on attention mechanisms without convolutions or recurrences.

  • The Transformer's core is a self-attention mechanism that allows processing of sequences in parallel, enhancing training speed and efficiency.

  • Innovations include Scaled Dot-Product Attention, Multi-Head Attention, and Positional Encoding, which contribute to its superior performance.

  • The model reaches 28.4 BLEU on WMT 2014 English-to-German and a single-model 41.8 BLEU on English-to-French, surpassing previously reported results at a small fraction of the training cost.

  • The Transformer also generalizes to English constituency parsing, with both large and limited training data, which suggests promise for other sequence-based tasks.

Introduction to the Transformer Model

Sequence transduction, the task of converting one sequence into another, has been dominated by models built on recurrent or convolutional neural networks, typically augmented with attention mechanisms. The paper introduces a new architecture, the Transformer, which dispenses with both convolutions and recurrence in favor of a fully attention-based approach.

Core Architecture

The essence of the Transformer is its use of self-attention to capture global dependencies between input and output without relying on step-by-step computation over time. The architecture follows an encoder-decoder configuration: the encoder maps the input sequence to a set of representations, and the decoder uses those representations to generate the output sequence. Each layer of both the encoder and decoder employs multi-head self-attention, which lets the model attend to different parts of the sequence simultaneously; decoder layers additionally attend over the encoder's output. Because no computation is sequential across positions, training parallelizes well and takes less time.
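
As a rough illustration of the point about parallelism, the following NumPy sketch computes multi-head self-attention for an entire sequence with a few matrix products and no loop over time steps. The function name, the small dimensions, and the random weights are assumptions chosen for readability; this is not the paper's implementation, and it omits masking, residual connections, and layer normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads  # per-head dimension

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # One batched matrix product scores every query against every key,
    # so all positions are handled at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                               # (heads, seq, d_k)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

The output keeps the input shape, one vector per position, while `weights` holds one attention matrix per head, each row of which sums to 1.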

Technical Innovations and Benefits

The Transformer introduces several key innovations including:

  • Scaled Dot-Product Attention, a variant of attention that scales the dot products by the inverse square root of the key dimension, which keeps the softmax out of regions where gradients become extremely small.
  • Multi-Head Attention, allowing the model to jointly attend to information from different representation subspaces at different positions.
  • Positional Encoding, injecting information about the position of each element in the sequence to make up for the lack of recurrence or convolution.
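
Two of the items above can be checked numerically. The sketch below first builds sinusoidal positional encodings, using the sine and cosine formulation given in the paper (this summary does not spell it out), and then illustrates why the scaling factor matters: for vectors with independent unit-variance components, the raw dot product has variance close to the key dimension, and dividing by its square root brings the variance back to about 1. The function name, sizes, and sample counts are assumptions for illustration.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings of shape (max_len, d_model); d_model must be even."""
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]   # (1, d_model/2), the values 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)

# Effect of the 1/sqrt(d_k) scaling on the spread of dot products.
rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))
dots = (q * k).sum(axis=1)                      # one dot product per row pair
raw_var = dots.var()
scaled_var = (dots / np.sqrt(d_k)).var()
print(raw_var, scaled_var)                      # roughly d_k and roughly 1
```

In a full model the encodings would be added to the token embeddings before the first layer, so that otherwise order-blind attention layers can distinguish positions.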

This architecture surpasses previously reported results on the WMT 2014 English-to-German and English-to-French translation tasks in both translation quality and training cost. It reaches 28.4 BLEU on English-to-German, more than 2 BLEU above the previous best results, including ensembles, and a single-model state-of-the-art 41.8 BLEU on English-to-French after 3.5 days of training on eight GPUs. The Transformer can therefore be trained much faster than models based on recurrent or convolutional layers, which suggests a substantial advance in the efficiency and effectiveness of sequence transduction models.

Further Applications and Contributions

Beyond its performance on translation tasks, the Transformer was also applied successfully to English constituency parsing, with both large and limited training data, showing its versatility. Despite the lack of task-specific tuning, the model achieved high performance, rivaling state-of-the-art results in parsing.

In conclusion, the Transformer represents a major shift in sequence modeling, setting a new standard for both efficiency and performance. The paper also opens avenues for future research, including applying attention-based models to a broader range of sequence-based tasks and investigating local, restricted attention mechanisms to handle very large inputs and outputs efficiently. The associated code and more technical details can be found in the repository linked at the end of the paper.
