
Attention Is All You Need (1706.03762v7)

Published 12 Jun 2017 in cs.CL and cs.LG

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.


Summary

  • The paper presents the Transformer architecture that replaces recurrence with self-attention, enabling improved parallelization and efficiency.
  • It introduces multi-head attention and positional encoding to capture long-range dependencies and diverse relationships in sequence transduction tasks.
  • Experimental results show state-of-the-art BLEU scores on the WMT 2014 translation tasks and competitive results on English constituency parsing, demonstrating the model's effectiveness and generality.

The Transformer Network and the "Attention Is All You Need" Paper

The paper "Attention Is All You Need" (1706.03762) introduces the Transformer, a novel neural network architecture that relies entirely on attention mechanisms for sequence transduction tasks, moving away from the dominant recurrent and convolutional approaches. This architectural shift enables greater parallelization and reduced training times, while achieving state-of-the-art results in machine translation.

Core Architectural Innovations

The Transformer architecture (Figure 1) abandons recurrence and convolutions in favor of self-attention mechanisms. It comprises an encoder and a decoder, both built from stacked layers. The encoder maps an input sequence to a sequence of continuous representations, and the decoder generates an output sequence one element at a time, auto-regressively.

Figure 1: The Transformer model architecture, showcasing encoder and decoder stacks.

Encoder and Decoder Stacks

Both the encoder and decoder consist of N = 6 identical layers. Each encoder layer includes a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Residual connections and layer normalization are applied around each sub-layer. The decoder mirrors this structure but includes an additional sub-layer that performs multi-head attention over the encoder's output. Masking is used in the decoder's self-attention sub-layer to prevent attending to subsequent positions, maintaining the auto-regressive property.
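
To make the residual-and-normalization pattern concrete, here is a minimal NumPy sketch of one encoder layer under stated assumptions: `self_attention` and `feed_forward` are placeholder callables standing in for the paper's sub-layers, and the learnable gain/bias of layer normalization as well as dropout are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    # x: (seq_len, d_model). Each sub-layer is wrapped as LayerNorm(x + Sublayer(x)),
    # matching the residual-connection-plus-layer-norm description above.
    x = layer_norm(x + self_attention(x))  # multi-head self-attention sub-layer
    x = layer_norm(x + feed_forward(x))    # position-wise feed-forward sub-layer
    return x
```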

Scaled Dot-Product Attention

The core of the Transformer is the Scaled Dot-Product Attention mechanism (Figure 2), which computes attention weights from the dot products of queries and keys, scaled by 1/√d_k, where d_k is the dimension of the keys. Without this scaling, large values of d_k produce large dot products that push the softmax into regions with extremely small gradients.

Figure 2: Illustration of Scaled Dot-Product Attention (left) and Multi-Head Attention (right).
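
As a concrete illustration, the following is a minimal NumPy sketch of the scaled dot-product attention computation softmax(QKᵀ/√d_k)V; the shapes, the `softmax` helper, and the optional mask argument are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled dot products of queries and keys
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # e.g. decoder masking of future positions
    weights = softmax(scores, axis=-1)         # attention distribution over the keys
    return weights @ V                         # weighted sum of the values
```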

Multi-Head Attention

Multi-Head Attention (Figure 2) extends the Scaled Dot-Product Attention by linearly projecting the queries, keys, and values h times with different learned linear projections. This allows the model to attend to information from different representation subspaces, capturing more diverse dependencies. The outputs of the parallel attention heads are concatenated and projected to produce the final output. The paper uses h = 8 parallel attention layers.
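
Building on the attention sketch above (the `scaled_dot_product_attention` function is assumed to be in scope), the multi-head computation can be outlined as follows; the projection matrices `W_q`, `W_k`, `W_v`, `W_o` stand in for the learned per-head projections and are passed in explicitly as an illustrative assumption.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: lists with one (d_model, d_k) or (d_model, d_v) matrix per head;
    # W_o: output projection of shape (h * d_v, d_model).
    heads = [
        scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
        for i in range(len(W_q))
    ]
    return np.concatenate(heads, axis=-1) @ W_o  # concatenate heads, then project

# Example with the paper's base dimensions (d_model = 512, h = 8, d_k = d_v = 64):
# x = np.random.randn(10, 512)
# W_q = [np.random.randn(512, 64) for _ in range(8)]   # similarly W_k, W_v
# W_o = np.random.randn(8 * 64, 512)
# multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)    # self-attention, shape (10, 512)
```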

Positional Encoding

To incorporate information about the order of tokens in the sequence, the Transformer employs positional encodings. Sine and cosine functions of different frequencies are added to the input embeddings. This allows the model to leverage relative positional information.
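
Concretely, the paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A small NumPy sketch, assuming an even d_model:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices: (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions (d_model assumed even)
    return pe  # added elementwise to the input embeddings
```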

Advantages of Self-Attention

The paper argues that self-attention offers several advantages over recurrent and convolutional layers for sequence transduction:

  • Parallelization: Self-attention allows for more parallelization than recurrent layers, as it does not require sequential computation along the symbol positions.
  • Computational Complexity: Self-attention layers have lower per-layer computational complexity than recurrent layers when the sequence length n is smaller than the representation dimensionality d, a common scenario in machine translation.
  • Long-Range Dependencies: Self-attention reduces the path length between long-range dependencies in the network, making it easier to learn these dependencies.

Training Details

The Transformer models were trained on the WMT 2014 English-German and English-French datasets, with sentences encoded using byte-pair encoding. The Adam optimizer was used with a learning rate schedule that increases the rate linearly for the first warmup_steps training steps and then decreases it proportionally to the inverse square root of the step number. Regularization techniques, including residual dropout and label smoothing, were employed to prevent overfitting.
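
The schedule can be written as lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), with warmup_steps = 4000 in the paper. A minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first `warmup_steps` steps, then decay proportional
    # to the inverse square root of the step number.
    step = max(step, 1)  # guard against division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# transformer_lr(4000) ≈ 7.0e-4 at the end of warmup; the rate decays thereafter.
```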

Experimental Results

The Transformer achieved state-of-the-art results on the WMT 2014 English-to-German translation task, reaching 28.4 BLEU and outperforming the previous best results, including ensembles, by more than 2 BLEU. It also established a new single-model state-of-the-art score of 41.8 BLEU on the WMT 2014 English-to-French translation task. The models were trained in significantly less time than previous state-of-the-art models. The paper also demonstrates the Transformer's generalization ability by applying it to English constituency parsing, achieving competitive results.

Analysis of Attention Mechanisms

The paper provides visualizations of the attention distributions learned by the Transformer models (Figures 3, 4, and 5). These visualizations reveal that different attention heads learn to perform different tasks, and many appear to capture syntactic and semantic relationships in the sentences. For example, some heads attend to long-distance dependencies, while others appear to be involved in anaphora resolution.

Figure 3: An example of the attention mechanism following long-distance dependencies.

Figure 4: Two attention heads involved in anaphora resolution.

Figure 5: Examples of attention heads exhibiting behavior related to sentence structure.

Impact and Future Directions

The Transformer architecture has had a significant impact on the field of NLP, paving the way for models like BERT, GPT, and other LLMs. The paper identifies several promising directions for future research, including extending the Transformer to other tasks and modalities, investigating local attention mechanisms for handling large inputs and outputs, and reducing the sequentiality of generation.

Conclusion

The "Attention Is All You Need" paper introduced the Transformer, a novel and highly influential neural network architecture that relies entirely on attention mechanisms. The Transformer's ability to parallelize computation, its reduced computational complexity, and its effectiveness in capturing long-range dependencies have made it a cornerstone of modern NLP. The paper's findings have broad implications for sequence modeling and transduction tasks, and it has spurred a significant amount of research in attention-based models.
