Attention Is All You Need (1706.03762v7)

Published 12 Jun 2017 in cs.CL and cs.LG

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

The paper "Attention Is All You Need" (Vaswani et al., 2017 ) introduced the Transformer architecture, a sequence transduction model entirely based on attention mechanisms, eschewing recurrence and convolution. This approach aimed to improve parallelization and reduce training time while achieving state-of-the-art results in tasks like machine translation.

Architecture Overview

The Transformer follows an encoder-decoder structure, common in sequence-to-sequence tasks. Both the encoder and decoder are composed of a stack of identical layers.

  • Encoder: The encoder stack consists of N = 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. All sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512. (A minimal sketch of this residual-plus-normalization pattern follows Figure 1.)
  • Decoder: The decoder stack also consists of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, residual connections and layer normalization are applied around each sub-layer. The self-attention sub-layer in the decoder is modified to prevent positions from attending to subsequent positions, preserving the auto-regressive property. This is achieved by masking out (setting to −∞) all values in the input of the softmax that correspond to illegal connections.

+---------------------+      +---------------------+
|      Output         |      |      Outputs        |
|    Probabilities    |      |   (shifted right)   |
+----------+----------+      +----------+----------+
           ^                         |
           |                         v
+----------+----------+      +----------+----------+
|   Linear + Softmax  |      |  Output Embedding   |
+----------+----------+      +----------+----------+
           ^                         |
           |                         v
+----------+----------+      +----------+----------+
|    Decoder Stack    |      |+ Positional Encoding|
|       (N layers)    |      +----------+----------+
| - Masked Multi-Head |                 ^
|   Self-Attention    |                 |
| - Multi-Head Attn   |                 |
|   (Encoder Output)  |                 |
| - Feed Forward      |                 |
| - Add & Norm        |                 |
+----------+----------+                 |
           ^----------------------------+
           |
+----------+----------+
|    Encoder Stack    |
|       (N layers)    |
| - Multi-Head        |
|   Self-Attention    |
| - Feed Forward      |
| - Add & Norm        |
+----------+----------+
           ^
           |
+----------+----------+
| Input Embedding + PE|
+----------+----------+
           ^
           |
+----------+----------+
|      Inputs         |
+---------------------+

                Figure 1: Transformer Model Architecture
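
The residual-plus-normalization pattern wrapped around every sub-layer can be made concrete in a few lines. The following is a minimal NumPy sketch of LayerNorm(x + Sublayer(x)), not the authors' implementation: `layer_norm` omits the learnable gain and bias of full layer normalization, and `sublayer` stands in for any attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    # (Full LayerNorm also has learnable gain/bias parameters, omitted here.)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection followed by layer normalization,
    # applied around every attention and feed-forward sub-layer in the model.
    return layer_norm(x + sublayer(x))

# Example on a (sequence_length, d_model) activation with an identity "sub-layer".
x = np.random.randn(10, 512)
y = add_and_norm(x, lambda h: h)   # shape (10, 512)
```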

Attention Mechanisms

The core of the Transformer lies in its attention mechanisms, specifically Scaled Dot-Product Attention and Multi-Head Attention.

Scaled Dot-Product Attention

The input consists of queries Q and keys K of dimension d_k, and values V of dimension d_v. The attention output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The formula is:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The dot products QK^T compute the compatibility scores. Scaling by 1/√d_k prevents the dot products from growing too large in magnitude, which could push the softmax function into regions with extremely small gradients.
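
As a rough illustration, the formula above can be written directly in NumPy. The sketch below is a simplified reading of the paper's description rather than its implementation; the optional `mask` argument shows how the decoder's illegal connections can be blocked before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        # Positions where mask is False are illegal; push them toward -inf.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)               # attention weights
    return weights @ V                               # weighted sum of the values

# Causal mask for decoder self-attention: position i attends only to positions <= i.
seq_len, d_k = 5, 64
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
Q = K = V = np.random.randn(seq_len, d_k)
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)  # shape (5, 64)
```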

Multi-Head Attention

Instead of performing a single attention function with d_model-dimensional keys, values, and queries, the authors found it beneficial to linearly project the queries, keys, and values h times with different, learned linear projections to d_k, d_k, and d_v dimensions, respectively. Attention is then performed in parallel on each of these projected versions. The outputs of the h heads are concatenated and once again projected, producing the final values.

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

where \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

The projection matrices are W_i^Q ∈ ℝ^(d_model × d_k), W_i^K ∈ ℝ^(d_model × d_k), W_i^V ∈ ℝ^(d_model × d_v), and W^O ∈ ℝ^(h·d_v × d_model). In the paper's implementation, h = 8 heads are used, and for each head d_k = d_v = d_model / h = 512 / 8 = 64. Because each head operates on a reduced dimension, the total computational cost is similar to that of single-head attention with full dimensionality.
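
A per-head view of these equations can be sketched as follows. This is a NumPy illustration with random weights, not the paper's code; a practical implementation would typically use fused projection matrices and batched tensor operations rather than an explicit Python loop over heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for a single head.
    d_k = Q.shape[-1]
    return softmax(Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h=8):
    # W_q, W_k: lists of h matrices (d_model, d_k); W_v: list of (d_model, d_v);
    # W_o: (h * d_v, d_model). Each head attends over its own projected subspace.
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o      # Concat(head_1..head_h) W^O

# Shapes from the paper: d_model = 512, h = 8, d_k = d_v = 64.
d_model, h, d_k = 512, 8, 64
rng = np.random.default_rng(0)
W_q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d_model))
x = rng.standard_normal((10, d_model))                       # (seq_len, d_model)
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, h)   # (10, 512)
```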

Applications of Attention in the Transformer

Multi-Head Attention is used in three distinct ways:

  1. Encoder Self-Attention: In the encoder layers, Q, K, and V all come from the output of the previous encoder layer, so each position in the encoder can attend to all positions in the previous layer of the encoder.
  2. Decoder Self-Attention: In the decoder layers, Q, K, and V all come from the output of the previous decoder layer. Self-attention is restricted (masked) so that each position can only attend to preceding positions (including itself), maintaining the auto-regressive property.
  3. Encoder-Decoder Attention: In the third sub-layer of the decoder, Q comes from the previous decoder layer, while K and V come from the output of the encoder stack. This allows every position in the decoder to attend over all positions in the input sequence. (A minimal sketch of this wiring follows the list.)
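
The three uses differ only in where Q, K, and V come from and whether a mask is applied. The sketch below is a schematic with a single attention head and random tensors, purely to make the wiring explicit; names such as `enc_in`, `enc_out`, and `dec_in` are illustrative and the linear projections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block illegal connections
    return softmax(scores) @ V

d_model, src_len, tgt_len = 512, 12, 7
rng = np.random.default_rng(0)
enc_in = rng.standard_normal((src_len, d_model))    # encoder layer input
dec_in = rng.standard_normal((tgt_len, d_model))    # decoder layer input

# 1. Encoder self-attention: Q = K = V = previous encoder layer output.
enc_out = attention(enc_in, enc_in, enc_in)

# 2. Decoder self-attention: Q = K = V = previous decoder layer output,
#    with a causal mask so position i only attends to positions <= i.
causal = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
dec_self = attention(dec_in, dec_in, dec_in, mask=causal)

# 3. Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
dec_cross = attention(dec_self, enc_out, enc_out)   # shape (tgt_len, d_model)
```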

Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network (FFN) that is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2

The dimensionality of the input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048. While the linear transformations are the same across different positions, they use different parameters from layer to layer.
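
This two-layer network can be sketched in a few lines. The snippet below is a NumPy illustration with random weights; in the model, W1, b1, W2, and b2 are learned separately for each layer.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently to each position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

x = rng.standard_normal((seq_len, d_model))
y = position_wise_ffn(x, W1, b1, W2, b2)   # shape (seq_len, d_model)
```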

Positional Encoding

Since the model contains no recurrence or convolution, positional information is injected using positional encodings added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, allowing the two to be summed. The paper uses sine and cosine functions of different frequencies:

PE_{(\text{pos}, 2i)} = \sin(\text{pos} / 10000^{2i/d_{\text{model}}})

PE_{(\text{pos}, 2i+1)} = \cos(\text{pos} / 10000^{2i/d_{\text{model}}})

where pos is the position and i is the dimension. This allows the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
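
The sinusoidal encoding can be generated directly from the formulas above; the following is a minimal NumPy sketch (assuming an even d_model).

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions
    pe[:, 1::2] = np.cos(angles)    # odd dimensions
    return pe

# Added to the token embeddings at the bottom of the encoder and decoder stacks.
pe = sinusoidal_positional_encoding(max_len=50, d_model=512)   # (50, 512)
```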

Training Methodology

The model was trained on the WMT 2014 English-German dataset (~4.5 million sentence pairs) and the larger WMT 2014 English-French dataset (~36 million sentence pairs).

  • Optimizer: The Adam optimizer was used with β₁ = 0.9, β₂ = 0.98, and ε = 10⁻⁹.
  • Learning Rate: The learning rate was varied according to the formula:

    lrate = d_{\text{model}}^{-0.5} \cdot \min(\text{step\_num}^{-0.5},\ \text{step\_num} \cdot \text{warmup\_steps}^{-1.5})

    with warmup_steps = 4000. This increases the learning rate linearly for the first warmup_steps training steps and then decreases it proportionally to the inverse square root of the step number. (A short sketch of this schedule follows the list.)

  • Regularization: Two regularization techniques were employed:
    • Residual Dropout: Dropout (P_drop = 0.1) was applied to the output of each sub-layer, before it was added to the sub-layer input and normalized. Dropout was also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.
    • Label Smoothing: Label smoothing with ε_ls = 0.1 was used during training. This hurts perplexity but improves accuracy and BLEU score.
  • Hardware and Schedule: Training was performed on 8 NVIDIA P100 GPUs. The base model was trained for 100,000 steps (about 12 hours); the big model was trained for 300,000 steps (about 3.5 days).
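
The warmup-then-decay schedule from the Learning Rate bullet above is a simple function of the step number. A minimal sketch, using the paper's values of d_model = 512 and warmup_steps = 4000:

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)
    # Linear warmup for the first warmup_steps steps, then inverse-square-root decay.
    step_num = max(step_num, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# The peak learning rate occurs at step_num == warmup_steps.
for step in (100, 4000, 100000):
    print(step, transformer_lrate(step))
```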

Experimental Results

The Transformer demonstrated significant improvements in both translation quality and training efficiency compared to previous recurrent and convolutional models.

  • Machine Translation (WMT 2014):
    • English-to-German: Achieved a BLEU score of 28.4, outperforming the best previously reported models (including ensembles) by over 2 BLEU.
    • English-to-French: Established a new single-model state-of-the-art BLEU score of 41.8.
  • Training Cost: The models trained significantly faster than the alternatives. The big model reached its state-of-the-art BLEU scores after 3.5 days of training on 8 P100 GPUs, and even the base model surpassed previously published models and ensembles at a small fraction of their training cost (e.g., that of Google's GNMT). The paper reports total training compute of about 3.3 × 10^18 FLOPs for Transformer (base) versus 1.0 × 10^20 for SliceNet.
  • Parallelization: The reliance on attention mechanisms rather than recurrence allowed for significantly more parallelization during training, as computations within a layer could largely be performed simultaneously across sequence positions.
  • Generalization (English Constituency Parsing): The Transformer was also applied successfully to English constituency parsing on the WSJ dataset, achieving competitive results (an F1 of 91.3 in the limited-data, WSJ-only setting and 92.7 with semi-supervised training), demonstrating its applicability beyond machine translation.

The primary claimed advantages were superior translation quality, significantly enhanced parallelizability leading to reduced training times, and strong generalization capabilities. The self-attention mechanism allows modeling of long-range dependencies more directly than RNNs, while avoiding the sequential computation bottleneck.

Conclusion

The "Attention Is All You Need" paper introduced the Transformer, an architecture that fundamentally shifted the paradigm in sequence modeling away from recurrent networks towards pure attention mechanisms. Its design facilitated parallel computation, drastically reducing training times while simultaneously achieving superior performance on benchmark tasks like machine translation. The core components – multi-head self-attention, positional encodings, and position-wise feed-forward networks combined with residual connections and layer normalization – became foundational elements for subsequent LLMs and other sequence processing tasks.

Authors (8)
  1. Ashish Vaswani (23 papers)
  2. Noam Shazeer (37 papers)
  3. Niki Parmar (17 papers)
  4. Jakob Uszkoreit (23 papers)
  5. Llion Jones (16 papers)
  6. Aidan N. Gomez (16 papers)
  7. Lukasz Kaiser (40 papers)
  8. Illia Polosukhin (7 papers)
Citations (113,543)