Attention Is All You Need (1706.03762v7)

Published 12 Jun 2017 in cs.CL and cs.LG

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

The paper "Attention Is All You Need" (Vaswani et al., 2017 ) introduced the Transformer architecture, a sequence transduction model entirely based on attention mechanisms, eschewing recurrence and convolution. This approach aimed to improve parallelization and reduce training time while achieving state-of-the-art results in tasks like machine translation.

Architecture Overview

The Transformer follows an encoder-decoder structure, common in sequence-to-sequence tasks. Both the encoder and decoder are composed of a stack of identical layers.

  • Encoder: The encoder stack consists of N = 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. All sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512. (A minimal sketch of this residual-plus-normalization pattern follows Figure 1.)
  • Decoder: The decoder stack also consists of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, residual connections and layer normalization are applied around each sub-layer. The self-attention sub-layer in the decoder is modified to prevent positions from attending to subsequent positions, preserving the auto-regressive property. This is achieved by masking out (setting to −∞) all values in the input of the softmax that correspond to illegal connections.

+---------------------+      +---------------------+
|      Output         |      |      Outputs        |
|    Probabilities    |      |   (shifted right)   |
+----------+----------+      +----------+----------+
           ^                         |
           |                         v
+----------+----------+      +----------+----------+
|   Linear + Softmax  |      |  Output Embedding   |
+----------+----------+      +----------+----------+
           ^                         |
           |                         v
+----------+----------+      +----------+----------+
|    Decoder Stack    |      |+ Positional Encoding|
|       (N layers)    |      +----------+----------+
| - Masked Multi-Head |                 ^
|   Self-Attention    |                 |
| - Multi-Head Attn   |                 |
|   (Encoder Output)  |                 |
| - Feed Forward      |                 |
| - Add & Norm        |                 |
+----------+----------+                 |
           ^----------------------------+
           |
+----------+----------+
|    Encoder Stack    |
|       (N layers)    |
| - Multi-Head        |
|   Self-Attention    |
| - Feed Forward      |
| - Add & Norm        |
+----------+----------+
           ^
           |
+----------+----------+
| Input Embedding + PE|
+----------+----------+
           ^
           |
+----------+----------+
|      Inputs         |
+---------------------+

                Figure 1: Transformer Model Architecture
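
The residual-plus-normalization pattern wrapped around every sub-layer can be made concrete in a few lines. The following is a minimal NumPy sketch of LayerNorm(x + Sublayer(x)), not the authors' implementation: `layer_norm` omits the learnable gain and bias of full layer normalization, and `sublayer` stands in for any attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    # (Full LayerNorm also has learnable gain/bias parameters, omitted here.)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection followed by layer normalization,
    # applied around every attention and feed-forward sub-layer in the model.
    return layer_norm(x + sublayer(x))

# Example on a (sequence_length, d_model) activation with an identity "sub-layer".
x = np.random.randn(10, 512)
y = add_and_norm(x, lambda h: h)   # shape (10, 512)
```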

Attention Mechanisms

The core of the Transformer lies in its attention mechanisms, specifically Scaled Dot-Product Attention and Multi-Head Attention.

Scaled Dot-Product Attention

The input consists of queries Q and keys K of dimension d_k, and values V of dimension d_v. The attention output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The formula is:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The dot products QK^T compute the compatibility scores. Scaling by 1/√d_k prevents the dot products from growing too large in magnitude, which could push the softmax function into regions with extremely small gradients.
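
As a rough illustration, the formula above can be written directly in NumPy. The sketch below is a simplified reading of the paper's description rather than its implementation; the optional `mask` argument shows how the decoder's illegal connections can be blocked before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        # Positions where mask is False are illegal; push them toward -inf.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)               # attention weights
    return weights @ V                               # weighted sum of the values

# Causal mask for decoder self-attention: position i attends only to positions <= i.
seq_len, d_k = 5, 64
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
Q = K = V = np.random.randn(seq_len, d_k)
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)  # shape (5, 64)
```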

Multi-Head Attention

Instead of performing a single attention function with d_model-dimensional keys, values, and queries, the authors found it beneficial to linearly project the queries, keys, and values h times with different, learned linear projections to d_k, d_k, and d_v dimensions, respectively. Attention is then performed in parallel on each of these projected versions. The outputs of the h heads are concatenated and once again projected, producing the final values.

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

where \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

The projection matrices are W_i^Q ∈ ℝ^(d_model × d_k), W_i^K ∈ ℝ^(d_model × d_k), W_i^V ∈ ℝ^(d_model × d_v), and W^O ∈ ℝ^(h·d_v × d_model). In the paper's implementation, h = 8 heads are used, and for each head d_k = d_v = d_model / h = 512 / 8 = 64. Because each head operates on a reduced dimension, the total computational cost is similar to that of single-head attention with full dimensionality.
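
A per-head view of these equations can be sketched as follows. This is a NumPy illustration with random weights, not the paper's code; a practical implementation would typically use fused projection matrices and batched tensor operations rather than an explicit Python loop over heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for a single head.
    d_k = Q.shape[-1]
    return softmax(Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h=8):
    # W_q, W_k: lists of h matrices (d_model, d_k); W_v: list of (d_model, d_v);
    # W_o: (h * d_v, d_model). Each head attends over its own projected subspace.
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o      # Concat(head_1..head_h) W^O

# Shapes from the paper: d_model = 512, h = 8, d_k = d_v = 64.
d_model, h, d_k = 512, 8, 64
rng = np.random.default_rng(0)
W_q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d_model))
x = rng.standard_normal((10, d_model))                       # (seq_len, d_model)
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, h)   # (10, 512)
```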

Applications of Attention in the Transformer

Multi-Head Attention is used in three distinct ways:

  1. Encoder Self-Attention: In the encoder layers, Q, K, and V all come from the output of the previous encoder layer, so each position in the encoder can attend to all positions in the previous layer of the encoder.
  2. Decoder Self-Attention: In the decoder layers, Q, K, and V all come from the output of the previous decoder layer. Self-attention is restricted (masked) so that each position can only attend to preceding positions (including itself), maintaining the auto-regressive property.
  3. Encoder-Decoder Attention: In the third sub-layer of the decoder, Q comes from the previous decoder layer, while K and V come from the output of the encoder stack. This allows every position in the decoder to attend over all positions in the input sequence. (A minimal sketch of this wiring follows the list.)
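
The three uses differ only in where Q, K, and V come from and whether a mask is applied. The sketch below is a schematic with a single attention head and random tensors, purely to make the wiring explicit; names such as `enc_in`, `enc_out`, and `dec_in` are illustrative and the linear projections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block illegal connections
    return softmax(scores) @ V

d_model, src_len, tgt_len = 512, 12, 7
rng = np.random.default_rng(0)
enc_in = rng.standard_normal((src_len, d_model))    # encoder layer input
dec_in = rng.standard_normal((tgt_len, d_model))    # decoder layer input

# 1. Encoder self-attention: Q = K = V = previous encoder layer output.
enc_out = attention(enc_in, enc_in, enc_in)

# 2. Decoder self-attention: Q = K = V = previous decoder layer output,
#    with a causal mask so position i only attends to positions <= i.
causal = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
dec_self = attention(dec_in, dec_in, dec_in, mask=causal)

# 3. Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
dec_cross = attention(dec_self, enc_out, enc_out)   # shape (tgt_len, d_model)
```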

Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network (FFN) that is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2

The dimensionality of the input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048. While the linear transformations are the same across different positions, they use different parameters from layer to layer.
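
This two-layer network can be sketched in a few lines. The snippet below is a NumPy illustration with random weights; in the model, W1, b1, W2, and b2 are learned separately for each layer.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently to each position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

x = rng.standard_normal((seq_len, d_model))
y = position_wise_ffn(x, W1, b1, W2, b2)   # shape (seq_len, d_model)
```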

Positional Encoding

Since the model contains no recurrence or convolution, positional information is injected using positional encodings added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, allowing the two to be summed. The paper uses sine and cosine functions of different frequencies:

PE_{(\text{pos}, 2i)} = \sin(\text{pos} / 10000^{2i/d_{\text{model}}})

PE_{(\text{pos}, 2i+1)} = \cos(\text{pos} / 10000^{2i/d_{\text{model}}})

where pos is the position and i is the dimension. This allows the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
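
The sinusoidal encoding can be generated directly from the formulas above; the following is a minimal NumPy sketch (assuming an even d_model).

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions
    pe[:, 1::2] = np.cos(angles)    # odd dimensions
    return pe

# Added to the token embeddings at the bottom of the encoder and decoder stacks.
pe = sinusoidal_positional_encoding(max_len=50, d_model=512)   # (50, 512)
```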

Training Methodology

The model was trained on the WMT 2014 English-German dataset (~4.5 million sentence pairs) and the larger WMT 2014 English-French dataset (~36 million sentence pairs).

  • Optimizer: The Adam optimizer was used with β₁ = 0.9, β₂ = 0.98, and ε = 10⁻⁹.
  • Learning Rate: The learning rate was varied according to the formula:

    lrate = d_{\text{model}}^{-0.5} \cdot \min(\text{step\_num}^{-0.5},\ \text{step\_num} \cdot \text{warmup\_steps}^{-1.5})

    with warmup_steps = 4000. This increases the learning rate linearly for the first warmup_steps training steps and then decreases it proportionally to the inverse square root of the step number. (A short sketch of this schedule follows the list.)

  • Regularization: Two regularization techniques were employed:
    • Residual Dropout: Dropout (P_drop = 0.1) was applied to the output of each sub-layer, before it was added to the sub-layer input and normalized. Dropout was also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.
    • Label Smoothing: Label smoothing with ε_ls = 0.1 was used during training. This hurts perplexity but improves accuracy and BLEU score.
  • Hardware and Schedule: Training was performed on 8 NVIDIA P100 GPUs. The base model was trained for 100,000 steps (about 12 hours); the big model was trained for 300,000 steps (about 3.5 days).
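
The warmup-then-decay schedule from the Learning Rate bullet above is a simple function of the step number. A minimal sketch, using the paper's values of d_model = 512 and warmup_steps = 4000:

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)
    # Linear warmup for the first warmup_steps steps, then inverse-square-root decay.
    step_num = max(step_num, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# The peak learning rate occurs at step_num == warmup_steps.
for step in (100, 4000, 100000):
    print(step, transformer_lrate(step))
```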

Experimental Results

The Transformer demonstrated significant improvements in both translation quality and training efficiency compared to previous recurrent and convolutional models.

  • Machine Translation (WMT 2014):
    • English-to-German: Achieved a BLEU score of 28.4, outperforming the best previously reported models (including ensembles) by over 2 BLEU.
    • English-to-French: Established a new single-model state-of-the-art BLEU score of 41.8.
  • Training Cost: The models trained significantly faster than the alternatives. The big model reached its state-of-the-art BLEU scores after 3.5 days of training on 8 P100 GPUs, and even the base model surpassed previously published models and ensembles at a small fraction of their training cost (e.g., that of Google's GNMT). The paper reports total training compute of about 3.3 × 10^18 FLOPs for Transformer (base) versus 1.0 × 10^20 for SliceNet.
  • Parallelization: The reliance on attention mechanisms rather than recurrence allowed for significantly more parallelization during training, as computations within a layer could largely be performed simultaneously across sequence positions.
  • Generalization (English Constituency Parsing): The Transformer was also applied successfully to English constituency parsing on the WSJ dataset, achieving competitive results (an F1 of 91.3 in the limited-data, WSJ-only setting and 92.7 with semi-supervised training), demonstrating its applicability beyond machine translation.

The primary claimed advantages were superior translation quality, significantly enhanced parallelizability leading to reduced training times, and strong generalization capabilities. The self-attention mechanism allows modeling of long-range dependencies more directly than RNNs, while avoiding the sequential computation bottleneck.

Conclusion

The "Attention Is All You Need" paper introduced the Transformer, an architecture that fundamentally shifted the paradigm in sequence modeling away from recurrent networks towards pure attention mechanisms. Its design facilitated parallel computation, drastically reducing training times while simultaneously achieving superior performance on benchmark tasks like machine translation. The core components – multi-head self-attention, positional encodings, and position-wise feed-forward networks combined with residual connections and layer normalization – became foundational elements for subsequent LLMs and other sequence processing tasks.

Authors (8)
  1. Ashish Vaswani (23 papers)
  2. Noam Shazeer (37 papers)
  3. Niki Parmar (17 papers)
  4. Jakob Uszkoreit (23 papers)
  5. Llion Jones (16 papers)
  6. Aidan N. Gomez (16 papers)
  7. Lukasz Kaiser (40 papers)
  8. Illia Polosukhin (7 papers)
Citations (113,543)