
Attention Is All You Need (1706.03762v7)

Published 12 Jun 2017 in cs.CL and cs.LG

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.


Summary

  • The paper presents the Transformer architecture that replaces recurrence with self-attention, enabling improved parallelization and efficiency.
  • It introduces multi-head attention and positional encoding to capture long-range dependencies and diverse relationships in sequence transduction tasks.
  • Experimental results show state-of-the-art BLEU scores on WMT 2014 tasks, validating the model's effectiveness in machine translation and parsing.

The Transformer Network and the "Attention Is All You Need" Paper

The paper "Attention Is All You Need" (1706.03762) introduces the Transformer, a novel neural network architecture that relies entirely on attention mechanisms for sequence transduction tasks, moving away from the dominant recurrent and convolutional approaches. This architectural shift enables greater parallelization and reduced training times, while achieving state-of-the-art results in machine translation.

Core Architectural Innovations

The Transformer architecture (Figure 1) abandons recurrence and convolutions in favor of self-attention mechanisms. It comprises an encoder and a decoder, both built from stacked layers. The encoder maps an input sequence to a sequence of continuous representations, and the decoder generates an output sequence one element at a time, auto-regressively.

Figure 1: The Transformer model architecture, showcasing encoder and decoder stacks.

Encoder and Decoder Stacks

Both the encoder and decoder consist of N = 6 identical layers. Each encoder layer includes a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Residual connections and layer normalization are applied around each sub-layer. The decoder mirrors this structure but includes an additional sub-layer that performs multi-head attention over the encoder's output. Masking is used in the decoder's self-attention sub-layer to prevent attending to subsequent positions, maintaining the auto-regressive property.
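
To make the decoder masking concrete, here is a minimal NumPy sketch (illustrative only, not the paper's code) of the additive mask that blocks attention to subsequent positions; entries set to -inf become zero attention weights after the softmax:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Additive mask: 0 where attending is allowed, -inf for future positions."""
    # True above the diagonal marks positions that lie in the future.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

# Example for a length-4 sequence: row i shows which positions query i may attend to.
print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```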

Scaled Dot-Product Attention

The core of the Transformer is the Scaled Dot-Product Attention mechanism (Figure 2), which computes attention weights by taking the dot products of queries and keys, dividing them by √d_k (where d_k is the dimension of the keys), and applying a softmax. This scaling counteracts the tendency of large dot products, which grow in magnitude with d_k, to push the softmax into regions with extremely small gradients.

Figure 2: Illustration of Scaled Dot-Product Attention (left) and Multi-Head Attention (right).
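
A compact NumPy sketch of this computation, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, assuming 2-D query, key, and value matrices (names and shapes are illustrative rather than taken from the paper's released code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_queries, n_keys) compatibility scores
    if mask is not None:
        scores = scores + mask                 # additive mask, e.g. -inf on future positions
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted sum of values, (n_queries, d_v)

# Toy example: 3 positions with d_k = d_v = 4.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 3, 4))       # unpacks into three (3, 4) matrices
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```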

Multi-Head Attention

Multi-Head Attention (Figure 2) extends the Scaled Dot-Product Attention by linearly projecting the queries, keys, and values h times with different learned linear projections. This allows the model to attend to information from different representation subspaces, capturing more diverse dependencies. The outputs of the parallel attention heads are concatenated and projected to produce the final output. The paper uses h = 8 parallel attention layers.
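
Building on the sketch above, a rough illustration of multi-head attention with h = 8 and d_model = 512 (the paper's base configuration); the projection matrices here are random placeholders standing in for learned parameters:

```python
import numpy as np

# Reuses scaled_dot_product_attention from the sketch above.
def multi_head_attention(Q, K, V, h=8, d_model=512, rng=None):
    """Project Q/K/V into h subspaces, attend in each head, concatenate, project back."""
    if rng is None:
        rng = np.random.default_rng(0)
    d_k = d_v = d_model // h                    # 64 per head for the base model
    heads = []
    for _ in range(h):
        # Per-head projections: random placeholders here, learned in the real model.
        W_q = 0.02 * rng.standard_normal((d_model, d_k))
        W_k = 0.02 * rng.standard_normal((d_model, d_k))
        W_v = 0.02 * rng.standard_normal((d_model, d_v))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = 0.02 * rng.standard_normal((h * d_v, d_model))    # final output projection
    return np.concatenate(heads, axis=-1) @ W_o              # shape (n, d_model)

# Self-attention over 5 tokens: queries, keys, and values all come from the same input.
X = np.random.default_rng(1).standard_normal((5, 512))
print(multi_head_attention(X, X, X).shape)  # (5, 512)
```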

Positional Encoding

To incorporate information about the order of tokens in the sequence, the Transformer employs positional encodings. Sine and cosine functions of different frequencies are added to the input embeddings. This allows the model to leverage relative positional information.
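
Concretely, the paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); a small sketch that generates these encodings (variable names are illustrative):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings, one d_model-dimensional vector per position."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    div = 10000.0 ** (np.arange(0, d_model, 2) / d_model)   # one frequency per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)                   # even indices: sine
    pe[:, 1::2] = np.cos(positions / div)                   # odd indices: cosine
    return pe

# The encodings are added to the token embeddings before the first layer.
token_embeddings = np.zeros((10, 512))                      # placeholder embeddings, 10 tokens
inputs = token_embeddings + positional_encoding(10, 512)
```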

Advantages of Self-Attention

The paper argues that self-attention offers several advantages over recurrent and convolutional layers for sequence transduction:

  • Parallelization: Self-attention allows for more parallelization than recurrent layers, as it does not require sequential computation along the symbol positions.
  • Computational Complexity: Self-attention layers have lower computational complexity than recurrent layers when the sequence length n is smaller than the representation dimensionality d, a common scenario in machine translation (see the per-layer cost comparison after this list).
  • Long-Range Dependencies: Self-attention reduces the path length between long-range dependencies in the network, making it easier to learn these dependencies.
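
For reference, the per-layer costs compared in the paper (n is the sequence length, d the representation dimension, k the convolution kernel width), listed as complexity per layer, minimum sequential operations, and maximum path length:

```latex
\[
\begin{array}{lccc}
\text{Self-attention}  & O(n^2 \cdot d)          & O(1) & O(1)        \\
\text{Recurrent}       & O(n \cdot d^2)          & O(n) & O(n)        \\
\text{Convolutional}   & O(k \cdot n \cdot d^2)  & O(1) & O(\log_k n) \\
\end{array}
\]
```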

Training Details

The Transformer models were trained on the WMT 2014 English-German and English-French datasets. Byte-pair encoding was used to encode the sentences. The Adam optimizer was used with a learning rate schedule that increases linearly for the first warmup_steps training steps and then decreases proportionally to the inverse square root of the step number. Regularization techniques, including residual dropout and label smoothing, were employed to prevent overfitting.
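
This schedule corresponds to the paper's formula lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)); a brief sketch with the base-model defaults:

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warmup for warmup_steps steps, then inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps.
print(round(transformer_lr(4000), 6))  # ~0.0007 for d_model=512, warmup_steps=4000
```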

Experimental Results

The Transformer achieved state-of-the-art results on the WMT 2014 English-to-German translation task, outperforming previous models by more than 2.0 BLEU. It also established a new single-model state-of-the-art BLEU score on the WMT 2014 English-to-French translation task. The models were trained in significantly less time than previous state-of-the-art models. The paper also demonstrates the Transformer's generalization ability by applying it to English constituency parsing, achieving competitive results.

Analysis of Attention Mechanisms

The paper provides visualizations of the attention distributions learned by the Transformer models (Figures 3, 4, and 5). These visualizations reveal that different attention heads learn to perform different tasks, and many appear to capture syntactic and semantic relationships in the sentences. For example, some heads attend to long-distance dependencies, while others seem to be involved in anaphora resolution.

Figure 3: An example of the attention mechanism following long-distance dependencies.

Figure 4: Two attention heads involved in anaphora resolution.

Figure 5: Examples of attention heads exhibiting behavior related to sentence structure.

Impact and Future Directions

The Transformer architecture has had a significant impact on the field of NLP, paving the way for models like BERT, GPT, and other LLMs. The paper identifies several promising directions for future research, including extending the Transformer to other tasks and modalities, investigating local attention mechanisms for handling large inputs and outputs, and reducing the sequentiality of generation.

Conclusion

The "Attention Is All You Need" paper introduced the Transformer, a novel and highly influential neural network architecture that relies entirely on attention mechanisms. The Transformer's ability to parallelize computation, its reduced computational complexity, and its effectiveness in capturing long-range dependencies have made it a cornerstone of modern NLP. The paper's findings have broad implications for sequence modeling and transduction tasks, and it has spurred a significant amount of research in attention-based models.


Explain it Like I'm 14

“Attention Is All You Need” — A Simple, Teen-Friendly Guide

What is this paper about?

This paper introduces a new way for computers to understand and generate language, called the Transformer. It’s a kind of AI model that uses “attention” to focus on the most important parts of a sentence. The big idea: you don’t need older, slower methods that read words one-by-one or slide over them. With attention alone, you can get better results, faster.

What questions were the researchers trying to answer?

Here are the main things they wanted to figure out:

  • Can a model that uses only attention (no older “step-by-step” or “sliding window” parts) work well for language tasks like translation?
  • Is this attention-only model faster to train and easier to run in parallel on modern computers?
  • Does it beat the best existing translation systems?
  • Can it work well on other language tasks too, not just translation?

How does their approach work? (In everyday language)

Think of reading a sentence like this: “The cat, which was very fluffy, jumped over the fence.” To understand “cat,” your brain might pay extra attention to “fluffy” and “jumped” because they tell you something important. That “paying attention” is the key idea.

  • Attention: It’s like a spotlight that shines on the most relevant words. Each word looks at other words and decides which ones matter most to understand the meaning.
  • Self-attention: Each word in a sentence looks at every other word in the same sentence. It’s like everyone in a group discussing who influences whom, all at once.
  • Multi-head attention: Imagine using several highlighters at the same time, each looking for a different kind of pattern (grammar, long-distance links, names, etc.). Then you combine what they found.
  • Encoder and decoder: The model has two main parts:
    • Encoder = a smart reader that turns the input sentence (like English) into a rich internal meaning.
    • Decoder = a smart writer that uses that meaning to produce the output sentence (like German), one word at a time, while still using attention to look back at the input.
  • Positional encoding: Since the Transformer doesn’t read words strictly in order, it needs to know where each word is in the sentence. It adds a kind of “position tag” (based on waves like sine and cosine) so it still understands word order.

Why is this better than older methods?

  • Older “recurrent” models read one word after another (like a person reading out loud). That makes them slow and hard to run in parallel.
  • The Transformer looks at all words at once, so it can be much faster on modern hardware and handle long-distance relationships between words more easily.

What did they find, and why does it matter?

The researchers tested the Transformer on big translation tasks and got state-of-the-art results:

  • English → German: Score of 28.4 BLEU (a standard translation quality score). This beat the best previous systems, even groups (ensembles) of models.
  • English → French: Score of 41.8 BLEU, also the best for a single model at the time.

And it trained fast:

  • The “base” Transformer reached strong results after about 12 hours on 8 GPUs.
  • The larger “big” model trained for about 3.5 days on 8 GPUs and set new records.

It also worked well on another task (English constituency parsing), showing the idea is flexible, not just for translation.

Why this is important:

  • Better quality translations with less training time means cheaper, faster AI development.
  • The approach is simpler and easier to scale because it runs well in parallel.
  • It handles long sentences and long-range word relationships more naturally.

What’s the bigger impact?

The Transformer opened the door to many of today’s most powerful language systems. Its attention-based design has become the foundation for models used in translation, summarization, chatbots, coding assistants, and more. Because it’s fast and scalable, it allows researchers and companies to build larger, smarter models that can understand and generate human language with impressive accuracy.


Glossary

  • Adam optimizer: An adaptive gradient-based optimization algorithm that maintains per-parameter estimates of first and second moments to adjust learning rates. "We used the Adam optimizer~\citep{kingma2014adam} with \beta_1=0.9, \beta_2=0.98 and \epsilon=10^{-9}."
  • Additive attention: An attention mechanism that computes query–key compatibility via a small feed-forward network rather than pure dot products. "Additive attention computes the compatibility function using a feed-forward network with a single hidden layer."
  • Auto-regressive: A property of sequence models that generate each output token conditioned on previously generated tokens. "At each step the model is auto-regressive \citep{graves2013generating}, consuming the previously generated symbols as additional input when generating the next."
  • Beam search: A heuristic decoding algorithm that tracks the top-k partial hypotheses to approximate the best output sequence. "We used beam search with a beam size of $4$ and length penalty \alpha=0.6 \citep{wu2016google}."
  • BerkeleyParser: A statistical constituency parser based on latent-variable PCFGs commonly used as a baseline in parsing tasks. "In contrast to RNN sequence-to-sequence models \citep{KVparse15}, the Transformer outperforms the BerkeleyParser \cite{petrov-EtAl:2006:ACL} even when training only on the WSJ training set of 40K sentences."
  • BLEU: A machine translation evaluation metric measuring n-gram overlap between system outputs and references. "Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU."
  • Byte-pair encoding: A subword tokenization method that iteratively merges frequent symbol pairs to form a compact vocabulary. "Sentences were encoded using byte-pair encoding \citep{DBLP:journals/corr/BritzGLL17}, which has a shared source-target vocabulary of about 37000 tokens."
  • ByteNet: A convolutional sequence model for neural machine translation with logarithmic dependency paths via dilated convolutions. "The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU \citep{extendedngpu}, ByteNet \citep{NalBytenet2017} and ConvS2S \citep{JonasFaceNet2017}, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions."
  • Convolutional neural networks: Neural architectures that process data via convolution operations, enabling parallel computation over sequence positions. "all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions."
  • Constituency parsing: The task of producing phrase-structure trees that capture the hierarchical organization of sentences. "To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing."
  • ConvS2S: A convolutional sequence-to-sequence model that replaces recurrent layers with convolutions. "The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU \citep{extendedngpu}, ByteNet \citep{NalBytenet2017} and ConvS2S \citep{JonasFaceNet2017}, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions."
  • Decoder: The component of an encoder–decoder model that generates the output sequence token by token. "Given \mathbf{z}, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time."
  • Dilated convolutions: Convolution operations with spaced kernels that enlarge receptive fields without increasing parameter count. "or O(log_k(n)) in the case of dilated convolutions \citep{NalBytenet2017}, increasing the length of the longest paths between any two positions in the network."
  • Dropout: A regularization technique that randomly zeroes activations during training to reduce overfitting. "We apply dropout \citep{srivastava2014dropout} to the output of each sub-layer, before it is added to the sub-layer input and normalized."
  • Encoder: The component that maps an input token sequence to a sequence of continuous vector representations. "Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations \mathbf{z} = (z_1, ..., z_n)."
  • Encoder-decoder attention: Cross-attention where decoder queries attend to encoder outputs to condition generation on the source. "In 'encoder-decoder attention' layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder."
  • End-to-end memory networks: Neural architectures that rely on attention over a memory component rather than sequence-aligned recurrence. "End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks \citep{sukhbaatar2015}."
  • Ensemble: A combination of multiple trained models whose predictions are aggregated to improve accuracy. "Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU."
  • Feed-forward network: A fully connected network applied independently at each position to transform representations. "The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network."
  • FLOPs: Floating-point operations; a measure of computational training cost. "We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU"
  • Label smoothing: A regularization method that softens target distributions by distributing small probability mass to non-target classes. "During training, we employed label smoothing of value \epsilon_{ls}=0.1 \citep{DBLP:journals/corr/SzegedyVISW15}."
  • Layer normalization: A normalization technique applied across features in a layer to stabilize and accelerate training. "We employ a residual connection \citep{he2016deep} around each of the two sub-layers, followed by layer normalization \cite{layernorm2016}."
  • Length penalty: A term in sequence decoding that adjusts scores to favor or disfavor longer outputs. "We used beam search with a beam size of $4$ and length penalty \alpha=0.6 \citep{wu2016google}."
  • Long short-term memory (LSTM): A gated recurrent architecture designed to learn long-range dependencies by mitigating vanishing gradients. "Recurrent neural networks, long short-term memory \citep{hochreiter1997} and gated recurrent \citep{gruEval14} neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems"
  • Mixture-of-Experts (MoE): A model that routes inputs to a subset of expert networks (often sparsely) to increase capacity efficiently. "MoE \citep{shazeer2017outrageously}"
  • Multi-head attention: Parallel attention mechanisms that project queries, keys, and values into multiple subspaces to capture diverse relations. "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions."
  • Penn Treebank: A widely used annotated corpus of English text with syntactic parses. "We trained a 4-layer transformer with d_{model} = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank \citep{marcus1993building}, about 40K training sentences."
  • Perplexity: A measure of language-model uncertainty; lower values indicate better predictive performance. "This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score."
  • Positional encodings: Deterministic or learned vectors added to token embeddings to inject sequence order information. "To this end, we add 'positional encodings' to the input embeddings at the bottoms of the encoder and decoder stacks."
  • Residual connection: A skip connection that adds the input of a sub-layer to its output to ease optimization of deep networks. "We employ a residual connection \citep{he2016deep} around each of the two sub-layers, followed by layer normalization"
  • Restricted self-attention: Self-attention constrained to a local window to reduce complexity for very long sequences. "self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position."
  • Scaled Dot-Product Attention: An attention mechanism that scales query–key dot products by 1/\sqrt{d_k} before softmax to stabilize gradients. "We call our particular attention 'Scaled Dot-Product Attention' (Figure~\ref{fig:multi-head-att})."
  • Self-attention: An attention mechanism that relates positions within the same sequence to compute context-aware representations. "Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence."
  • Separable convolutions: Factorized convolutions that reduce computational cost by separating spatial and channel-wise processing. "Separable convolutions \citep{xception2016}, however, decrease the complexity considerably, to O(k \cdot n \cdot d + n \cdot d^2)."
  • Softmax: A normalization function that converts a vector of logits into a probability distribution. "apply a softmax function to obtain the weights on the values."
  • Tensor2Tensor: A software library and repository of deep learning models and datasets used in the paper’s experiments. "The code we used to train and evaluate our models is available at \url{https://github.com/tensorflow/tensor2tensor}."
  • Warmup steps: An initial training phase where the learning rate increases linearly before decaying, to stabilize optimization. "We used warmup\_steps=4000."
  • Word-piece: A subword vocabulary construction method often used in neural machine translation. "split tokens into a 32000 word-piece vocabulary \citep{wu2016google}."
  • WMT: The Workshop on Machine Translation benchmark suite used to evaluate translation systems. "On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table~\ref{tab:wmt-results}) outperforms the best previously reported models (including ensembles) by more than $2.0$ BLEU"
  • WSJ: The Wall Street Journal section of the Penn Treebank commonly used for parsing benchmarks. "We trained a 4-layer transformer with d_{model} = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank \citep{marcus1993building}, about 40K training sentences."


