Gated recurrent neural networks discover attention (2309.01775v2)

Published 4 Sep 2023 in cs.LG and cs.NE

Abstract: Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
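The core idea of the abstract — that a linear recurrent layer combined with multiplicative gating can exactly reproduce (linear) self-attention — can be illustrated with a small numerical check. The sketch below is not the paper's construction; it only shows the equivalence between causal linear attention and a gated linear RNN whose state accumulates key-value outer products. All names (`d_k`, `d_v`, `seq_len`) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's exact construction):
# causal linear self-attention  y_t = (sum_{s<=t} v_s k_s^T) q_t
# computed two equivalent ways.

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 6, 4, 3          # illustrative dimensions
q = rng.normal(size=(seq_len, d_k))  # queries
k = rng.normal(size=(seq_len, d_k))  # keys
v = rng.normal(size=(seq_len, d_v))  # values

# 1) Written as attention: dot-product scores, causal mask, no softmax.
scores = q @ k.T                                  # (seq_len, seq_len)
mask = np.tril(np.ones((seq_len, seq_len)))       # causal mask
y_attn = (scores * mask) @ v

# 2) Written as a gated linear RNN: the recurrent state S_t accumulates
#    outer products v_s k_s^T (a multiplicative interaction between inputs),
#    and the readout multiplies the state with the current query (gating).
S = np.zeros((d_v, d_k))
y_rnn = np.zeros((seq_len, d_v))
for t in range(seq_len):
    S = S + np.outer(v[t], k[t])   # linear recurrence with identity decay
    y_rnn[t] = S @ q[t]            # multiplicative readout

assert np.allclose(y_attn, y_rnn)
print("linear attention == gated linear RNN:", np.allclose(y_attn, y_rnn))
```

Both branches compute the same sequence of outputs, which is the sense in which a linear recurrence plus multiplicative interactions suffices to realize linear self-attention.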

Authors (7)
  1. Nicolas Zucchet (11 papers)
  2. Seijin Kobayashi (16 papers)
  3. Yassir Akram (7 papers)
  4. Johannes von Oswald (21 papers)
  5. Maxime Larcher (8 papers)
  6. Angelika Steger (33 papers)
  7. João Sacramento (27 papers)
Citations (7)