One Wide Feedforward is All You Need (2309.01826v2)

Published 4 Sep 2023 in cs.CL and cs.AI

Abstract: The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder. Finally we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
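The following is a minimal sketch of the architectural idea described in the abstract: every encoder layer reuses the same FFN module (one shared instance rather than one per layer), and that shared FFN is widened to recover capacity. The class names, dimensions (e.g. d_ff_wide=8192), and the use of PyTorch are illustrative assumptions, not the paper's exact configuration; on the decoder side the corresponding change would simply be to drop the FFN sublayer.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer that reuses an externally provided (shared) FFN."""
    def __init__(self, d_model, n_heads, shared_ffn):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = shared_ffn  # the same module object is passed to every layer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)          # self-attention sublayer
        x = self.norm1(x + a)              # residual + norm
        x = self.norm2(x + self.ffn(x))    # shared FFN sublayer, residual + norm
        return x

class SharedFFNEncoder(nn.Module):
    """Encoder stack in which all layers share one wide FFN (illustrative)."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6, d_ff_wide=8192):
        super().__init__()
        # A single FFN instance, widened so total parameters roughly match
        # a stack of per-layer FFNs it replaces (width chosen arbitrarily here).
        shared_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff_wide),
            nn.ReLU(),
            nn.Linear(d_ff_wide, d_model),
        )
        self.layers = nn.ModuleList(
            EncoderLayer(d_model, n_heads, shared_ffn) for _ in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

if __name__ == "__main__":
    enc = SharedFFNEncoder()
    out = enc(torch.randn(2, 10, 512))  # (batch, seq_len, d_model)
    print(out.shape)                    # torch.Size([2, 10, 512])
```

Because the FFN weights are shared, the parameter count of the stack no longer grows with the number of layers' FFNs, which is what allows the single FFN to be made much wider at a comparable total size.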

Authors (4)
  1. Telmo Pessoa Pires (5 papers)
  2. António V. Lopes (4 papers)
  3. Yannick Assogba (7 papers)
  4. Hendra Setiawan (10 papers)
Citations (7)
