
Frustratingly Short Attention Spans in Neural Language Modeling (1702.04521v1)

Published 15 Feb 2017 in cs.CL, cs.AI, cs.LG, and cs.NE

Abstract: Neural language models predict the next token using a latent representation of the immediate token history. Recently, various methods for augmenting neural language models with an attention mechanism over a differentiable memory have been proposed. For predicting the next token, these models query information from a memory of the recent history which can facilitate learning mid- and long-range dependencies. However, conventional attention mechanisms used in memory-augmented neural language models produce a single output vector per time step. This vector is used both for predicting the next token as well as for the key and value of a differentiable memory of a token history. In this paper, we propose a neural language model with a key-value attention mechanism that outputs separate representations for the key and value of a differentiable memory, as well as for encoding the next-word distribution. This model outperforms existing memory-augmented neural language models on two corpora. Yet, we found that our method mainly utilizes a memory of the five most recent output representations. This led to the unexpected main finding that a much simpler model based only on the concatenation of recent output representations from previous time steps is on par with more sophisticated memory-augmented neural language models.
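
The sketch below illustrates the kind of key-value attention the abstract describes: the RNN output at each step is projected into separate key, value, and prediction representations, the key attends over a short memory of recent values, and the attended context is combined with the prediction part before the output softmax. This is only a minimal reading of the abstract, not the authors' exact model; the separate linear projections, the window length of 5, and the tanh combination are illustrative assumptions.

```python
# Minimal sketch of key-value attention over a short memory window,
# based only on the abstract's description. Projection layers, the
# window length, and the tanh combination are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueAttention(nn.Module):
    def __init__(self, hidden_size: int, window: int = 5):
        super().__init__()
        self.window = window
        # Separate representations for the memory key, the memory value,
        # and the part used to encode the next-word distribution.
        self.key_proj = nn.Linear(hidden_size, hidden_size)
        self.value_proj = nn.Linear(hidden_size, hidden_size)
        self.predict_proj = nn.Linear(hidden_size, hidden_size)
        self.combine = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: (batch, time, hidden) RNN outputs.
        Returns (batch, time, hidden) representations for the output softmax."""
        keys = self.key_proj(h)
        values = self.value_proj(h)
        predict = self.predict_proj(h)
        outputs = []
        for t in range(h.size(1)):
            lo = max(0, t - self.window)
            if lo == t:
                # No history yet: fall back to the prediction representation.
                outputs.append(torch.tanh(predict[:, t]))
                continue
            mem_k = keys[:, lo:t]    # keys of the recent history
            mem_v = values[:, lo:t]  # matching values
            # Attention scores between the current key and past keys.
            scores = torch.einsum("bh,bth->bt", keys[:, t], mem_k)
            alpha = F.softmax(scores, dim=-1)
            context = torch.einsum("bt,bth->bh", alpha, mem_v)
            combined = self.combine(torch.cat([context, predict[:, t]], dim=-1))
            outputs.append(torch.tanh(combined))
        return torch.stack(outputs, dim=1)
```

The paper's simpler baseline that turns out to be competitive can be thought of as dropping the attention entirely and just concatenating the output representations of the last few time steps before the softmax.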

Authors (4)
  1. Michał Daniluk (7 papers)
  2. Tim Rocktäschel (86 papers)
  3. Johannes Welbl (20 papers)
  4. Sebastian Riedel (140 papers)
Citations (110)