
Neural Turing Machines (1410.5401v2)

Published 20 Oct 2014 in cs.NE

Abstract: We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.

Citations (2,260)

Summary

  • The paper introduces a novel neural network architecture that couples a controller with an external memory for algorithm execution.
  • It details an end-to-end differentiable design using content-based and location-based addressing for efficient read and write operations.
  • Experiments on copy, repeat copy, and associative recall tasks show NTMs outperform LSTMs in long-term memory and generalization.

The paper introduces the Neural Turing Machine (NTM), a novel neural network architecture augmented with external memory, drawing inspiration from Turing machines and cognitive models of working memory. The NTM architecture consists of a neural network controller coupled with an external memory bank, enabling it to learn and execute algorithms. The key innovation is the end-to-end differentiability of the entire system, allowing for efficient training via gradient descent.

The NTM architecture comprises two primary components: a neural network controller, which can be either a feedforward network or a recurrent network such as an LSTM (Long Short-Term Memory), and a memory matrix. The controller interacts with the external environment through input and output vectors and accesses the memory matrix using read and write heads. These heads emit parameters that define normalized weightings over memory locations, determining the extent to which each location is read from or written to.
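As a rough illustration of this controller-memory cycle (not the paper's implementation), the sketch below shows one time step with a single read head and a single write head. `controller_forward` and `emit_head_params` are hypothetical stand-ins for the controller network and the head parameter layers; the addressing hidden inside `emit_head_params` is detailed in the sections that follow.

```python
import numpy as np

# Simplified sketch of one NTM time step (assumed shapes and names).
N, M = 128, 20                       # number of memory locations, vector size per location

def ntm_step(x_t, memory, prev_read, prev_w, controller_forward, emit_head_params):
    # The controller sees the external input together with the last read vector.
    h_t = controller_forward(np.concatenate([x_t, prev_read]))
    # Heads emit normalised weightings over the N locations, plus erase/add
    # vectors for the write head and the external output.
    read_w, write_w, erase_v, add_v, y_t = emit_head_params(h_t, memory, prev_w)
    # Read: convex combination of memory rows (read equation below).
    read_vec = read_w @ memory                              # shape (M,)
    # Write: erase, then add (write equations below).
    memory = memory * (1.0 - np.outer(write_w, erase_v))
    memory = memory + np.outer(write_w, add_v)
    return y_t, memory, read_vec, {"read": read_w, "write": write_w}
```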

The read operation involves computing a weighted sum of the memory rows, where the weights are determined by the read head's weighting vector $\mathbf{w}_t$. The read vector $\mathbf{r}_t$ is calculated as:

$$\mathbf{r}_t = \sum_i w_t(i)\, \mathbf{M}_t(i)$$

  • $\mathbf{r}_t$ is the read vector at time $t$
  • $w_t(i)$ is the weighting for memory location $i$ at time $t$
  • $\mathbf{M}_t(i)$ is the row vector at memory location $i$ at time $t$
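In code, the read reduces to a single matrix-vector product. A minimal NumPy sketch, with assumed shapes:

```python
import numpy as np

# Sketch of the read equation: the read vector is a convex combination
# of the memory rows under the head's weighting.
def read(memory, w):
    """memory: (N, M) matrix M_t; w: (N,) normalised weighting w_t."""
    return w @ memory               # r_t = sum_i w_t(i) M_t(i), shape (M,)
```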

The write operation is decomposed into an erase followed by an add, inspired by the gating mechanisms in LSTMs. The erase operation modifies the memory content based on the write head's weighting $\mathbf{w}_t$ and an erase vector $\mathbf{e}_t$:

$$\tilde{\mathbf{M}}_t(i) = \mathbf{M}_{t-1}(i)\,\big[\mathbf{1} - w_t(i)\,\mathbf{e}_t\big]$$

  • $\tilde{\mathbf{M}}_t(i)$ is the intermediate memory vector at location $i$ after the erase operation at time $t$
  • $\mathbf{M}_{t-1}(i)$ is the memory vector at location $i$ at time $t-1$
  • $w_t(i)$ is the weighting for memory location $i$ at time $t$
  • $\mathbf{e}_t$ is the erase vector at time $t$
  • $\mathbf{1}$ is a vector of all ones

The add operation then adds a weighted version of the add vector $\mathbf{a}_t$ to the memory:

$$\mathbf{M}_t(i) = \tilde{\mathbf{M}}_t(i) + w_t(i)\,\mathbf{a}_t$$

  • $\mathbf{M}_t(i)$ is the updated memory vector at location $i$ at time $t$
  • $\tilde{\mathbf{M}}_t(i)$ is the intermediate memory vector at location $i$ after the erase operation at time $t$
  • $w_t(i)$ is the weighting for memory location $i$ at time $t$
  • $\mathbf{a}_t$ is the add vector at time $t$
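A minimal NumPy sketch of the combined erase-then-add update, applied to all rows at once (shapes are assumptions; the erase vector has elements in (0, 1), the add vector is unrestricted):

```python
import numpy as np

# Sketch of the write: erase then add, vectorised over the N memory rows.
def write(memory, w, e, a):
    """memory: (N, M); w: (N,) weighting; e: (M,) erase vector; a: (M,) add vector."""
    erased = memory * (1.0 - np.outer(w, e))   # M~_t(i) = M_{t-1}(i)[1 - w_t(i) e_t]
    return erased + np.outer(w, a)             # M_t(i)  = M~_t(i) + w_t(i) a_t
```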

The paper details two complementary addressing mechanisms: content-based addressing and location-based addressing. Content-based addressing focuses on memory locations whose content is similar to a key vector $\mathbf{k}_t$ emitted by the head. The similarity is quantified using a similarity measure $K[\cdot, \cdot]$, such as cosine similarity:

$$K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}$$

  • $\mathbf{u}$ and $\mathbf{v}$ are the vectors whose similarity is being computed

The content-based weighting $w^c_t(i)$ is then computed as:

$$w^c_t(i) = \frac{\exp\big(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(i)]\big)}{\sum_j \exp\big(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(j)]\big)}$$

  • $w^c_t(i)$ is the content-based weighting for location $i$ at time $t$
  • $\beta_t$ is the key strength at time $t$
  • $K[\cdot, \cdot]$ is the similarity measure
  • $\mathbf{k}_t$ is the key vector at time $t$
  • $\mathbf{M}_t(i)$ is the memory vector at location $i$ at time $t$
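Together these two equations amount to a softmax over scaled cosine similarities. A minimal NumPy sketch (the epsilon guard against zero-norm rows and the max-subtraction are implementation conveniences, not from the paper):

```python
import numpy as np

# Sketch of content-based addressing: cosine similarity between the key and
# each memory row, scaled by the key strength beta, then normalised by softmax.
def content_weighting(memory, key, beta, eps=1e-8):
    """memory: (N, M); key: (M,); beta: positive scalar key strength."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    z = beta * sims
    z = z - z.max()                  # stabilise the softmax numerically
    w = np.exp(z)
    return w / w.sum()
```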

Location-based addressing, on the other hand, facilitates iteration and random-access jumps across memory locations. It involves an interpolation gate $g_t$, a shift weighting $\mathbf{s}_t$, and a sharpening parameter $\gamma_t$. The interpolation gate blends the content-based weighting with the previous weighting:

$$\mathbf{w}^g_t = g_t\,\mathbf{w}^c_t + (1 - g_t)\,\mathbf{w}_{t-1}$$

  • $\mathbf{w}^g_t$ is the gated weighting at time $t$
  • $g_t$ is the interpolation gate at time $t$
  • $\mathbf{w}^c_t$ is the content-based weighting at time $t$
  • $\mathbf{w}_{t-1}$ is the weighting at the previous time step

The shift weighting $\mathbf{s}_t$ defines a distribution over the allowed integer shifts, and a circular convolution is applied to rotate the gated weighting:

$$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w^g_t(j)\, s_t(i-j)$$

  • $\tilde{w}_t(i)$ is the rotated weighting for location $i$ at time $t$
  • $w^g_t(j)$ is the gated weighting for location $j$ at time $t$
  • $s_t(i-j)$ is the shift weighting for the shift from location $j$ to location $i$ at time $t$

Finally, the sharpening parameter $\gamma_t$ sharpens the weighting:

$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$$

  • $w_t(i)$ is the final weighting for location $i$ at time $t$
  • $\tilde{w}_t(i)$ is the rotated weighting for location $i$ at time $t$
  • $\gamma_t$ is the sharpening parameter at time $t$
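Putting the three location-based steps together (interpolation, circular shift, sharpening), a minimal NumPy sketch; here the shift distribution `s` is assumed to be given over all $N$ positions and indexed modulo $N$, which matches the convolution as written:

```python
import numpy as np

# Sketch of location-based addressing: gate against the previous weighting,
# rotate by circular convolution with the shift distribution, then sharpen.
def location_weighting(w_c, w_prev, g, s, gamma):
    """w_c, w_prev, s: (N,) vectors; g: gate in [0, 1]; gamma: sharpening >= 1."""
    w_g = g * w_c + (1.0 - g) * w_prev                      # interpolation
    N = len(w_g)
    w_tilde = np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                        for i in range(N)])                 # circular convolution
    w_sharp = w_tilde ** gamma                              # sharpening
    return w_sharp / w_sharp.sum()
```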

The paper presents several experiments to evaluate the NTM's ability to learn algorithms. These include:

  • Copy Task: The NTM learns to copy sequences of binary vectors, demonstrating its ability to store and recall information over long time periods. The NTM significantly outperforms LSTM in this task and generalizes well to longer sequences than seen during training.
  • Repeat Copy Task: The NTM learns to repeat a sequence of binary vectors a specified number of times, showcasing its ability to learn nested functions or "for loops". While both NTM and LSTM can solve the task, NTM generalizes better to longer sequences and a higher number of repetitions.
  • Associative Recall: The NTM learns to recall the next item in a sequence when queried with a previous item, demonstrating its capability for learning indirection and managing more complex data structures. NTM learns significantly faster than LSTM and generalizes much better to longer sequences.
  • Dynamic N-Grams: The NTM is tested on its ability to rapidly adapt to new predictive distributions by emulating a conventional N-Gram model. The NTM achieves a small but significant performance advantage over LSTM.
  • Priority Sort: The NTM learns to sort a sequence of binary vectors based on their priority ratings, demonstrating its ability to perform elementary algorithms. The NTM with both feedforward and LSTM controllers substantially outperforms LSTM on this task.

Across these experiments, the NTM consistently demonstrates superior performance compared to LSTM, particularly in tasks requiring long-term memory, algorithmic reasoning, and generalization. The paper also analyzes the internal operations of the NTM, revealing that it learns compact internal programs and utilizes its memory in a manner analogous to how a human programmer would approach similar tasks.
