
Neural Turing Machines (1410.5401v2)

Published 20 Oct 2014 in cs.NE

Abstract: We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.

Citations (2,260)

Summary

  • The paper introduces a novel neural network architecture that couples a controller with an external memory for algorithm execution.
  • It details an end-to-end differentiable design using content-based and location-based addressing for efficient read and write operations.
  • Experiments on copy, repeat copy, and associative recall tasks show NTMs outperform LSTMs in long-term memory and generalization.

The paper introduces the Neural Turing Machine (NTM), a novel neural network architecture augmented with external memory, drawing inspiration from Turing machines and cognitive models of working memory. The NTM architecture consists of a neural network controller coupled with an external memory bank, enabling it to learn and execute algorithms. The key innovation is the end-to-end differentiability of the entire system, allowing for efficient training via gradient descent.

The NTM architecture comprises two primary components: a neural network controller, which can be either a feedforward network or a recurrent network such as an LSTM (Long Short-Term Memory), and a memory matrix. The controller interacts with the external environment through input and output vectors and accesses the memory matrix using read and write heads. These heads emit parameters that define normalized weightings over memory locations, determining the extent to which each location is read from or written to.
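As a rough illustration of this controller-memory cycle (not the paper's implementation), the sketch below shows one time step with a single read head and a single write head. `controller_forward` and `emit_head_params` are hypothetical stand-ins for the controller network and the head parameter layers; the addressing hidden inside `emit_head_params` is detailed in the sections that follow.

```python
import numpy as np

# Simplified sketch of one NTM time step (assumed shapes and names).
N, M = 128, 20                       # number of memory locations, vector size per location

def ntm_step(x_t, memory, prev_read, prev_w, controller_forward, emit_head_params):
    # The controller sees the external input together with the last read vector.
    h_t = controller_forward(np.concatenate([x_t, prev_read]))
    # Heads emit normalised weightings over the N locations, plus erase/add
    # vectors for the write head and the external output.
    read_w, write_w, erase_v, add_v, y_t = emit_head_params(h_t, memory, prev_w)
    # Read: convex combination of memory rows (read equation below).
    read_vec = read_w @ memory                              # shape (M,)
    # Write: erase, then add (write equations below).
    memory = memory * (1.0 - np.outer(write_w, erase_v))
    memory = memory + np.outer(write_w, add_v)
    return y_t, memory, read_vec, {"read": read_w, "write": write_w}
```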

The read operation involves computing a weighted sum of the memory rows, where the weights are determined by the read head's weighting vector $\mathbf{w}_t$. The read vector $\mathbf{r}_t$ is calculated as:

$$\mathbf{r}_t = \sum_i w_t(i)\, \mathbf{M}_t(i)$$

  • $\mathbf{r}_t$ is the read vector at time $t$
  • $w_t(i)$ is the weighting for memory location $i$ at time $t$
  • $\mathbf{M}_t(i)$ is the row vector at memory location $i$ at time $t$
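In code, the read reduces to a single matrix-vector product. A minimal NumPy sketch, with assumed shapes:

```python
import numpy as np

# Sketch of the read equation: the read vector is a convex combination
# of the memory rows under the head's weighting.
def read(memory, w):
    """memory: (N, M) matrix M_t; w: (N,) normalised weighting w_t."""
    return w @ memory               # r_t = sum_i w_t(i) M_t(i), shape (M,)
```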

The write operation is decomposed into an erase followed by an add, inspired by the gating mechanisms in LSTMs. The erase operation modifies the memory content based on the write head's weighting $\mathbf{w}_t$ and an erase vector $\mathbf{e}_t$:

$$\tilde{\mathbf{M}}_t(i) = \mathbf{M}_{t-1}(i)\,\big[\mathbf{1} - w_t(i)\,\mathbf{e}_t\big]$$

  • $\tilde{\mathbf{M}}_t(i)$ is the intermediate memory vector at location $i$ after the erase operation at time $t$
  • $\mathbf{M}_{t-1}(i)$ is the memory vector at location $i$ at time $t-1$
  • $w_t(i)$ is the weighting for memory location $i$ at time $t$
  • $\mathbf{e}_t$ is the erase vector at time $t$
  • $\mathbf{1}$ is a vector of all ones

The add operation then adds a weighted version of the add vector $\mathbf{a}_t$ to the memory:

$$\mathbf{M}_t(i) = \tilde{\mathbf{M}}_t(i) + w_t(i)\,\mathbf{a}_t$$

  • $\mathbf{M}_t(i)$ is the updated memory vector at location $i$ at time $t$
  • $\tilde{\mathbf{M}}_t(i)$ is the intermediate memory vector at location $i$ after the erase operation at time $t$
  • $w_t(i)$ is the weighting for memory location $i$ at time $t$
  • $\mathbf{a}_t$ is the add vector at time $t$
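A minimal NumPy sketch of the combined erase-then-add update, applied to all rows at once (shapes are assumptions; the erase vector has elements in (0, 1), the add vector is unrestricted):

```python
import numpy as np

# Sketch of the write: erase then add, vectorised over the N memory rows.
def write(memory, w, e, a):
    """memory: (N, M); w: (N,) weighting; e: (M,) erase vector; a: (M,) add vector."""
    erased = memory * (1.0 - np.outer(w, e))   # M~_t(i) = M_{t-1}(i)[1 - w_t(i) e_t]
    return erased + np.outer(w, a)             # M_t(i)  = M~_t(i) + w_t(i) a_t
```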

The paper details two complementary addressing mechanisms: content-based addressing and location-based addressing. Content-based addressing focuses on memory locations whose content is similar to a key vector $\mathbf{k}_t$ emitted by the head. The similarity is quantified using a similarity measure $K[\cdot, \cdot]$, such as cosine similarity:

$$K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}$$

  • $\mathbf{u}$ and $\mathbf{v}$ are the vectors whose similarity is being computed

The content-based weighting $w^c_t(i)$ is then computed as:

$$w^c_t(i) = \frac{\exp\big(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(i)]\big)}{\sum_j \exp\big(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(j)]\big)}$$

  • $w^c_t(i)$ is the content-based weighting for location $i$ at time $t$
  • $\beta_t$ is the key strength at time $t$
  • $K[\cdot, \cdot]$ is the similarity measure
  • $\mathbf{k}_t$ is the key vector at time $t$
  • $\mathbf{M}_t(i)$ is the memory vector at location $i$ at time $t$
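Together these two equations amount to a softmax over scaled cosine similarities. A minimal NumPy sketch (the epsilon guard against zero-norm rows and the max-subtraction are implementation conveniences, not from the paper):

```python
import numpy as np

# Sketch of content-based addressing: cosine similarity between the key and
# each memory row, scaled by the key strength beta, then normalised by softmax.
def content_weighting(memory, key, beta, eps=1e-8):
    """memory: (N, M); key: (M,); beta: positive scalar key strength."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    z = beta * sims
    z = z - z.max()                  # stabilise the softmax numerically
    w = np.exp(z)
    return w / w.sum()
```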

Location-based addressing, on the other hand, facilitates iteration and random-access jumps across memory locations. It involves an interpolation gate $g_t$, a shift weighting $\mathbf{s}_t$, and a sharpening parameter $\gamma_t$. The interpolation gate blends the content-based weighting with the previous weighting:

$$\mathbf{w}^g_t = g_t\,\mathbf{w}^c_t + (1 - g_t)\,\mathbf{w}_{t-1}$$

  • $\mathbf{w}^g_t$ is the gated weighting at time $t$
  • $g_t$ is the interpolation gate at time $t$
  • $\mathbf{w}^c_t$ is the content-based weighting at time $t$
  • $\mathbf{w}_{t-1}$ is the weighting at the previous time step

The shift weighting $\mathbf{s}_t$ defines a distribution over the allowed integer shifts, and a circular convolution is applied to rotate the gated weighting:

$$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w^g_t(j)\, s_t(i-j)$$

  • $\tilde{w}_t(i)$ is the rotated weighting for location $i$ at time $t$
  • $w^g_t(j)$ is the gated weighting for location $j$ at time $t$
  • $s_t(i-j)$ is the shift weighting for the shift from location $j$ to location $i$ at time $t$

Finally, the sharpening parameter $\gamma_t$ sharpens the weighting:

$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$$

  • $w_t(i)$ is the final weighting for location $i$ at time $t$
  • $\tilde{w}_t(i)$ is the rotated weighting for location $i$ at time $t$
  • $\gamma_t$ is the sharpening parameter at time $t$
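Putting the three location-based steps together (interpolation, circular shift, sharpening), a minimal NumPy sketch; here the shift distribution `s` is assumed to be given over all $N$ positions and indexed modulo $N$, which matches the convolution as written:

```python
import numpy as np

# Sketch of location-based addressing: gate against the previous weighting,
# rotate by circular convolution with the shift distribution, then sharpen.
def location_weighting(w_c, w_prev, g, s, gamma):
    """w_c, w_prev, s: (N,) vectors; g: gate in [0, 1]; gamma: sharpening >= 1."""
    w_g = g * w_c + (1.0 - g) * w_prev                      # interpolation
    N = len(w_g)
    w_tilde = np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                        for i in range(N)])                 # circular convolution
    w_sharp = w_tilde ** gamma                              # sharpening
    return w_sharp / w_sharp.sum()
```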

The paper presents several experiments to evaluate the NTM's ability to learn algorithms. These include:

  • Copy Task: The NTM learns to copy sequences of binary vectors, demonstrating its ability to store and recall information over long time periods. The NTM significantly outperforms LSTM in this task and generalizes well to longer sequences than seen during training.
  • Repeat Copy Task: The NTM learns to repeat a sequence of binary vectors a specified number of times, showcasing its ability to learn nested functions or "for loops". While both NTM and LSTM can solve the task, NTM generalizes better to longer sequences and a higher number of repetitions.
  • Associative Recall: The NTM learns to recall the next item in a sequence when queried with a previous item, demonstrating its capability for learning indirection and managing more complex data structures. NTM learns significantly faster than LSTM and generalizes much better to longer sequences.
  • Dynamic N-Grams: The NTM is tested on its ability to rapidly adapt to new predictive distributions by emulating a conventional N-Gram model. The NTM achieves a small but significant performance advantage over LSTM.
  • Priority Sort: The NTM learns to sort a sequence of binary vectors based on their priority ratings, demonstrating its ability to perform elementary algorithms. The NTM with both feedforward and LSTM controllers substantially outperforms LSTM on this task.

Across these experiments, the NTM consistently demonstrates superior performance compared to LSTM, particularly in tasks requiring long-term memory, algorithmic reasoning, and generalization. The paper also analyzes the internal operations of the NTM, revealing that it learns compact internal programs and utilizes its memory in a manner analogous to how a human programmer would approach similar tasks.
