Overparameterized Neural Networks Implement Associative Memory (1909.12362v2)
Abstract: Identifying computational mechanisms for memorization and retrieval of data is a long-standing problem at the intersection of machine learning and neuroscience. Our main finding is that standard overparameterized deep neural networks trained using standard optimization methods implement such a mechanism for real-valued data. Empirically, we show that: (1) overparameterized autoencoders store training samples as attractors, and thus, iterating the learned map leads to sample recovery; (2) the same mechanism allows for encoding sequences of examples, and serves as an even more efficient mechanism for memory than autoencoding. Theoretically, we prove that when trained on a single example, autoencoders store the example as an attractor. Lastly, by treating a sequence encoder as a composition of maps, we prove that sequence encoding provides a more efficient mechanism for memory than autoencoding.
Summary
- The paper demonstrates that overparameterized neural networks inherently implement associative memory by storing training data as attractors or limit cycles during standard optimization.
- Specifically, overparameterized autoencoders store individual examples as attractors upon iteration, while sequence encoders store sequences as stable limit cycles, offering a more efficient memory.
- This memory mechanism emerges automatically from training dynamics without requiring explicit energy function construction, supported by empirical evidence and theoretical proofs.
The paper investigates associative memory implemented by overparameterized neural networks. It posits that these networks, when trained with standard optimization, inherently develop mechanisms for data memorization and retrieval using real-valued data.
Key findings reported in the paper are:
- Overparameterized autoencoders store training samples as attractors. Iterating the learned map leads to sample recovery.
- Sequence encoders efficiently encode sequences of examples, serving as a more efficient memory mechanism than autoencoding.
- Theoretically, autoencoders store single training examples as attractors. Sequence encoding is more efficient than autoencoding.
The paper introduces the concept of attractor networks, where patterns are stored as attractors of a dynamical system. Hopfield networks, which construct an energy function with local minima corresponding to desired patterns, serve as an early example. The iterative minimization of this energy function allows for the retrieval of stored patterns. While Hopfield networks are limited to binary patterns, subsequent works extended the idea of storing training examples as local minima for more complex data.
The authors contrast their approach with energy-based methods, noting that the storage and retrieval mechanisms emerge automatically from training, without requiring the construction and minimization of an energy function. They emphasize that interpolation alone is insufficient for implementing associative memory, as memorization requires the ability to recover training data and associate new inputs with training examples.
The work demonstrates that examples can be recovered by iterating the learned map. Given a set of training examples $\{x^{(i)}\}_{i=1}^n \subset \mathbb{R}^d$ and a class of continuous functions $\mathcal{F} = \{f : \mathbb{R}^d \to \mathbb{R}^d\}$ implemented by an overparameterized neural network, minimizing the autoencoding objective leads to the training examples being stored as attractors:
$\arg\min_{f \in \mathcal{F}} \sum_{i=1}^n \|f(x^{(i)}) - x^{(i)}\|^2$
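As a concrete illustration of this setting, the sketch below (PyTorch is an assumption; the architecture, nonlinearity, and optimizer settings are illustrative, not the paper's exact configuration) trains a wide network on a few samples under the autoencoding objective until it (nearly) interpolates them. Retrieval by iterating the learned map is sketched further below.

```python
# Minimal sketch (not the authors' code): train a wide MLP autoencoder on a handful of
# samples until it (nearly) interpolates them.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n, width = 64, 10, 1024            # data dimension, number of samples, hidden width (assumed)
X = torch.randn(n, d)                 # stand-in for real-valued training samples

f = nn.Sequential(                    # overparameterized: far more parameters than samples
    nn.Linear(d, width), nn.Tanh(),
    nn.Linear(width, width), nn.Tanh(),
    nn.Linear(width, d),
)
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
for step in range(20000):             # minimize the autoencoding objective above
    loss = ((f(X) - X) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    if loss.item() < 1e-8:            # stop once the samples are (numerically) interpolated
        break
```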
Attractors arise without specific regularization. The paper includes empirical evidence, such as a network storing 500 images from ImageNet-64 as attractors. A proof is presented for overparameterized networks trained on single examples. A modification of the objective leads to associative memory for sequences. Given a sequence of training examples $\{x^{(i)}\}_{i=1}^n \subset \mathbb{R}^d$, minimizing the sequence encoding objective leads to the training sequence being stored as a stable limit cycle:
$\arg\min_{f \in \mathcal{F}} \sum_{i=1}^n \|f(x^{(i)}) - x^{((i \bmod n) + 1)}\|^2$.
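Continuing the sketch above, the only change relative to autoencoding is the target: each example is regressed onto its successor in the cycle. A hypothetical construction of the shifted targets:

```python
# Sequence-encoding targets (sketch, reusing f and X from the previous snippet):
# row i of the target matrix is x^((i mod n) + 1), obtained by rolling X one step.
targets = torch.roll(X, shifts=-1, dims=0)    # x^(1)->x^(2), ..., x^(n)->x^(1)
seq_loss = ((f(X) - targets) ** 2).sum()      # replaces the autoencoding loss above
```

Training then proceeds exactly as before, with seq_loss in place of the autoencoding loss.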
The paper posits that sequence encoding offers a more efficient mechanism for memorization and retrieval. By considering a sequence encoder as a composition of maps, the authors prove that sequence encoders are more contractive to a sequence of examples than autoencoders are to individual examples.
The paper relates this work to autoencoders used for manifold learning. Contractive and denoising autoencoders add regularizers to make functions contractive towards the training data. However, these autoencoders are typically used in the underparameterized regime, where they cannot interpolate the training examples as fixed points. Overparameterized neural networks, in contrast, can interpolate the training data when trained with gradient descent methods. The paper takes a dynamical systems perspective to study overparameterized autoencoders and sequence encoders, showing that they store training examples as fixed points (and training sequences as limit cycles) and that these fixed points are attractors (and the limit cycles are stable). This mechanism does not require setting up an energy function and is a direct consequence of training an overparameterized network.
Relevant definitions from dynamical systems used in the analysis:
- A point $x \in \mathbb{R}^d$ is a fixed point of $f$ if $f(x) = x$.
- A fixed point $x^* \in \mathbb{R}^d$ is an attractor of $f: \mathbb{R}^d \to \mathbb{R}^d$ if there exists an open neighborhood, $O$, of $x^*$, such that for any $x \in O$, the sequence $\{f^k(x)\}_{k \in \mathbb{N}}$ converges to $x^*$ as $k \to \infty$.
- A finite set $X^* = \{x^{(i)}\}_{i=1}^n \subset \mathbb{R}^d$ is a stable discrete limit cycle of a smooth function $f: \mathbb{R}^d \to \mathbb{R}^d$ if: (1) $f(x^{(i)}) = x^{((i \bmod n) + 1)}$ for all $i \in \{1, \dots, n\}$; (2) there exists an open neighborhood, $O$, of $X^*$ such that for any $x \in O$, $X^*$ is the limit set of $\{f^k(x)\}_{k=1}^{\infty}$.
The paper also provides propositions that give sufficient conditions for verifying if a fixed point is an attractor or a finite sequence of points forms a limit cycle.
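These conditions are straightforward to test numerically. The sketch below (reusing the trained map f and data X from the earlier snippets) checks whether a training example is an attractor by computing the eigenvalues of the Jacobian of the learned map at that point:

```python
# Sufficient condition for an attractor (sketch): all eigenvalues of the Jacobian of the
# learned map at the fixed point have magnitude strictly less than 1.
from torch.autograd.functional import jacobian

def is_attractor(f, x):
    J = jacobian(f, x)                          # (d, d) Jacobian of f at x
    eig_mags = torch.linalg.eigvals(J).abs()    # eigenvalue magnitudes
    return bool((eig_mags < 1).all()), eig_mags.max().item()

stored, top_eig = is_attractor(f, X[0])
print(stored, top_eig)                          # True with top_eig < 1 if X[0] is an attractor
```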
The authors present empirical evidence that attractors arise in autoencoders across common architectures and optimization methods. They demonstrate an overparameterized autoencoder storing 500 images from ImageNet-64 as attractors, achieved by training a network of depth 10 and width 1024 with the cosid nonlinearity on 500 training examples using the Adam optimizer to a loss of at most $10^{-8}$. They verified that the 500 training images were stored as attractors by checking that the magnitudes of all eigenvalues of the Jacobian at each example were less than 1. Iterating the trained autoencoder map from corrupted inputs converges to individual training examples, and the recovery rate under various forms of corruption is high.
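A retrieval loop in the same spirit (a sketch; the corruption model, iteration count, and tolerance are assumptions) iterates the trained map from a corrupted input until the iterates stop moving and reports the nearest training example:

```python
# Retrieval by iteration (sketch): starting from a corrupted input, repeatedly apply the
# trained map; if the corresponding training example is an attractor and the start lies in
# its basin, the iterates converge back to that example.
def retrieve(f, x0, n_iters=1000, tol=1e-6):
    z = x0.clone()
    with torch.no_grad():
        for _ in range(n_iters):
            z_next = f(z)
            if torch.norm(z_next - z) < tol:    # (approximate) fixed point reached
                return z_next
            z = z_next
    return z

corrupted = X[3] + 0.2 * torch.randn(d)         # additive noise; the paper also masks inputs
recovered = retrieve(f, corrupted)
print(int(((X - recovered) ** 2).sum(dim=1).argmin()))   # index of the recovered example
```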
The paper also discusses the possibility of spurious attractors, i.e. attractors other than the training examples. The authors performed a thorough analysis of the attractor phenomenon across a number of common architectures, optimization methods, and initialization schemes. They analyzed the number of training examples stored as attractors when trained on 100 black and white images from CIFAR10 under various conditions. Attractors arise in all settings for which training converged to a sufficiently low loss within 1,000,000 epochs.
The authors present an example of an overparameterized autoencoder storing training examples as attractors in the 2D setting, where the basins of attraction can easily be visualized.
Additionally, the paper demonstrates via examples that modifying the autoencoder objective to encode sequences implements a form of associative memory for sequences. A network was trained to encode 389 frames of size 128×128 from the Disney film “Steamboat Willie” by mapping frame i to frame i + 1 (mod 389). Iterating the trained network starting from random noise yields the original video. The authors also encoded two 10-digit MNIST sequences: one counting up from digit 0 to 9 and the other counting down from digit 9 to 0. The maximal eigenvalue magnitudes of the Jacobian of the trained encoder composed 10 times are 0.0034 and 0.0033 for the first and second sequence, respectively. Iterating from Gaussian noise recovers both training sequences.
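In the same spirit, a sketch of the stability check and sequence retrieval (reusing a map f trained on the sequence objective above; the iteration counts are arbitrary):

```python
# Limit-cycle stability (sketch): check the Jacobian of the n-fold composition at a point
# on the cycle, mirroring the paper's check on the 10-fold composed MNIST encoders.
def compose(f, n):
    def f_n(z):
        for _ in range(n):
            z = f(z)
        return z
    return f_n

n = X.shape[0]
J_cycle = jacobian(compose(f, n), X[0])
print(torch.linalg.eigvals(J_cycle).abs().max().item())  # < 1 indicates a stable limit cycle

z = torch.randn(d)                               # retrieval: iterate the map from Gaussian noise
with torch.no_grad():
    for _ in range(20 * n):                      # iterates approach and then traverse the cycle
        z = f(z)
```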
The experiments indicate that memorization and retrieval of training examples can be performed more efficiently through sequence encoding than autoencoding. Increasing network depth and width leads to an increase in the number of training examples / sequences stored as attractors / limit cycles.
The authors provide theoretical support by proving that, when trained on a single example, overparameterized autoencoders store the example as an attractor.
Let $f(z) = W_1 \phi(W_2 z)$ represent a 1-hidden-layer autoencoder with elementwise nonlinearity $\phi$ and weights $W_1 \in \mathbb{R}^{k_0 \times k}$ and $W_2 \in \mathbb{R}^{k \times k_0}$, applied to $z \in \mathbb{R}^{k_0}$. The loss function is:
$L(x, f) = \frac{1}{2}\|x - f(x)\|_2^2$.
Two invariants of gradient descent are identified:
- Invariant 1: If $W_1$ and $W_2$ are initialized to be rank-1 matrices of the form $x (u^{(0)})^T$ and $v^{(0)} x^T$ respectively, then $W_1^{(t)} = x (u^{(t)})^T$ and $W_2^{(t)} = v^{(t)} x^T$ for all time-steps $t > 0$.
- Invariant 2: If, in addition, all weights in each row of $W_1$ and $W_2$ are initialized to be equal, they remain equal throughout training.
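Invariant 1 is easy to verify numerically. A minimal sketch (the nonlinearity, sizes, and step size are assumptions) initializes the weights in the stated rank-1 form, runs plain gradient descent on the single-example loss, and checks that the rank-1 structure persists:

```python
# Numerical check of Invariant 1 (sketch): with W1 = x u^T and W2 = v x^T at initialization,
# full-batch gradient descent on L(x, f) = 0.5 * ||x - W1 phi(W2 x)||^2 keeps both weight
# matrices rank 1, with W1's columns and W2's rows proportional to x.
import torch

torch.manual_seed(0)
k0, k = 8, 512                                  # input dimension and hidden width (assumed)
x = torch.randn(k0); x = x / x.norm()
u, v = torch.randn(k), torch.randn(k)
W1 = (x[:, None] * u[None, :]).requires_grad_(True)   # x u^T, shape (k0, k)
W2 = (v[:, None] * x[None, :]).requires_grad_(True)   # v x^T, shape (k, k0)
phi = torch.tanh                                # elementwise nonlinearity (assumed)

for _ in range(200):
    loss = 0.5 * ((x - W1 @ phi(W2 @ x)) ** 2).sum()
    loss.backward()
    with torch.no_grad():
        W1 -= 1e-2 * W1.grad; W2 -= 1e-2 * W2.grad
        W1.grad.zero_(); W2.grad.zero_()

print(torch.linalg.matrix_rank(W1.detach()).item(),   # both remain 1 (up to numerical precision)
      torch.linalg.matrix_rank(W2.detach()).item())
```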
Theorems 1 and 2 provide closed-form solutions for the weights and for the top eigenvalue of the Jacobian at $x$, denoted by $\lambda_1(J(f(x)))$.
The paper shows that attractors arise as a result of training and are not simply a consequence of interpolation by a neural network with a certain architecture.
The authors also prove that sequence encoding provides a more efficient mechanism for memory than autoencoding by analyzing sequence encoders as a composition of maps.
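The underlying observation (a paraphrase of the intuition, not the paper's full argument) is that, by the chain rule, the Jacobian of the $n$-fold composition around the cycle $x^{(1)} \to x^{(2)} \to \dots \to x^{(n)} \to x^{(1)}$ is a product of per-step Jacobians, so per-step contraction factors multiply:

$J(f^{n})(x^{(1)}) = J(f)(x^{(n)})\, J(f)(x^{(n-1)}) \cdots J(f)(x^{(1)}), \qquad \|J(f^{n})(x^{(1)})\| \le \prod_{i=1}^{n} \|J(f)(x^{(i)})\|$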
Theorems 3 and 4 generalize the earlier results to the case of training a network to map an example $x^{(i)} \in \mathbb{R}^{k_0}$ to an example $x^{(i+1)} \in \mathbb{R}^{k_0}$. They also provide a sufficient condition for when the composition of these individual networks stores the sequence of training examples $\{x^{(i)}\}_{i=1}^n$ as a stable limit cycle.