The paper introduces EM-LLM, a novel architecture designed to enhance the ability of LLMs to process extended contexts by integrating principles of human episodic memory and event cognition. EM-LLM addresses the limitations of existing LLMs in maintaining coherence and accuracy over long sequences, which stem from challenges inherent to Transformer-based architectures: difficulty extrapolating to contexts longer than the training window and the computational cost of softmax attention over extended token sequences.
The core idea is to segment sequences of tokens into episodic events, drawing inspiration from how the human brain organizes and retrieves episodic experiences. This segmentation is achieved through a combination of Bayesian surprise and graph-theoretic boundary refinement. The episodic events are then retrieved using a two-stage memory process that combines similarity-based and temporally contiguous retrieval, mimicking the way humans access relevant information.
The architecture of EM-LLM divides the context into initial tokens, evicted tokens, and local context. The local context, which contains the most recent tokens, resides within the typical context window of the underlying LLM and utilizes full softmax attention. The evicted tokens, which constitute the majority of past tokens, are managed by the memory model, while the initial tokens act as attention sinks.
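For concreteness, the following is a minimal sketch (not the authors' code) of this three-way context partition; the function name and default sizes are illustrative assumptions.

```python
def partition_context(token_ids, n_init=4, n_local=4096):
    """Split a token sequence into initial tokens (attention sinks),
    evicted tokens (managed by the episodic memory model), and local context
    (most recent tokens, attended to with full softmax attention)."""
    cutoff = max(n_init, len(token_ids) - n_local)
    initial = token_ids[:n_init]        # attention-sink tokens
    evicted = token_ids[n_init:cutoff]  # past tokens handed to the memory model
    local = token_ids[cutoff:]          # recent tokens inside the LLM's context window
    return initial, evicted, local
```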
The memory formation process involves segmenting the sequence of tokens into individual memory units that represent episodic events. Boundaries are initially determined dynamically based on the surprise of the model during inference, quantified by the negative log-likelihood of observing the current token given the previous tokens:
$$-\log P(x_t \mid x_1, \ldots, x_{t-1}; \theta) > T,$$

where:
- $P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$ is the conditional probability of token $x_t$ given the preceding tokens $x_1$ to $x_{t-1}$ and model parameters $\theta$,
- $T$ is a threshold value.
The threshold $T$ is defined as:

$$T = \mu_{t-\tau:t} + \gamma \sqrt{\sigma^2_{t-\tau:t}},$$

where:
- $\mu_{t-\tau:t}$ is the mean surprise over a window of length $\tau$ offset from the current position $t$,
- $\sigma^2_{t-\tau:t}$ is the variance of surprise over the same window,
- $\gamma$ is a scaling factor.
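As an illustration, here is a minimal sketch of surprise-based boundary detection under the definitions above; the window size, $\gamma$ default, and function name are assumptions, not the authors' implementation.

```python
import numpy as np

def surprise_boundaries(neg_log_probs, window=128, gamma=1.0):
    """Return indices of tokens flagged as event boundaries.

    neg_log_probs[t] is the model's surprise -log P(x_t | x_1..x_{t-1}; theta).
    A token is a boundary when its surprise exceeds the moving threshold
    T = mu + gamma * sqrt(var), computed over the preceding `window` tokens.
    """
    neg_log_probs = np.asarray(neg_log_probs)
    boundaries = []
    for t in range(1, len(neg_log_probs)):
        history = neg_log_probs[max(0, t - window):t]
        threshold = history.mean() + gamma * np.sqrt(history.var())
        if neg_log_probs[t] > threshold:
            boundaries.append(t)
    return boundaries
```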
These boundaries are then refined to maximize cohesion within memory units and separation of memory content across them. The refinement leverages graph-theoretic metrics, treating the similarity between attention keys as a weighted adjacency matrix.
Two graph-clustering metrics are employed: modularity and conductance. Modularity, $\mathcal{M}$, is defined as:

$$\mathcal{M}(A^h, \mathcal{B}) = \frac{1}{2m}\sum_{i,j}\left[A^h_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j),$$

where:
- $A^h$ is the adjacency matrix for attention head $h$, with $A^h_{ij}$ the similarity between keys $i$ and $j$,
- $\mathcal{B}$ is the set of event boundaries,
- $m = \frac{1}{2}\sum_{i,j} A^h_{ij}$ is the total edge weight in the graph and $k_i = \sum_j A^h_{ij}$ is the weighted degree of node $i$,
- $c_i$ is the community to which node $i$ is assigned by the boundaries in $\mathcal{B}$,
- $\delta$ is the Kronecker delta function.
Conductance, $\mathcal{C}$, is defined as:

$$\mathcal{C}(A^h, S) = \frac{\sum_{i \in S,\, j \in \bar{S}} A^h_{ij}}{\min\big(\mathrm{vol}(S), \mathrm{vol}(\bar{S})\big)},$$

where:
- $S$ is a subset of all nodes $V$ in the induced graph, with $\bar{S} = V \setminus S$,
- $\mathrm{vol}(S) = \sum_{i \in S}\sum_{j \in V} A^h_{ij}$ is the total weight of edges incident to nodes in $S$,
- $\sum_{i \in S,\, j \in \bar{S}} A^h_{ij}$ is the total weight of edges crossing the cut between $S$ and $\bar{S}$.
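For concreteness, the sketch below computes both metrics from a key-similarity adjacency matrix. It assumes NumPy and the standard definitions of modularity and conductance, and is not the authors' code; boundary refinement would then shift each boundary within a local neighborhood to maximize modularity or minimize conductance.

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity of a community assignment over a weighted adjacency matrix.

    A[i, j] is the (non-negative) similarity between attention keys i and j;
    labels[i] is the event/community induced by the boundaries for node i.
    """
    labels = np.asarray(labels)
    m = A.sum() / 2.0                                 # total edge weight
    k = A.sum(axis=1)                                 # weighted node degrees
    same = labels[:, None] == labels[None, :]         # Kronecker delta over node pairs
    return float(((A - np.outer(k, k) / (2 * m)) * same).sum() / (2 * m))

def conductance(A, S):
    """Conductance of a node subset S (boolean mask) in the graph defined by A."""
    S = np.asarray(S, dtype=bool)
    cut = A[S][:, ~S].sum()                           # weight crossing the cut
    return float(cut / min(A[S].sum(), A[~S].sum()))  # normalize by the smaller volume
```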
For memory retrieval, the approach integrates similarity-based retrieval with mechanisms that capture temporal contiguity and asymmetry effects. Retrieval proceeds in two stages: events are first retrieved via k-NN search based on dot-product similarity to form a similarity buffer, and their temporally neighboring events are then enqueued into a fixed-size contiguity buffer.
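A minimal sketch of this two-stage retrieval follows, assuming each event is summarized by a single representative key vector; the buffer sizes, event representation, and function name are illustrative rather than taken from the paper.

```python
import numpy as np

def retrieve_events(event_keys, query, k_sim=4, k_contig=2):
    """Two-stage retrieval over per-event representative key vectors.

    Stage 1: k-NN by dot-product similarity fills a similarity buffer.
    Stage 2: temporal neighbors of those events fill a contiguity buffer.
    Returns the indices of events whose key-value pairs would be re-loaded
    into the attention context.
    """
    event_keys = np.asarray(event_keys)                    # shape (num_events, d)
    scores = event_keys @ query                            # dot-product similarity
    similarity_buffer = list(np.argsort(scores)[::-1][:k_sim])

    contiguity_buffer = []
    for idx in similarity_buffer:                          # enqueue temporal neighbors
        for neighbor in (idx - 1, idx + 1):
            if (0 <= neighbor < len(event_keys)
                    and neighbor not in similarity_buffer
                    and neighbor not in contiguity_buffer):
                contiguity_buffer.append(neighbor)
    return similarity_buffer, contiguity_buffer[:k_contig]
```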
The model was evaluated on the LongBench dataset, where it outperformed the state-of-the-art InfLLM model with an overall relative improvement of 4.3%. Notably, it achieved a 33% improvement on the PassageRetrieval task, which requires accurate recall of detailed information from a large context.
The paper also presents an analysis of human-annotated podcast transcripts, which reveals strong correlations between the LLM's attention heads and human-perceived event structure. Segmentation quality and correlation with human data were compared using modularity and conductance, as well as the ratio between intra- and inter-community similarity, defined as:

$$\frac{\mathrm{intra}}{\mathrm{inter}},$$

where:
- $\mathrm{intra} = \frac{1}{|\mathcal{C}|}\sum_{c \in \mathcal{C}} \frac{1}{|c|^2}\sum_{i, j \in c} A^h_{ij}$ is the average similarity within a community,
- $\mathrm{inter} = \frac{1}{|\mathcal{C}|}\sum_{c \in \mathcal{C}} \frac{1}{|c|\,|V \setminus c|}\sum_{i \in c,\, j \notin c} A^h_{ij}$ is the average similarity between a community and the rest of the graph,
- $c \in \mathcal{C}$ represents the tokens in a community.
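A short sketch of this ratio, reusing the key-similarity adjacency matrix and community labels from the earlier sketches; the per-community averaging shown is one plausible reading of the definition, not necessarily the paper's exact normalization.

```python
import numpy as np

def intra_inter_ratio(A, labels):
    """Ratio of average within-community to average between-community similarity."""
    labels = np.asarray(labels)
    intra, inter = [], []
    for c in np.unique(labels):
        members = labels == c
        intra.append(A[members][:, members].mean())   # similarity inside the community
        inter.append(A[members][:, ~members].mean())  # similarity to all other tokens
    return float(np.mean(intra) / np.mean(inter))
```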
The results indicate that surprise-only segmentation already performs comparably to human segmentation on these metrics, and that adding the boundary-refinement algorithm improves performance further. Moreover, surprise-based methods identify event boundaries that are closest to those perceived by humans.
The authors compared their method with other long-context models, including approaches that address the limitations of softmax attention and positional encodings or that improve computational efficiency. Unlike InfLLM, which segments the context into fixed-size memory units, EM-LLM dynamically determines token groupings, similar to human memory formation.
The authors suggest that future research could extend surprise-based segmentation and boundary refinement to operate independently at each Transformer layer, and investigate how EM-LLM could be used to enable imagination and future thinking.