Human-like Episodic Memory for Infinite Context LLMs (2407.09450v2)

Published 12 Jul 2024 in cs.AI, cs.CL, cs.LG, and q-bio.NC

Abstract: LLMs have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs with no fine-tuning, enabling them to handle practically infinite context lengths while maintaining computational efficiency. EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement in an online fashion. When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient and human-like access to relevant information. Experiments on the LongBench and InfiniteBench benchmarks demonstrate EM-LLM's superior performance, consistently outperforming the state-of-the-art retrieval model InfLLM across various baseline LLMs. In addition, EM-LLM outperforms its popular counterpart, RAG, in a wide range of tasks, while requiring similar resources. Notably, EM-LLM's performance even surpasses full-context models in most tasks, while successfully performing retrieval across 10 million tokens - a scale computationally infeasible for such models. Finally, our analysis reveals strong correlations between EM-LLM's event segmentation and human-perceived events, suggesting a bridge between this artificial system and its biological counterpart, thereby offering a novel computational framework for exploring human memory mechanisms.

The paper introduces EM-LLM, a novel architecture designed to enhance the ability of LLMs to process extended contexts by integrating principles of human episodic memory and event cognition. EM-LLM addresses the limitations of existing LLMs in maintaining coherence and accuracy over long sequences, limitations that stem from challenges in Transformer-based architectures, such as poor extrapolation to contexts longer than the training window and the computational cost of softmax attention over extended token sequences.

The core idea is to segment sequences of tokens into episodic events, drawing inspiration from how the human brain organizes and retrieves episodic experiences. This segmentation is achieved through a combination of Bayesian surprise and graph-theoretic boundary refinement. The episodic events are then retrieved using a two-stage memory process that combines similarity-based and temporally contiguous retrieval, mimicking the way humans access relevant information.

The architecture of EM-LLM divides the context into initial tokens, evicted tokens, and local context. The local context, which contains the most recent tokens, resides within the typical context window of the underlying LLM and utilizes full softmax attention. The evicted tokens, which constitute the majority of past tokens, are managed by the memory model, while the initial tokens act as attention sinks.
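
A minimal sketch of this three-way split (hypothetical function and parameter names; the actual system operates on key-value caches and per-head memory stores rather than raw token lists) might look like the following, assuming the sequence is longer than the combined sink and local-window sizes:

```python
def partition_context(tokens, n_init=4, local_window=4096):
    """Split a long token sequence into the three spans used by EM-LLM-style memory.

    Assumes len(tokens) > n_init + local_window (illustrative only).
    """
    initial = tokens[:n_init]                 # attention-sink tokens, always kept in view
    local = tokens[-local_window:]            # most recent tokens, full softmax attention
    evicted = tokens[n_init:-local_window]    # older tokens handed off to the memory model
    return initial, evicted, local
```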

The memory formation process involves segmenting the sequence of tokens into individual memory units that represent episodic events. Boundaries are initially determined dynamically based on the surprise of the model during inference, quantified by the negative log-likelihood of observing the current token given the previous tokens:

$$-\log P(x_t \mid x_1, \dots, x_{t-1}; \theta) > T,$$

where:

  • $P(x_t \mid x_1, \dots, x_{t-1}; \theta)$ is the conditional probability of token $x_t$ given the preceding tokens $x_1$ to $x_{t-1}$ and model parameters $\theta$,
  • $T$ is a threshold value.

The threshold $T$ is defined as (see the code sketch after these definitions):

$$T = \mu_{t-\tau:t} + \gamma\,\sigma_{t-\tau:t},$$

where:

  • $\mu_{t-\tau:t}$ is the mean surprise over a window of the preceding $\tau$ tokens,
  • $\sigma_{t-\tau:t}$ is the standard deviation of surprise over the same window,
  • $\gamma$ is a scaling factor.
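
As a rough illustration of this boundary criterion (not the authors' implementation; the function name, the use of NumPy, and the exact window handling are assumptions), the sketch below flags a token as a boundary whenever its surprise exceeds the rolling mean plus $\gamma$ standard deviations computed over the preceding $\tau$ tokens:

```python
import numpy as np

def surprise_boundaries(nll, tau=128, gamma=1.0):
    """Return token positions whose surprise exceeds a rolling threshold.

    nll   : 1-D array of per-token negative log-likelihoods ("surprise").
    tau   : length of the trailing window used to estimate the threshold.
    gamma : scaling factor on the window's standard deviation.
    """
    boundaries = []
    for t in range(len(nll)):
        window = nll[max(0, t - tau):t]
        if window.size == 0:                     # no history yet, skip
            continue
        threshold = window.mean() + gamma * window.std()
        if nll[t] > threshold:                   # -log P(x_t | x_<t; theta) > T
            boundaries.append(t)
    return boundaries

# Toy usage with synthetic surprise values standing in for model NLLs.
rng = np.random.default_rng(0)
print(surprise_boundaries(rng.gamma(2.0, 1.5, size=1000), tau=128, gamma=1.5))
```

In the paper this check runs online during inference, so boundaries are placed as the model streams through the context.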

These boundaries are then refined to maximize cohesion within memory units and separation of memory content across them. The refinement leverages graph-theoretic metrics, treating the similarity between attention keys as a weighted adjacency matrix.

Two graph-clustering metrics are employed: modularity and conductance; a code sketch of both follows the definitions below. Modularity, $f_M(A^h,\mathcal{B})$, is defined as:

$$f_M(A^h,\mathcal{B}) = \frac{1}{4m} \sum_{i,j} \left[A^h_{ij} - \frac{\sum_{i}A^h_{ij} \cdot \sum_{j}A^h_{ij}}{2m}\right] \delta(c_i, c_j),$$

where:

  • $A^h$ is the adjacency matrix for attention head $h$,
  • $\mathcal{B}$ is the set of event boundaries,
  • $m$ is the total edge weight in the graph,
  • $c_i$ is the community to which node $i$ is assigned,
  • $\delta(c_i, c_j)$ is the Kronecker delta function.

Conductance, $f_C(A^h,\mathcal{B})$, is defined as:

$$f_C(A^h,\mathcal{B}) = \min_{S \subset V}\frac{\sum_{i \in S,\, j \notin S} A^h_{ij}}{\min\left(\text{vol}(S), \text{vol}(V \setminus S)\right)},$$

where:

  • $S = \{b_i, b_i+1, \dots, b_{i+1}\}$ is a subset of all nodes $V = \{b_1, b_1+1, \dots, b_k\}$ in the induced graph, with $b_i \in \mathcal{B}$,
  • $\text{vol}(S) = \sum_{i\in S,\, j\in S}A_{ij}$,
  • $\text{vol}(V\setminus S) = \sum_{i\notin S,\, j\notin S}A_{ij}$.
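
To make the refinement objective concrete, here is an illustrative scoring of a candidate boundary set against a key-similarity adjacency matrix. This is a sketch, not the paper's code: it uses the common $\tfrac{1}{2m}$ modularity normalisation (the formula above uses $\tfrac{1}{4m}$) and evaluates conductance for a single contiguous segment, with vol defined as in the bullets above.

```python
import numpy as np

def modularity(A, boundaries):
    """Modularity of a partition of the token graph into contiguous segments.

    A          : symmetric matrix of key similarities for one attention head.
    boundaries : sorted segment start indices, e.g. [0, 17, 42].
    """
    two_m = A.sum()                               # total edge weight (counted twice)
    k = A.sum(axis=1)                             # weighted degree of each node
    labels = np.zeros(len(A), dtype=int)
    for c, b in enumerate(boundaries):
        labels[b:] = c                            # each segment is one community
    same = labels[:, None] == labels[None, :]     # delta(c_i, c_j)
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

def conductance(A, start, end):
    """Conductance of one contiguous segment [start, end), vol as defined above."""
    inside = np.zeros(len(A), dtype=bool)
    inside[start:end] = True
    cut = A[inside][:, ~inside].sum()             # similarity crossing the boundary
    vol_s = A[inside][:, inside].sum()            # intra-segment similarity
    vol_rest = A[~inside][:, ~inside].sum()       # similarity within the rest of the graph
    return float(cut / min(vol_s, vol_rest))
```

Boundary refinement can then be framed as locally shifting each surprise-induced boundary to improve such a metric for the segments it separates.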

For memory retrieval, the approach integrates similarity-based retrieval with mechanisms that facilitate temporal contiguity and asymmetry effects. The memory retrieval process employs a two-stage mechanism: retrieving $k_s$ events using k-NN search based on dot-product similarity to form a similarity buffer, and enqueuing neighboring events into a contiguity buffer of size $k_c$.
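
A schematic of this two-stage retrieval (illustrative only: `event_reprs`, the single-vector event representation, and the exact contiguity policy are assumptions rather than the paper's interface) might look like:

```python
import numpy as np

def retrieve_events(query, event_reprs, k_s=4, k_c=2):
    """Two-stage retrieval: a similarity buffer plus a temporal-contiguity buffer.

    query       : 1-D query vector (e.g. a summary of the current local context).
    event_reprs : (num_events, d) representative vectors, stored in temporal order.
    """
    scores = event_reprs @ query
    similarity_buffer = list(np.argsort(-scores)[:k_s])       # k-NN by dot product

    contiguity_buffer = []
    for idx in similarity_buffer:                              # enqueue temporal neighbours
        for nb in (idx - 1, idx + 1):
            if 0 <= nb < len(event_reprs) and nb not in similarity_buffer:
                contiguity_buffer.append(nb)
    return similarity_buffer, contiguity_buffer[:k_c]          # cap at k_c events
```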

The model was evaluated on the LongBench dataset, and it outperformed the state-of-the-art InfLLM model, achieving an overall relative improvement of 4.3%. Notably, the model achieved a 33% improvement on the PassageRetrieval task, which requires accurate recall of detailed information from a large context.

The paper also presents an analysis of human-annotated podcast transcripts, which reveals strong correlations between the event structure found in the LLM's attention heads and human-perceived event structures. Segmentation quality and correlation with human data were compared using modularity and conductance, as well as the ratio between intra- and inter-community similarity (a short sketch follows the definitions below), defined as:

$$\text{I/IS} \equiv \frac{\text{intra}}{\text{inter}},$$

where:

  • $\text{intra} = \sum_{i\in S,\, j\in S}A_{ij}$,
  • $\text{inter} = \sum_{i\in S,\, j\notin S}A_{ij}$,
  • $S$ represents the tokens in a community.
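
For completeness, a direct transcription of this ratio for one community (a minimal sketch; assumes NumPy and that the community has at least some cross-community similarity):

```python
import numpy as np

def intra_inter_ratio(A, members):
    """Ratio of within-community to cross-community similarity for one community."""
    inside = np.zeros(len(A), dtype=bool)
    inside[list(members)] = True
    intra = A[inside][:, inside].sum()     # similarity among the community's tokens
    inter = A[inside][:, ~inside].sum()    # similarity to tokens outside the community
    return float(intra / inter)
```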

The results indicate that surprise-only segmentation already performs comparably to human segmentation on these metrics, that adding the refinement algorithm improves performance further, and that surprise-based methods identify event boundaries closest to those perceived by humans.

The authors compared their method with other long-context models, including methods that address the limitations of softmax attention, positional encodings, and computational efficiency. Unlike InfLLM, which segments the context into fixed-size memory units, EM-LLM dynamically determines token groupings, similar to human memory formation.

The authors suggest that future research could extend the surprise-based segmentation and boundary refinement to operate independently at each layer of the Transformer, and investigate how EM-LLM could be used to enable imagination and future thinking.

Authors (7)
  1. Zafeirios Fountas (16 papers)
  2. Martin A Benfeghoul (1 paper)
  3. Adnan Oomerjee (2 papers)
  4. Fenia Christopoulou (10 papers)
  5. Gerasimos Lampouras (22 papers)
  6. Haitham Bou-Ammar (30 papers)
  7. Jun Wang (990 papers)
Citations (10)