BABILong Benchmark
- BABILong is a comprehensive evaluation suite that assesses LLMs and memory-augmented architectures by testing multi-hop inference over ultra-long, noisy texts.
- It scales context lengths from thousands to 50 million tokens, enforcing exact-match evaluation on fact retrieval and compositional reasoning tasks.
- Empirical results illustrate that memory-augmented models, particularly ARMT, maintain high accuracy compared to standard and retrieval-augmented transformers under extreme context conditions.
The BABILong benchmark is a generative evaluation suite explicitly designed to assess and stress-test the ability of LLMs and memory-augmented neural architectures to retrieve and reason over a small number of "needle" facts embedded within extremely long, noisy natural language contexts. By scaling input sequences to tens of millions of tokens, BABILong probes both the effective context utilization and multi-hop reasoning capabilities of current and future neural models, revealing the limitations of standard attention-based approaches and the advances enabled by recurrent memory mechanisms (Kuratov et al., 2024).
1. Formal Definition and Dataset Construction
Each BABILong sample comprises an L-token "haystack", assembled by interleaving supporting facts, instantiated from bAbI task templates, with background text drawn from the PG-19 corpus. For each question generated from these facts, the model is tasked with outputting the templated gold answer. The background sentences, which occupy the vast majority of tokens, are grammatically coherent and stylistically indistinguishable from the embedded facts, enforcing realistic retrieval difficulty at scale. The number of supporting facts is determined by the task (one to three in QA1–QA3, and up to three-argument chaining in QA5 and beyond) (Kuratov et al., 2024, Rodkin et al., 2024).
Sequence lengths are scaled from a few thousand up to 50 million tokens, with pre-defined evaluation splits at fixed lengths (e.g., 64K, 1M, and 10M tokens), and are extensible to arbitrarily longer contexts as needed (Kuratov et al., 2024, Rodkin et al., 2024). Each sample's answer is strictly templated, facilitating exact-match evaluation.
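A minimal sketch of this construction follows. The function name, whitespace-based token counting, and uniform random insertion positions are illustrative assumptions, not the benchmark's exact pipeline:

```python
import random

def build_haystack(facts, question, background_sentences, target_tokens):
    """Assemble a BABILong-style sample: scatter a few 'needle' facts
    at random positions inside a long stream of background sentences
    (e.g., drawn from PG-19). Whitespace splitting stands in for a
    real tokenizer here."""
    context, n_tokens = [], 0
    for sentence in background_sentences:
        if n_tokens >= target_tokens:
            break
        context.append(sentence)
        n_tokens += len(sentence.split())
    # Insert each supporting fact at a random position in the background.
    for fact in facts:
        context.insert(random.randrange(len(context) + 1), fact)
    return " ".join(context), question
```

Scaling `target_tokens` is then all that is needed to generate evaluation splits at arbitrary context lengths, since the distractor stream and fact templates are independent.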
2. Task Typology and Reasoning Complexity
BABILong builds directly on the 20 bAbI QA tasks, grouped by distinct inferential and retrieval demands. Key subtypes include:
- Fact-chaining (QA1–QA3): Single-fact (QA1), two-hop (QA2), and three-hop (QA3) supporting-fact composition.
- Deduction with relations (QA4–QA5): Two-argument (QA4) and three-argument (QA5) relational tasks.
- Set, counting, boolean, negation, and temporal (QA6–QA20): Counting, list/set membership, coreference, multi-step induction/deduction, negation, uncertainty, path-finding, and temporal chaining.
Each instance requires the model to locate and correctly combine supporting sentences distributed within the haystack (see Table 1 for representative examples and reasoning types) (Kuratov et al., 2024).
| Task Group | ID Range | Example Reasoning Type |
|---|---|---|
| Fact chaining | QA1–QA3 | Multi-hop retrieval |
| Deduction/relations | QA4–QA5 | 2–3 argument reasoning |
| Set/count/negation/etc. | QA6–QA20 | Boolean, counting, temporal |
Table 1: BABILong reasoning family overview.
BABILong's challenge is compounded by the dominance of distractor tokens, enabling controlled measurement of both simple retrieval (QA1) and compositional, multi-hop reasoning (up to QA20) as context size increases (Kuratov et al., 2024).
3. Evaluation Protocol and Metrics
The principal performance metric is exact-match accuracy. For a test set $D_L$ of samples at context length $L$,

$$\mathrm{Acc}(L) = \frac{1}{|D_L|} \sum_{i=1}^{|D_L|} \mathbb{1}\!\left[\hat{y}_i = y_i\right],$$

where $\hat{y}_i$ is the model's prediction and $y_i$ the templated gold answer (Kuratov et al., 2024). For tasks requiring multi-token answers (e.g., list/set output), standard F1 is used. Retrieval-augmented setups are additionally evaluated with Recall@k, measuring whether all requisite facts are present in the top-k retrieved chunks.
Effective context utilization is defined as the largest context length $L$ at which accuracy remains above a fixed threshold; below that threshold, performance at length $L$ is considered "failed", establishing concrete cutoffs for model breakdown under extreme context scaling (Kuratov et al., 2024).
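The two metrics above can be sketched as follows; the 0.85 threshold and the normalization choices are illustrative assumptions, not the benchmark's official values:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the templated gold
    answer, after trivial case/whitespace normalization."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

def effective_context(acc_by_length, threshold=0.85):
    """Largest context length whose exact-match accuracy stays at or
    above `threshold` (illustrative cutoff). Returns None if no length
    passes, i.e., the model fails at every evaluated scale."""
    passing = [L for L, acc in acc_by_length.items() if acc >= threshold]
    return max(passing) if passing else None
```

Because answers are strictly templated, this string-level comparison suffices; no semantic matching or LLM-as-judge step is required.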
Evaluation contexts span both zero-shot and fine-tuned settings, covering:
- Off-the-shelf open LLMs (LLaMA-3, Mistral, etc.)
- Closed LLMs (GPT-4-Turbo)
- Long-context fine-tuned LLMs
- Context-extension (window-shifting, compression)
- Retrieval-Augmented Generation (RAG)
- Memory-augmented architectures (RMT, Mamba, ARMT) (Kuratov et al., 2024, Rodkin et al., 2024)
4. Empirical Results: Model Performance and Scalability
BABILong exposes a pronounced gap in long-context reasoning among model families. The key findings are:
- Standard LLMs maintain accuracy on single-fact QA1 only up to a fraction of their advertised context windows; tasks with more supporting facts degrade much earlier, with three-hop QA3 typically collapsing at far shorter context lengths (Kuratov et al., 2024).
- Retrieval-Augmented Generation (RAG) architectures, despite high recall of the ground-truth facts, do not translate retrieval into end-to-end QA accuracy: performance stays relatively flat as context length increases, and on multi-hop tasks RAG approaches random-guessing levels, attributed to the inability to retrieve all and only the relevant chunks (Kuratov et al., 2024).
- Memory-augmented transformers (RMT, Mamba, ARMT) generalize substantially better. RMT-137M and Mamba-130M models, trained on 16K-token segments, sustain high accuracy on QA1–QA5 well beyond their training lengths, and extrapolation to millions and, for ARMT, tens of millions of tokens incurs only a marginal performance drop on QA1 (Kuratov et al., 2024, Rodkin et al., 2024).
- The Associative Recurrent Memory Transformer (ARMT) sets the reported state of the art, achieving a best-run QA1 accuracy of 79.9% at 50 million tokens, with compositional reasoning on QA1–QA3 still above random at extreme lengths (see Table 2) (Rodkin et al., 2024).
| Model | QA1 @ 64K | QA1 @ 1M | QA1 @ 10M | QA1 @ 50M |
|---|---|---|---|---|
| ARMT-145M | 100.0 | 98.5 | 89.4 | 79.9* |
| RMT-137M | 100.0 | 76.4 | – | – |
| Mamba-130M | 97.2 | – | – | – |
Table 2: BABILong QA1 exact-match results (best run; * denotes best overall performance) (Rodkin et al., 2024).
A plausible implication is that recurrent memoryāespecially with associative/fast-weights updatesāenables neural models to retain and retrieve orders of magnitude more information than standard sequential or retrieval-augmented transformers.
5. Architectural Insights and Ablation Findings
Memory architectures with segment-level recurrence (RMT) or an additional associative fast-weight memory (ARMT) account for the strongest BABILong results at ultra-long context. Key components:
- Segmented processing: The context is split into disjoint fixed-length segments; local self-attention is performed within each segment.
- Memory tokens and fast weights: Per-layer memory summary tokens are associated with quasi-linear key–value matrices and normalization vectors. New information is integrated via delta-rule updates, leveraging the DPFP-3 nonlinearity (Rodkin et al., 2024).
- Gamma correction: Proper normalization of the accumulated normalization vector is crucial to avoid catastrophic forgetting as the number of memory writes grows beyond the training regime (Rodkin et al., 2024).
In ablations, associative updates confer significant capacity gains over plain recurrence: replacing ARMT's associative update with plain parallel-memory storage yields no comparable improvement, and gamma-corrected normalization is necessary for durable memory over hundreds of updates (Rodkin et al., 2024).
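A minimal sketch of a delta-rule fast-weight write of this kind is given below. It deliberately omits ARMT's DPFP-3 feature map and gamma-corrected normalization, uses keys directly (assumed roughly orthogonal), and the class and method names are illustrative:

```python
import numpy as np

class DeltaRuleMemory:
    """Simplified fast-weight associative memory with a delta-rule
    write, in the spirit of ARMT's updates (Rodkin et al., 2024).
    This sketch skips the DPFP-3 nonlinearity and normalization
    vector; it illustrates only the delta rule itself."""

    def __init__(self, d_key, d_val):
        self.A = np.zeros((d_val, d_key))  # fast-weight association matrix

    def write(self, k, v):
        v_old = self.A @ k  # value currently bound to key k (zero if unseen)
        # Delta rule: replace the old binding instead of summing into it,
        # which prevents interference when a key is rewritten.
        self.A += np.outer(v - v_old, k) / (k @ k)

    def read(self, k):
        return self.A @ k
```

The key property, as opposed to a plain additive (Hebbian) write, is that rewriting an existing key overwrites its binding rather than superimposing values, which is what lets the memory survive many writes without saturating.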
6. Limitations, Generalizability, and Future Directions
BABILong's generative protocol (random insertion of facts into natural-language background) is inherently leak-proof and scales to arbitrary context lengths. However, BABILong tasks rely on algorithmically simple bAbI-style reasoning, with background text restricted to PG-19 and, in some settings, Wikipedia. Application to domains with differing fact/noise distributions (scientific, legal) or more complex templates remains an open avenue (Kuratov et al., 2024).
A principal limitation of segment-level recurrent models, including ARMT, is the inherent sequentiality of recurrence at inference, making them less efficient than fully parallel SSM-based architectures for medium-length contexts. Furthermore, even memory-augmented transformers tend to bias toward recent segments, constraining extrapolation for certain language modeling tasks; cross-entropy on real text for ARMT matches conventional RMT despite theoretical gains in associative capacity (Rodkin et al., 2024).
A plausible implication is that retrieval and recurrence must be further integratedāpotentially via trainable hybrid modules or adapter-style techniquesāto match or surpass human-scale long-context utilization in future neural systems.
7. Availability and Resources
BABILong data, code, and evaluation resources are open-source under the Apache 2.0 license, supporting full reproducibility and extensibility to further context scales and reasoning types. Datasets and code reside at:
- GitHub: https://github.com/booydar/babilong
- HuggingFace: https://huggingface.co/datasets/RMT-team/babilong
All experimental details, implementation curricula, prompts, and model predictions are provided to facilitate benchmarking and extension (Kuratov et al., 2024).