
BABILong Benchmark

Updated 3 February 2026
  • BABILong Benchmark is a comprehensive evaluation suite that assesses LLMs and memory-augmented architectures by testing multi-hop inference over ultra-long, noisy texts.
  • It scales context lengths from thousands to 50 million tokens, enforcing exact-match evaluation on fact retrieval and compositional reasoning tasks.
  • Empirical results show that memory-augmented models, particularly ARMT, maintain high accuracy under extreme context conditions where standard and retrieval-augmented transformers degrade.

The BABILong benchmark is a generative evaluation suite explicitly designed to assess and stress-test the ability of LLMs and memory-augmented neural architectures to retrieve and reason over a small number of "needle" facts embedded within extremely long, noisy natural language contexts. By scaling input sequences to tens of millions of tokens, BABILong probes both the effective context utilization and multi-hop reasoning capabilities of current and future neural models, revealing the limitations of standard attention-based approaches and the advances enabled by recurrent memory mechanisms (Kuratov et al., 2024).

1. Formal Definition and Dataset Construction

Each BABILong sample comprises an L-token "haystack" X, assembled by interleaving k supporting facts S = (s_1, …, s_k)—instantiated from bAbI task templates—within background text drawn from the PG-19 corpus. For a given question q generated from S, the model is tasked to output a gold answer y. The background sentences, occupying the vast majority of tokens, are grammatically coherent and stylistically indistinguishable from the embedded facts, enforcing realistic retrieval difficulty at scale. The number of supporting facts k is determined by the task (1 to 3 in QA1–QA3, up to three-argument chaining in QA5 and beyond) (Kuratov et al., 2024, Rodkin et al., 2024).

Sequence lengths L are scaled from thousands up to L_max ≈ 50×10^6 tokens, with pre-defined evaluation splits at L ∈ {4K, 32K, 128K, 500K, 1M, 10M, 50M}, and are extensible to any longer L as needed (Kuratov et al., 2024, Rodkin et al., 2024). Each sample's answer is strictly templated, facilitating exact-match evaluation.
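
The interleaving step can be sketched in a few lines. The helper below is a hypothetical illustration (not the official generator): it fills a haystack with background sentences up to a token budget, then scatters the supporting facts at random positions while preserving their relative order, mirroring bAbI's temporal ordering of facts:

```python
import random

def build_babilong_sample(facts, question, answer, background_sentences, target_len_tokens):
    """Toy sketch of BABILong sample assembly: embed k supporting facts
    at random positions inside PG-19-style background text."""
    # Fill the haystack with background sentences up to the token budget
    # (whitespace tokenization here, for illustration only).
    haystack, n_tokens = [], 0
    for sent in background_sentences:
        if n_tokens >= target_len_tokens:
            break
        haystack.append(sent)
        n_tokens += len(sent.split())
    # Insert each fact at a random position; sorting the positions and
    # offsetting each insertion keeps the facts in their original order.
    positions = sorted(random.sample(range(len(haystack) + 1), len(facts)))
    for offset, (pos, fact) in enumerate(zip(positions, facts)):
        haystack.insert(pos + offset, fact)
    return {"input": " ".join(haystack), "question": question, "target": answer}
```

Because insertion positions are random and the background is natural prose, a model cannot exploit positional shortcuts to locate the facts.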

2. Task Typology and Reasoning Complexity

BABILong builds directly on the 20 bAbI QA tasks, grouped by distinct inferential and retrieval demands. Key subtypes include:

  • Fact-chaining (QA1–QA3): Single (QA1), two-hop (QA2), and three-hop (QA3) supporting fact composition.
  • Deduction with relations (QA4–QA5): Two-argument (QA4) and three-argument (QA5) relational tasks.
  • Set, counting, boolean, negation, and temporal (QA6–QA20): Counting, list/set membership, coreference, multi-step induction/deduction, negation, uncertainty, path-finding, and temporal chaining.

Each instance requires the model to locate and correctly combine supporting sentences distributed within the haystack (see Table 1 for representative examples and reasoning types) (Kuratov et al., 2024).

| Task Group | ID Range | Example Reasoning Type |
|---|---|---|
| Fact chaining | QA1–QA3 | Multi-hop retrieval |
| Deduction/relations | QA4–QA5 | 2–3 argument reasoning |
| Set/count/negation/etc. | QA6–QA20 | Boolean, counting, temporal |

Table 1: BABILong reasoning family overview.

BABILong's challenge is compounded by the dominance of distractor tokens, enabling controlled measurement of both simple retrieval (QA1) and compositional, multi-hop reasoning (up to QA20) as context size increases (Kuratov et al., 2024).
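
For concreteness, a QA2-style two-hop instance looks like the following (an illustrative example in the bAbI template style, not drawn from the released data):

```python
# Illustrative two-hop (QA2-style) instance. In the actual benchmark these
# two facts are buried among thousands to millions of distractor tokens.
sample = {
    "facts": [
        "John travelled to the hallway.",   # places John at a location
        "John picked up the football.",     # binds the football to John
    ],
    "question": "Where is the football?",
    # Answering requires chaining both facts: football -> John -> hallway.
    "answer": "hallway",
}
```

Neither fact alone suffices; the model must retrieve both and compose them, which is exactly what degrades first as the haystack grows.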

3. Evaluation Protocol and Metrics

The principal performance metric is exact-match accuracy, defined as

\mathrm{Acc}(L) = \frac{1}{|D_L|} \sum_{(X,q,y) \in D_L} \mathbf{1}[M(X,q) = y]

for a test set D_L at context length L (Kuratov et al., 2024). For tasks requiring multi-token answers (e.g., list/set output), standard F1 is used. Retrieval-augmented setups are additionally evaluated with Recall@k, measuring whether all requisite facts are present in the top-k retrieved chunks.

Effective context utilization L_{0.85} is defined as the largest L for which Acc(L) ≥ 0.85. Performance is considered "failed" for Acc(L) < 0.3, establishing concrete cutoffs for model breakdown under extreme context scaling (Kuratov et al., 2024).
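
Both metrics are straightforward to implement; a minimal sketch (function names are ours, not from the benchmark's codebase):

```python
def exact_match_accuracy(preds, golds):
    """Acc(L): fraction of predictions exactly matching the templated answer."""
    assert len(preds) == len(golds) and golds
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def effective_context(acc_by_length, threshold=0.85):
    """L_0.85: largest context length whose accuracy clears the threshold.

    `acc_by_length` maps context length in tokens -> Acc(L);
    returns None if no split reaches the threshold."""
    passing = [L for L, acc in acc_by_length.items() if acc >= threshold]
    return max(passing) if passing else None

# Example: a model that degrades past 32K tokens has L_0.85 = 32_000,
# and its 1M split (Acc < 0.3) counts as failed.
# effective_context({4_000: 0.98, 32_000: 0.91, 128_000: 0.74, 1_000_000: 0.22})
```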

Evaluation spans both zero-shot (prompted) and fine-tuned settings.

4. Empirical Results: Model Performance and Scalability

BABILong exposes a pronounced gap in long-context reasoning among model families. The key findings are:

  • Standard LLMs maintain ≥85% accuracy on QA1 for L ≤ 4K; performance for k > 1 supporting facts degrades much earlier, with accuracy on QA3 (three-hop) collapsing at substantially shorter context lengths (Kuratov et al., 2024).
  • Retrieval-Augmented Generation (RAG) architectures, despite high Recall@k for ground-truth facts, do not improve end-to-end QA accuracy beyond ≈60%, with performance relatively flat as L increases. For multi-hop tasks, RAG approaches random-guessing levels, attributed to the inability to retrieve all and only the relevant chunks (Kuratov et al., 2024).
  • Memory-augmented transformers (RMT, Mamba, ARMT) generalize substantially better. RMT-137M and Mamba-130M models, trained on 16K-token segments, sustain high (>90%) accuracy on QA1–QA5 well beyond their training lengths. Extrapolation to L = 1M, 10M, and 11M tokens is achieved with only a marginal performance drop on QA1 (Kuratov et al., 2024, Rodkin et al., 2024).
  • The Associative Recurrent Memory Transformer (ARMT) sets the reported state of the art, achieving a best-run QA1 accuracy of 79.9% at L = 50M, with compositional reasoning on QA1–QA3 remaining above random at extreme lengths (see Table 2) (Rodkin et al., 2024).
| Model | QA1 @ 64K | QA1 @ 1M | QA1 @ 10M | QA1 @ 50M |
|---|---|---|---|---|
| ARMT-145M | 100.0 | 98.5 | 89.4 | 79.9* |
| RMT-137M | 100.0 | 76.4 | — | — |
| Mamba-130M | 97.2 | — | — | — |

Table 2: BABILong QA1 exact-match results (best run; * denotes best overall performance) (Rodkin et al., 2024).

A plausible implication is that recurrent memory—especially with associative/fast-weights updates—enables neural models to retain and retrieve orders of magnitude more information than standard sequential or retrieval-augmented transformers.

5. Architectural Insights and Ablation Findings

Memory architectures with segment-level recurrence (RMT) or additional associative fast-weights memory (ARMT) underlie the best results on BABILong's ultra-long contexts. Key components:

  • Segmented processing: Context is split into disjoint segments (e.g., S = 512 tokens); local self-attention is performed within segments.
  • Memory tokens and fast-weights: Memory summary tokens M_s^ℓ per layer are associated with quasi-linear key–value matrices A_s^ℓ and normalization vectors z_s^ℓ. New information is integrated via delta-rule updates, leveraging the DPFP-3 nonlinearity (Rodkin et al., 2024).
  • Gamma-correction: Proper normalization of the z-vector is crucial to avoid catastrophic forgetting as the number of memory writes increases beyond the training scope (Rodkin et al., 2024).
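
The delta-rule write can be sketched as follows. This is a simplified toy, not ARMT's implementation: we substitute an ELU+1 feature map for the DPFP-3 expansion and omit the gamma-correction that keeps the normalizer bounded across many writes:

```python
import numpy as np

def phi(x):
    """Stand-in positive feature map (ELU + 1); ARMT uses DPFP-3 instead."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def associative_write(A, z, key, value):
    """One delta-rule fast-weight update: retrieve the value currently
    bound to `key`, then write only the residual, so a repeated write to
    the same key replaces the stored association instead of adding to it."""
    k = phi(key)
    v_old = A @ k / (z @ k)             # current association for this key
    A = A + np.outer(value - v_old, k)  # delta-rule write of the residual
    z = z + k                           # normalizer grows with each write;
                                        # gamma-correction rescales this in ARMT
    return A, z

def associative_read(A, z, key):
    k = phi(key)
    return A @ k / (z @ k)
```

Initializing z to a small positive vector avoids division by zero before the first write; without gamma-correction the growing normalizer attenuates retrievals after many writes, which is exactly the failure mode the ablations identify.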

In ablations, associative updates confer significant capacity gains over plain recurrence. Replacing ARMT's associative update with plain parallel-memory storage yields no improvement; gamma-corrected normalization is necessary for durable memory over hundreds of updates (Rodkin et al., 2024).

6. Limitations, Generalizability, and Future Directions

BABILong's generative protocol (random insertion of facts into natural-language background) is inherently leak-proof and scales to arbitrary L. However, BABILong tasks rely on algorithmically simple bAbI-style reasoning, with background text restricted to PG-19 and, in some settings, Wikipedia. Application to domains with differing fact/noise distributions (scientific, legal) or more complex templates remains an open avenue (Kuratov et al., 2024).

A principal limitation of segment-level recurrent models, including ARMT, is the inherent sequentiality of recurrence at inference, making them less efficient than fully parallel SSM-based architectures for medium-length contexts. Furthermore, even memory-augmented transformers tend to bias toward recent segments, constraining extrapolation for certain language modeling tasks; cross-entropy on real text for ARMT matches conventional RMT despite theoretical gains in associative capacity (Rodkin et al., 2024).

A plausible implication is that retrieval and recurrence must be further integrated—potentially via trainable hybrid modules or adapter-style techniques—to match or surpass human-scale long-context utilization in future neural systems.

7. Availability and Resources

BABILong data, code, and evaluation resources are open-source under the Apache 2.0 license, supporting full reproducibility and extensibility to further context scales and reasoning types. Datasets and code are publicly available.

All experimental details, implementation curricula, prompts, and model predictions are provided to facilitate benchmarking and extension (Kuratov et al., 2024).
