NeedleBench: Long-Context Retrieval & Reasoning
- NeedleBench is a synthetic framework for assessing bilingual (English and Chinese) LLM performance in long-context retrieval and reasoning tasks.
- It embeds controlled 'needle' facts within extensive 'haystack' texts at variable depths and context lengths to systematically test model limits.
- Empirical results reveal under-thinking issues and performance degradation in multi-fact retrieval and multi-step reasoning, prompting architectural improvements.
NeedleBench is a synthetic framework designed to rigorously assess the retrieval and reasoning performance of LLMs in bilingual (English and Chinese) long-context tasks with adaptively controlled context lengths and key fact embedding depths. The framework addresses limitations of prior benchmarks by embedding contrived ("needle") data points at precise locations within large volumes of filler (“haystack”) text drawn from curated corpora, enabling fine-grained evaluation of both simple retrieval and complex multi-step reasoning in information-sparse and information-dense contexts (Li et al., 2024).
1. Motivation and Benchmark Objectives
As LLM context windows expand into hundreds of thousands or millions of tokens, conventional “needle-in-a-haystack” tests—focused on retrieving a single fact hidden within filler—fail to reflect the requirements of practical long-context applications, which typically demand both the retrieval of multiple facts and multi-step logical reasoning across dispersed information. Existing benchmarks such as LongBench, PassKey, InfiniteBench, and NIAH do not systematically vary needle depth, the number of facts to retrieve, or the logical complexity required for reasoning. NeedleBench addresses these deficiencies through a suite of tasks structured to probe the boundaries of large-context understanding, scaling both retrieval and reasoning difficulty and assessing granular breakdowns in model performance (Li et al., 2024).
2. Synthetic Dataset Generation and Embedding Methodology
NeedleBench datasets are constructed from English haystack fillers sourced from PaulGrahamEssays and Chinese fillers from ChineseDomainModelingEval. Key facts (“needles”) are abstract, contrived statements guaranteed to have no overlap with pretraining data, and designed explicitly for controlled evaluation. The framework generates datasets for:
- Single-fact retrieval (S-RT): One needle is embedded at a specified depth.
- Multi-fact retrieval (M-RT): M distinct needles are embedded at varied depths.
- Multi-fact reasoning (M-RS): Uses derivations adapted from HotpotQA/ℛ⁴ℂ, translated as needed.
- Ancestral Trace Challenge (ATC): Builds continuous logical inference chains up to 19 steps with contrived familial relationships.
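The ATC chain construction described above can be sketched in a few lines. This is a hypothetical illustration: the helper name `build_atc_chain`, the two-relation vocabulary, and the "eldest ancestor" question are assumptions, not the released generator.

```python
# Hypothetical sketch of ATC-style chain construction: contrived names linked
# by first-order kinship relations, yielding a k-step inference chain whose
# answer is the person at the top of the chain.
import random

RELATIONS = ["father", "mother"]  # first-order relations, per the ATC description

def build_atc_chain(names: list, steps: int, seed: int = 0) -> tuple:
    """Return (statements, answer) for a chain of `steps` kinship facts."""
    rng = random.Random(seed)
    chain = []
    for i in range(steps):
        rel = rng.choice(RELATIONS)
        # each statement links person i to person i+1, e.g. "A is B's father."
        chain.append(f"{names[i]} is {names[i + 1]}'s {rel}.")
    return chain, names[0]
```

Shuffling the statements before prompting forces the model to reassemble the chain rather than read it in order.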
Context length is set in intervals from 4K up to 1M tokens, structured into length-intervals and depth-intervals. The length grid and depth grid for needle embedding are defined as:

$$l_i = \frac{i}{N_L} L_{\max}, \quad i = 1, \dots, N_L, \qquad d_{i,j} = \frac{j-1}{N_D - 1}\, l_i, \quad j = 1, \dots, N_D.$$

Each needle's depth is sampled uniformly from the grid $\{d_{i,1}, \dots, d_{i,N_D}\}$, so placement is uniform over the $N_D$ depth positions for a given length $l_i$. A language flag $\ell \in \{\text{EN}, \text{ZH}\}$ selects the filler source and the needles.
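The grid construction and needle embedding above can be sketched as follows. The uniform spacing and the helper names are assumptions for illustration; the released OpenCompass configs may use different spacings.

```python
# Sketch of the length/depth grids and needle insertion (assumed uniform grids).

def length_grid(l_max: int, n_l: int) -> list:
    """Evenly spaced context lengths l_i = (i / N_L) * L_max."""
    return [round(i * l_max / n_l) for i in range(1, n_l + 1)]

def depth_grid(l_i: int, n_d: int) -> list:
    """Token offsets d_{i,j} = ((j - 1) / (N_D - 1)) * l_i."""
    return [round((j - 1) * l_i / (n_d - 1)) for j in range(1, n_d + 1)]

def insert_needle(haystack_tokens: list, needle_tokens: list, depth: int) -> list:
    """Splice the needle into the filler text at the given token offset."""
    return haystack_tokens[:depth] + needle_tokens + haystack_tokens[depth:]
```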
3. Task Scenarios and Challenge Design
3.1 Information-Sparse Retrieval Tasks
- S-RT (Single Retrieval): One needle placed; prompt requests precisely the embedded fact.
- M-RT (Multi Retrieval): M needles at multiple depths; prompt requests ordered retrieval of all M facts.
Metrics focus on recall accuracy for S-RT, and fraction of correctly retrieved needles for M-RT.
3.2 Information-Dense Reasoning: Ancestral Trace Challenge
ATC constructs a multi-step inference task with a chain of first-order relationships (e.g., "A is B's father, B is C's mother, ..."), varying chain length from 2 to 19 steps. Prompts utilize randomized multiple choice (via Circular-Eval) to mitigate guessing, and each instance is scored by the weighted ATC formula

$$\text{ATC} = \frac{\sum_{k} w_k \, s_k}{\sum_{k} w_k},$$

with $s_k$ denoting performance on the distinct problems at step count $k$ and $w_k$ the corresponding weight.
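Circular-Eval style scoring can be sketched minimally: a question counts as correct only if the model answers correctly under every rotation of the option order. Here `ask_model` is a hypothetical callable standing in for a real model call; the actual benchmark uses OpenCompass's evaluator.

```python
# Minimal sketch of Circular-Eval scoring for ATC multiple choice.

def rotations(options: list) -> list:
    """All cyclic rotations of the option list."""
    return [options[k:] + options[:k] for k in range(len(options))]

def circular_correct(question: str, options: list, answer: str, ask_model) -> bool:
    """Correct only if the model picks `answer` under every option ordering."""
    return all(ask_model(question, opts) == answer for opts in rotations(options))
```

A model that simply picks the first option scores zero here, which is exactly the guessing behavior the rotation is meant to neutralize.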
4. Adaptive Context Length and Depth Analysis
NeedleBench exhaustively tests length and depth combinations, generating a comprehensive two-dimensional grid of model performance by context length and fact depth. The standard pseudocode is:
```
for each L_max in {4K, 8K, 32K, 128K, 200K, 1M}:
    compute {l_i}, i = 1..N_L
    compute {d_{i,j}}, j = 1..N_D
    for each (l_i, d_{i,j}) pair:
        generate haystack of length ≈ l_i
        insert needle(s) at depth d_{i,j}
        query model → record output
```
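The sweep above can be made runnable with a stub in place of the model: `query_model` is a hypothetical placeholder, and plain token lists stand in for real tokenization.

```python
# Runnable sketch of the length x depth sweep; returns an N_L x N_D score grid.

def sweep(l_max: int, n_l: int, n_d: int, query_model) -> list:
    """One score per (length, depth) cell, using a caller-supplied model stub."""
    grid = []
    for i in range(1, n_l + 1):
        l_i = round(i * l_max / n_l)          # context length for this row
        row = []
        for j in range(1, n_d + 1):
            depth = round((j - 1) * l_i / (n_d - 1))  # needle offset for this cell
            haystack = ["filler"] * l_i
            context = haystack[:depth] + ["NEEDLE"] + haystack[depth:]
            row.append(query_model(context, depth))
        grid.append(row)
    return grid
```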
This mechanism yields a heatmap detailing model breakdown points for retrieval and reasoning as complexity scales.
5. Evaluation Protocols and Under-Thinking Phenomenon
NeedleBench employs rigorous, task-specific metrics:
- Levenshtein-based Soft Score ($0$–$100$): For retrieval,

$$\text{Score}(a, r) = 100 \left(1 - \frac{\mathrm{lev}(a, r)}{\max(|a|, |r|)}\right),$$

where $a$ = answer tokens, $r$ = reference tokens, $K$ = core keywords of the reference over which matching is performed, $\mathrm{lev}(\cdot, \cdot)$ = Levenshtein distance, and $|\cdot|$ = token-sequence length.
- Precision / Recall / F1 (M-RT tasks): $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{M}$, $F_1 = \frac{2PR}{P + R}$, where $TP$ counts correctly retrieved needles and $M$ is the number of needles embedded.
- ATC Score: Fraction of correctly answered multiple-choice (post Circular-Eval), weighted as above.
- Under-Thinking Index: Quantifies premature reasoning cessation,

$$U = 1 - \frac{\text{facts retrieved}}{\text{facts required}},$$

where $U$ above a threshold (e.g., $0.5$) indicates pronounced under-thinking: models retrieve the first fact but fail to continue.
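The metrics above can be sketched in a few lines. The normalized-edit-distance form and the under-thinking ratio follow the definitions given here; the keyword weighting in the released scorer may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def soft_score(answer: str, reference: str) -> float:
    """Levenshtein-based soft score on a 0-100 scale; 100 = exact match."""
    if not answer and not reference:
        return 100.0
    return 100.0 * (1 - levenshtein(answer, reference) / max(len(answer), len(reference)))

def prf1(retrieved: set, gold: set) -> tuple:
    """Precision / recall / F1 over retrieved needle sets (M-RT)."""
    tp = len(retrieved & gold)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def under_thinking_index(retrieved_count: int, required_count: int) -> float:
    """U = 1 - retrieved/required; U above ~0.5 signals under-thinking."""
    return 1.0 - retrieved_count / required_count
```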
6. Model Ecosystem and Experimental Conditions
Benchmarked models include: GPT-4 Turbo, Claude-3-Opus, GLM4-9B-Chat-1M, InternLM2.5-7B-Chat-1M, InternLM2-7B-200K, InternLM2-20B-200K, Qwen-1.5-72B-vLLM, Mixtral-8×7B, LLaMA-2 (7B/13B/70B), Baichuan2 (7B/13B), Gemma (2B/7B), Mistral-7B, Yi-6B, OrionStar-14B, DeepSeek-67B, Zephyr-7B. English and Chinese tasks are supported. Tokenization relies on OpenAI tiktoken; serving on LMDeploy (200K) and vLLM (1M). Infrastructure utilizes NVIDIA A100 servers, Python 3.10, PyTorch, and OpenCompass for orchestration and data management. All resources and data are accessible at https://github.com/open-compass/opencompass (Li et al., 2024).
7. Empirical Findings and Implications
NeedleBench reveals distinctive failure modes and strengths among long-context LLMs:
- Single-fact Retrieval: Remains robust up to 200K tokens for several models (InternLM2, Mixtral, Qwen-72B, GPT-4 Turbo, Claude-3).
- Multi-fact Retrieval: Performance deteriorates sharply as the number of needles $M$ or the context length increases; open-source models typically retrieve only the initial fact, with near-universal under-thinking in 32K–200K contexts.
- Reasoning Tasks (M-RS): Incorporating three to four facts drops accuracy below 50%; multi-step reasoning proves challenging.
- Information-Dense Reasoning (ATC): All models, including GPT-4 Turbo and Claude-3, show steep degradation past 6–8 reasoning steps. Open-source models' ATC accuracy falls below 10% at higher step counts, even though the corresponding contexts span only a few thousand tokens. DeepSeek-67B is a notable exception, approaching API-model performance.
- Model Specialization: InternLM2 variants excel in S-RT but are vulnerable to under-thinking in M-RT. Qwen and Mixtral exhibit balanced performance across S-RT, M-RT, and M-RS. GLM4-9B-Chat-1M supports extended contexts (1M tokens) but is acutely prompt-sensitive.
- Failure Cluster Analysis: With increasing context, retrieval failures arise predominantly from under-thinking rather than memory loss; improved prompting (e.g., explicit listing instructions) partially alleviates under-thinking.
A plausible implication is that future LLM architectures and tuning paradigms must address guidance for multi-step retrieval and reasoning, rather than simply expanding context windows.
8. Reproducibility, Extension, and Application Pipeline
NeedleBench’s procedure is modular; the evaluation protocol is captured with the following pseudocode:
```
initialize OpenCompass
for each model in MODEL_LIST:
    for each language ℓ in {EN, ZH}:
        for each task in {S-RT, M-RT, M-RS, ATC}:
            set task-specific parameters (M, reasoning steps, few-shots)
            for each L_max in LENGTH_SETTINGS:
                compute {l_i}, {d_{i,j}}
                for each (l_i, d_{i,j}) combination:
                    context = sample_filler(ℓ, length ≈ l_i)
                    needles = sample_needles(task, ℓ)
                    context.insert(needles at depths d_{i,*})
                    prompt = build_prompt(task, context, question_spec)
                    output = model.generate(prompt, max_tokens=…)
                    score = evaluate(output, reference, task)
aggregate_results()
plot_heatmaps()
```
Final scoring aggregates per-task performance into an overall benchmark score. All datasets, code, scripts, and configuration are maintained under the OpenCompass umbrella, supporting extension to additional languages, model suites, or custom needle-insertion distributions.
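As a sketch, aggregation over the four task families can be an unweighted mean of per-task scores; the official configs may apply task-specific weights, so this form is an assumption.

```python
# Assumed unweighted-mean aggregation over per-task scores (0-100 each).

def overall_score(task_scores: dict) -> float:
    """Mean of the per-task scores, e.g. over {S-RT, M-RT, M-RS, ATC}."""
    return sum(task_scores.values()) / len(task_scores)
```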
NeedleBench thus provides a critical and extensible toolset for practitioners and researchers seeking rigorous, bilingual evaluation of LLM retrieval and reasoning under long-context and variable depth conditions (Li et al., 2024).