
Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation (2510.07414v1)

Published 8 Oct 2025 in cs.CL, cs.AI, and cs.IR

Abstract: Modern long-context LLMs perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.

Summary

  • The paper introduces HaystackCraft, a benchmark simulating heterogeneous retrieval methods and agentic workflows to evaluate LLM robustness.
  • It demonstrates that graph-based reranking effectively mitigates distractor effects, outperforming traditional sparse and dense retrieval techniques.
  • Evaluation results reveal significant performance degradation in LLMs during dynamic multi-round reasoning, highlighting the need for improved context engineering.

"Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation" (2510.07414)

Overview

This paper introduces HaystackCraft, a benchmark designed to evaluate the robustness of long-context LLMs in scenarios that reflect real-world complexity more faithfully than traditional needle-in-a-haystack (NIAH) tests. The benchmark accounts for retrieval-dependent distractor composition and for dynamic, agentic workflows in which errors can cascade. By incorporating heterogeneous retrieval strategies and multi-round, LLM-dependent settings, it aims to give a more realistic picture of the long-context reasoning challenges LLMs face.

Challenges Addressed

HaystackCraft tackles core challenges that arise from two aspects:

  1. Retrieval-Dependent Haystacks: Traditional NIAH benchmarks ignore the variability introduced by different retrieval strategies. Real-world applications rely on varied retrieval methods, such as sparse (e.g., BM25), dense (e.g., Qwen3-Embedding), hybrid, and graph-based retrieval. These methods shape the noise in the context by changing both the number and the nature of the distractors they surface; a minimal haystack-assembly sketch illustrating this dependence follows Figure 1 below.
  2. Agentic Error Propagation: In LLM applications built on dynamic, agentic workflows, early retrieval or reasoning errors can compound over multiple rounds, causing the process to drift from the original intent of the query and inflating the rankings of irrelevant distractors.

    Figure 1: Overview of the core challenges that HaystackCraft addresses.
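The sketch below is a simplification under assumed interfaces, not the authors' code: it shows how a haystack can be assembled from a retriever's ranking, with the gold "needle" passages embedded among the retriever's top-ranked non-gold documents. The `retrieve` callable, `context_budget`, and `gold_position` names are illustrative assumptions; the point is that both the retriever and the ordering policy determine what the model must ignore.

```python
# Minimal sketch (not the authors' code) of retrieval-dependent haystack
# assembly: distractors come from a retriever's ranking, so retriever choice
# and document ordering directly shape the noise an LLM must ignore.
from typing import Callable, List

def build_haystack(
    question: str,
    gold_passages: List[str],                   # passages needed to answer
    retrieve: Callable[[str, int], List[str]],  # any retriever: (query, k) -> ranked docs
    context_budget: int = 20,                   # total passages in the context window
    gold_position: str = "middle",              # where the needles are placed
) -> List[str]:
    # Distractors are the top-ranked retrieved passages that are NOT gold.
    ranked = retrieve(question, context_budget * 2)
    distractors = [d for d in ranked if d not in gold_passages]
    distractors = distractors[: context_budget - len(gold_passages)]

    # Ordering policy: needles at the start, middle, or end of the haystack.
    if gold_position == "start":
        return gold_passages + distractors
    if gold_position == "end":
        return distractors + gold_passages
    mid = len(distractors) // 2
    return distractors[:mid] + gold_passages + distractors[mid:]
```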

Retrieval and Haystack Composition

Retrieval Strategies

HaystackCraft evaluates the influence of distinct retrieval strategies by introducing distractors into context windows using various techniques:

  • Sparse Retrieval (e.g., BM25): Ranks documents by lexical overlap, which often pulls in irrelevant but similarly worded documents.
  • Dense Retrieval (e.g., Qwen3-Embedding): Surfaces semantically close documents; stronger dense retrievers can introduce harder distractors that are related to the question but do not contain the needed facts.
  • Hybrid and Graph-based Reranking (e.g., Personalized PageRank): These strategies combine content similarity with link-structure relevance to adjust the composition and ordering of the context given to the LLM; a minimal score-fusion sketch follows this list.
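As a concrete illustration of hybrid retrieval, the following sketch fuses a sparse and a dense ranking with reciprocal rank fusion (RRF), a common fusion rule. The paper evaluates hybrid retrieval but does not necessarily use this exact formula, and the document IDs below are made up.

```python
# Hedged sketch of hybrid retrieval via reciprocal rank fusion (RRF):
# each ranking contributes 1 / (k + rank) to a document's fused score.
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(
    rankings: List[List[str]],  # e.g. [bm25_ranked_ids, dense_ranked_ids]
    k: int = 60,                # standard RRF damping constant
) -> List[str]:
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a sparse and a dense ranking of the same corpus.
bm25_ranking = ["doc3", "doc1", "doc7", "doc2"]
dense_ranking = ["doc1", "doc2", "doc3", "doc9"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```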

Retrieval Performance and Distractor Complexity

The benchmark shows that graph-based reranking consistently outperforms the other retrieval strategies: it improves retrieval effectiveness while simultaneously mitigating the more harmful distractors, especially for multi-hop questions. A minimal Personalized PageRank (PPR) reranking sketch follows Figure 2.

Figure 2: Improved retrieval performance with reranking using PPR.
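The sketch below illustrates the idea behind PPR-based reranking over a hyperlink graph: the teleport distribution is seeded with the base retriever's scores, so documents that are both well retrieved and well connected to other retrieved documents move up the ranking. The graph, scores, and parameter values are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of graph-based reranking with Personalized PageRank (PPR)
# over a document hyperlink graph, seeded by a base retriever's scores.
import numpy as np

def ppr_rerank(adj: np.ndarray, seed_scores: np.ndarray,
               alpha: float = 0.85, iters: int = 50) -> np.ndarray:
    """adj[i, j] = 1 if doc i links to doc j; seed_scores from a base retriever."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling nodes teleport uniformly.
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    personalization = seed_scores / seed_scores.sum()
    scores = personalization.copy()
    for _ in range(iters):
        scores = alpha * trans.T @ scores + (1 - alpha) * personalization
    return np.argsort(-scores)  # document indices, best first

# Tiny example: 4 documents with a few hyperlinks and base retrieval scores.
adjacency = np.array([[0, 1, 1, 0],
                      [1, 0, 0, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
base_scores = np.array([0.5, 0.2, 0.2, 0.1])
print(ppr_rerank(adjacency, base_scores))
```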

Dynamic, LLM-Dependent Contexts

Agentic Workflows

HaystackCraft introduces dynamic NIAH tests that assess how LLMs perform in multi-round reasoning scenarios. These scenarios simulate agentic operations where LLMs refine queries, reflect on their reasoning, and determine appropriate stopping points to prevent cascading errors.

Figure 3: Performance drops as context complexity increases with enforced multi-round reasoning.
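A minimal sketch of such a dynamic NIAH loop is shown below, assuming hypothetical `retrieve` and `llm_step` interfaces in place of the benchmark's actual harness: each round rebuilds the haystack from the model's current query, records the model's past reasoning, and lets the model decide whether to stop.

```python
# Hedged sketch of a dynamic, LLM-dependent NIAH loop. `retrieve` and
# `llm_step` are hypothetical interfaces, not the benchmark's real API.
from typing import Callable, List, Tuple

def agentic_niah(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    # llm_step(question, haystack, notes) -> (answer, refined_query, stop)
    llm_step: Callable[[str, List[str], List[str]], Tuple[str, str, bool]],
    max_rounds: int = 5,
    k: int = 20,
) -> str:
    query, notes = question, []          # notes hold the model's past reasoning
    answer = ""
    for _ in range(max_rounds):
        haystack = retrieve(query, k)    # context is rebuilt from the current query
        answer, refined_query, stop = llm_step(question, haystack, notes)
        notes.append(f"query={query!r} -> answer={answer!r}")
        if stop:                         # the early-stop decision is part of the test
            break
        query = refined_query            # self-generated queries can drift and
                                         # push harmful distractors up the ranking
    return answer
```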

Evaluation Findings

Empirical evaluations with 15 long-context models show that even advanced systems like Gemini 2.5 Pro and GPT-5 struggle in agentic workflows: they suffer cascading failures triggered by self-generated distractors or fail to stop early, and performance degrades substantially as context complexity grows across rounds.

Conclusion

HaystackCraft serves as a testbed for understanding and improving the robustness of LLMs in realistic, retrieval-dependent, and agentic long-context settings. The findings highlight unresolved challenges in long-context reasoning and underscore the importance of realistic distractor composition and dynamic, multi-round test conditions in LLM evaluation. The benchmark offers a concrete pathway for future work on closing the gap between benchmark performance and practical long-context utility.
