
IterKey: Iterative Keyword Framework for RAG

Updated 16 May 2026
  • IterKey is an iterative, LLM-driven framework that generates and refines keyword sets to enhance BM25-based retrieval for accurate open-domain QA.
  • It employs a multi-stage process—including keyword generation, document retrieval, answer generation, and validation—to balance transparency with high performance.
  • Empirical evaluations show a 5–20 percentage-point boost in Exact Match and a roughly 20 percentage-point increase in top-5 recall over single-step BM25 retrieval.

IterKey is an iterative keyword-based framework for Retrieval-Augmented Generation (RAG) that leverages LLMs to reconcile the transparency of sparse retrieval methods with the high factual accuracy typically associated with dense retrieval. The core innovation of IterKey lies in its LLM-driven, multi-stage process for generating and refining interpretable keyword sets to enhance retrieval over a BM25 backend, providing both operator auditability and accuracy in open-domain question answering scenarios (Hayashi et al., 13 May 2025).

1. Motivation and Problem Setting

In RAG, LLMs handle information queries by consulting external documents, a strategy known to mitigate hallucinations and stale knowledge characteristic of closed-book architectures. Two principal paradigms exist: dense retrieval (using vector embedding similarity) and sparse retrieval (using token overlap, e.g., BM25). Dense retrieval systems such as Contriever, BGE, or E5 typically yield higher recall and answer accuracy but are opaque—offering no direct justification for why a particular document was retrieved. In contrast, sparse systems such as BM25 are interpretable, since document inclusion can be traced to explicit term matches, but are less effective at capturing semantic similarity or nuanced intent, and thus may underperform in accuracy.

IterKey is designed to address the tension between transparency and accuracy in retrieval. Its objective is to maintain interpretability (through the use of explicit, LLM-curated keywords in sparse retrieval) while recovering the recall and question answering (QA) accuracy benefits commonly attributed to dense models.

2. Framework and Iterative Procedure

The IterKey workflow proceeds in up to $N$ iterations, with each iteration consisting of three LLM-driven stages:

  1. Keyword Generation: The LLM receives the input query $q$ and outputs a set of candidate keywords $\mathcal{K}^i$ (where $i$ indexes the iteration).
  2. Document Retrieval & Answer Generation: An expanded query $q^+ = q + \mathcal{K}^i$ is constructed. The top-$k$ documents $\mathcal{D}^i$ are retrieved using BM25 over this query. The LLM then generates an answer $a^i$ conditioned on $q$ and $\mathcal{D}^i$.
  3. Answer Validation: The LLM is prompted to return a “True”/“False” judgment on whether $a^i$ is fully supported by $\mathcal{D}^i$.

If validation yields “True,” the process terminates and returns $a^i$. If “False,” the LLM is re-prompted to perform Keyword Regeneration, refining the previous keyword set (possibly specializing or replacing tokens), and the process iterates. Because the loop stops at the first validated answer, it runs only as many retrieval-answer cycles as needed, up to the cap of $N$.

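Pseudocode excerpt:

    Input: query q, maximum iterations N, retrieval depth k
    K^1 <- LLM.GenerateKeywords(q)
    for i = 1, ..., N:
        D^i <- BM25(q + K^i, top-k)
        a^i <- LLM.GenerateAnswer(q, D^i)
        if LLM.Validate(q, D^i, a^i) = "True":
            return a^i
        K^(i+1) <- LLM.RegenerateKeywords(q, K^i)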
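The loop described in this section can be sketched in runnable Python. This is a minimal illustration, not the authors' implementation: the function name `iterkey`, the four callables, and the toy stand-ins below are all hypothetical, and returning the last unvalidated answer when the iteration cap is reached is an assumption, since the text does not say what happens in that case.

```python
from typing import Callable, List, Optional, Tuple


def iterkey(
    query: str,
    max_iters: int,
    generate_keywords: Callable[[str, Optional[List[str]]], List[str]],
    retrieve: Callable[[str], List[str]],
    generate_answer: Callable[[str, List[str]], str],
    validate: Callable[[str, List[str], str], bool],
) -> Tuple[Optional[str], List[List[str]]]:
    """Run up to max_iters keyword -> retrieve -> answer -> validate cycles.

    Returns the final answer plus the keyword set used in every iteration,
    so the retrieval trace can be inspected afterwards.
    """
    trace: List[List[str]] = []
    keywords: Optional[List[str]] = None
    answer: Optional[str] = None
    for _ in range(max_iters):
        # First pass generates keywords; later passes refine the previous set.
        keywords = generate_keywords(query, keywords)
        trace.append(keywords)
        expanded_query = " ".join([query, *keywords])  # q+ = q + K^i
        docs = retrieve(expanded_query)                # top-k BM25 hits
        answer = generate_answer(query, docs)
        if validate(query, docs, answer):              # early stop on "True"
            break
    # Assumption: if nothing validates, the last answer is returned as-is.
    return answer, trace


# Toy stand-ins for the LLM and BM25 calls (hypothetical, illustration only).
def _toy_keywords(query: str, previous: Optional[List[str]]) -> List[str]:
    return ["spacecraft"] if previous is None else ["Apollo 11", "lunar module"]


def _toy_retrieve(expanded_query: str) -> List[str]:
    if "Apollo 11" in expanded_query:
        return ["Eagle was the Apollo 11 lunar module."]
    return ["A spacecraft is a vehicle."]


def _toy_answer(query: str, docs: List[str]) -> str:
    return "Eagle" if "Eagle" in docs[0] else "unknown"


def _toy_validate(query: str, docs: List[str], answer: str) -> bool:
    return any(answer in doc for doc in docs)


answer, trace = iterkey(
    "Which craft landed on the Moon in 1969?",
    3,
    _toy_keywords,
    _toy_retrieve,
    _toy_answer,
    _toy_validate,
)
```

With these stand-ins the first keyword set fails validation and the second succeeds, so the loop stops after two of the three allowed iterations. Passing the LLM and retriever in as callables keeps the control flow separate from any particular model or index.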

3. Mathematical Foundations: BM25 Scoring in IterKey

The retrieval backbone in IterKey is classical BM25 scoring. For a document $d$ and query $q^+$:

$$\mathrm{BM25}(d, q^+) = \sum_{t \in q^+} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where $f(t, d)$ is the term frequency of token $t$ in $d$, $|d|$ the document length, $\mathrm{avgdl}$ the average document length, $k_1$ and $b$ are tunable hyperparameters, and $\mathrm{IDF}(t)$ is the inverse document frequency for token $t$.
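To make the scoring concrete, the following is a small self-contained BM25 scorer over pre-tokenized documents. Several details are assumptions rather than facts from the source: the smoothed IDF variant $\ln\left(1 + \frac{N_d - n_t + 0.5}{n_t + 0.5}\right)$ (with $N_d$ documents in the collection and $n_t$ of them containing $t$), the defaults $k_1 = 1.2$ and $b = 0.75$, and the function name `bm25_scores`. The text does not state which values or IDF form IterKey uses.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query_tokens: List[str],
    docs: List[List[str]],
    k1: float = 1.2,   # assumed default, not taken from the source
    b: float = 0.75,   # assumed default, not taken from the source
) -> List[float]:
    """Score every tokenized document against the query with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each token.
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))

    scores: List[float] = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue  # token absent from this document contributes nothing
            # Assumed smoothed IDF variant (always non-negative).
            idf = math.log(1.0 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            length_norm = 1.0 - b + b * len(d) / avgdl
            score += idf * tf[t] * (k1 + 1.0) / (tf[t] + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    ["apollo", "11", "lunar", "module"],
    ["generic", "spacecraft", "vehicle"],
    ["lunar", "surface"],
]
scores = bm25_scores(["apollo", "lunar"], corpus)
```

In this toy corpus the first document matches both query tokens and ranks highest, the third matches only "lunar", and the second matches nothing and scores zero. Every nonzero score is a sum of per-token contributions, which is what makes the ranking traceable to individual keywords.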

All refinement and validation in IterKey are implemented through prompt engineering; no trainable parameters beyond those of the LLM are introduced.

4. LLM-Based Keyword and Validation Prompts

IterKey exploits the instruction-following capabilities of modern LLMs for both keyword selection and self-validation. The keyword prompts produce explicit, JSON-formatted token sets. During validation, the LLM is constrained to a binary “True” or “False” output, which gates the iteration logic. Regeneration prompts are context-aware: they supply both the original query and the prior keywords, guiding the LLM to specialize its search cues as needed (for example, moving from “spacecraft” to “Apollo 11 lunar module”). The effectiveness of these prompts depends directly on how reliably the LLM follows instructions and assesses its own answers.
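Because the loop is gated on parsed LLM output, the parsing step matters in practice. The sketch below shows one defensive way to read both kinds of reply. It assumes the keywords arrive as a JSON array of strings, possibly wrapped in extra prose; the source only says the output is JSON-formatted, so the exact shape is an assumption, as are the helper names `parse_keywords` and `parse_verdict`.

```python
import json
from typing import List


def parse_keywords(raw: str) -> List[str]:
    """Extract keywords from a reply assumed to contain a JSON array of strings."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    seen = set()
    keywords: List[str] = []
    for item in items:
        if not isinstance(item, str):
            continue  # ignore non-string entries
        keyword = item.strip()
        key = keyword.lower()
        if keyword and key not in seen:  # drop blanks and case-insensitive repeats
            seen.add(key)
            keywords.append(keyword)
    return keywords


def parse_verdict(raw: str) -> bool:
    """Map the validator reply to a bool; anything but a bare 'True' is False."""
    return raw.strip().strip("\"'.").lower() == "true"
```

Treating every malformed or ambiguous verdict as "False" is a design choice in this sketch: it costs an extra iteration but avoids returning an answer that the model did not clearly endorse.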

5. Experimental Protocol and Empirical Results

Experiments were conducted on four open-domain QA datasets: Natural Questions, EntityQA, WebQA, and HotpotQA, each with 500 zero-shot examples. Retrieval backends included BM25 and competitive dense models (Contriever, BGE, E5). Tested LLMs were Llama-3.1 (8B, 70B), Gemma-2 (9B), and Phi-3.5-mini (3.8B). Performance metrics were Exact Match (EM) for answer evaluation and top-$k$ retrieval recall.
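For readers unfamiliar with the two metrics, here is one common way to compute them per example. The source does not specify the normalization or the recall criterion, so both are assumptions: answers are normalized SQuAD-style (lowercasing, stripping punctuation and English articles), and a retrieval counts as a hit when any gold answer string appears in one of the top-$k$ documents. The function names are hypothetical.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def hit_at_k(retrieved_docs: List[str], gold_answers: List[str], k: int) -> bool:
    """True if any gold answer string occurs in one of the top-k documents."""
    top_docs = [normalize(doc) for doc in retrieved_docs[:k]]
    for gold in gold_answers:
        gold_norm = normalize(gold)
        if gold_norm and any(gold_norm in doc for doc in top_docs):
            return True
    return False
```

Averaging `exact_match` and `hit_at_k` over the evaluation examples gives dataset-level EM and top-$k$ recall under these assumptions.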

Key results are summarized below (EntityQA, Llama-3.1 8B example):

Method                               Exact Match (EM)
-----------------------------------  ----------------
Vanilla (no retrieval)               33.6%
RAG (BM25, single-step)              54.0%
RAG (E5, dense retrieval)            52.9%
ITRG (iterative dense, E5)           60.6%
IterKey (BM25 iterative keywords)    61.0%

Across models and datasets, IterKey achieves 5–20 percentage-point EM improvements over BM25-based RAG, matching or slightly exceeding dense retrieval baselines and prior iterative dense refinement methods. Top-5 recall under BM25 rises by approximately 20 percentage points with IterKey's LLM-guided keywords (e.g., from 42% to 62% on EntityQA), demonstrating enhanced lexical coverage. The average number of retrieval-answer cycles is kept below 1.5 due to early stopping from LLM validation, making IterKey more efficient than dense iterative approaches.

6. Interpretability, Practical Significance, and Limitations

IterKey enables transparent, auditable retrieval: each document’s retrieval can be attributed to explicit LLM-generated keywords, allowing direct inspection and possible manual refinement of retrieval cues. This level of transparency is unattainable in dense retrieval, where high-dimensional embedding similarities obfuscate the selection rationale.
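The kind of audit described here can be illustrated with a few lines of code that report which generated keywords actually occur in a retrieved document. This is a hypothetical helper, not part of IterKey itself: the name `matched_keywords` and the simple lowercase word tokenization are assumptions, and a real BM25 index may tokenize or stem differently.

```python
import re
from typing import Dict, List


def _tokens(text: str) -> List[str]:
    """Lowercase word tokenizer (an assumption; real indexes may differ)."""
    return re.findall(r"\w+", text.lower())


def matched_keywords(keywords: List[str], doc: str) -> Dict[str, List[str]]:
    """For each keyword, list its tokens that appear in the document.

    Keywords with no matching token are omitted from the result.
    """
    doc_tokens = set(_tokens(doc))
    report: Dict[str, List[str]] = {}
    for keyword in keywords:
        hits = [tok for tok in _tokens(keyword) if tok in doc_tokens]
        if hits:
            report[keyword] = hits
    return report


report = matched_keywords(
    ["Apollo 11", "lunar module", "Saturn V"],
    "Eagle was the Apollo 11 lunar module.",
)
```

Here the report lists "Apollo 11" and "lunar module" as the cues that matched and omits "Saturn V", which is the sort of per-document explanation an operator could use when deciding whether to edit the keyword set by hand.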

The system’s self-validation not only acts as an early stopping mechanism but also supports reliability: answers are output only when judged to be grounded in retrieved evidence. This design is particularly beneficial in domains such as search dashboards, legal research, and medical decision-support, where user trust, auditing, and debugging are critical.

Observed limitations include dependence on the LLM’s capacity for robust instruction-following and reliable answer validation—constraints evidenced by Gemma-2’s reduced EM gains attributed to weaker validation reliability. Additional iterations do incur a modest computational cost, but this is offset by the early-stopping mechanism.

IterKey empirically demonstrates that the iterative, LLM-enhanced generation of sparse cues can bridge the historical divide between interpretability and high task performance in RAG systems, providing a practical foundation for transparent, high-accuracy information access in real-world applications (Hayashi et al., 13 May 2025).

References (1)
