Calibrating entropy-triggered retrieval and amortizing multi-query costs in RAG

Determine how to calibrate token-level entropy thresholds for uncertainty-triggered retrieval in Retrieval-Augmented Generation (RAG), specifically in approaches such as FLARE and RIND+QFS, across different application domains. Also develop strategies to amortize the compute and latency costs of issuing multiple reformulated queries under tight end-to-end latency budgets.

Background

The paper reviews prompting and query strategies that make retrieval adaptive, including methods that trigger retrieval only when model uncertainty spikes (e.g., FLARE, RIND+QFS). These approaches improve the balance between recall and precision and help control latency, but they introduce new design choices: when to trigger retrieval and how many additional queries to issue.
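To make the triggering decision concrete, here is a minimal Python sketch of entropy-based triggering. It is an illustration under stated assumptions, not the procedure used by FLARE or RIND+QFS: the function names, the "any token above threshold" rule, and the quantile-style calibration heuristic are all hypothetical. The sketch assumes the generator exposes a full next-token probability distribution at each step and that a non-empty set of held-out token entropies is available for each domain.

```python
import math

def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_retrieve(step_distributions, threshold):
    """Return True if any generated token's entropy exceeds the threshold."""
    return any(token_entropy(p) > threshold for p in step_distributions)

def calibrate_threshold(dev_entropies, trigger_rate):
    """Assumed heuristic: choose the threshold so that roughly `trigger_rate`
    of the token entropies seen on a domain's held-out set would exceed it."""
    ordered = sorted(dev_entropies, reverse=True)
    k = min(int(trigger_rate * len(ordered)), len(ordered) - 1)
    return ordered[k]
```

Calibrating one threshold per domain in this way fixes the share of tokens that trigger retrieval rather than the entropy value itself. That is only one possible design choice, and whether it transfers well across domains is exactly the question left open here.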

The authors note that while uncertainty-aware triggering avoids unnecessary lookups, issuing multiple queries can inflate latency and noise unless carefully managed. Consequently, calibrating entropy thresholds that govern when to retrieve and developing principled ways to amortize multi-query costs are identified as unresolved issues.
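One simple way to keep multi-query costs inside a latency budget is to deduplicate the reformulated queries, reuse cached results, and run the remaining lookups concurrently with a hard deadline. The sketch below illustrates that idea only; it is not a method proposed in the paper. The names `retrieve_all`, `search`, `budget_s`, and `cache` are hypothetical, and `search` stands in for any async retriever client.

```python
import asyncio

async def retrieve_all(queries, search, budget_s, cache):
    """Run deduplicated queries concurrently and keep whatever finishes
    within `budget_s` seconds. `search` is an assumed async callable that
    maps a query string to a list of documents."""
    # Normalise and deduplicate while preserving order.
    unique = list(dict.fromkeys(q.strip().lower() for q in queries))
    # Serve repeated queries from the cache at no retrieval cost.
    results = {q: cache[q] for q in unique if q in cache}
    pending = {asyncio.create_task(search(q)): q
               for q in unique if q not in cache}
    if pending:
        done, late = await asyncio.wait(set(pending), timeout=budget_s)
        for task in late:
            task.cancel()  # drop lookups that would exceed the budget
        for task in done:
            if task.exception() is None:
                q = pending[task]
                cache[q] = results[q] = task.result()
    return results
```

With this shape, the wall-clock cost of issuing several reformulations is bounded by `budget_s` instead of growing with the number of queries. The price is that slow lookups are silently dropped, which may reduce recall.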

References

Open questions include how to calibrate entropy thresholds across domains and how to amortise multi-query costs under tight latency budgets.

— A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges (2508.06401 - Brown et al., 8 Aug 2025) in Section “What are the innovative methods and approaches compared to the standard retrieval augmented generation?”, Subsubsection “Prompting and Query Strategies—Query reformulation, expansion, and selective triggering of queries”
