
Forward-Looking Active Retrieval (FLARE)

Updated 28 February 2026
  • FLARE is a retrieval-augmented language model methodology that dynamically triggers targeted document retrieval based on low confidence in generated sentences.
  • It iteratively predicts the next sentence and conditionally retrieves external documents to supplement missing or uncertain factual details.
  • Empirical results indicate that FLARE reduces hallucinations and omissions and outperforms static retrieval approaches in long-form generation.

Forward-Looking Active Retrieval Augmented Generation (FLARE) is a retrieval-augmented language model (LM) methodology that addresses the limitations of fixed, single-shot retrieval in long-form, knowledge-intensive generation. Unlike standard approaches that fetch external knowledge only once per input, FLARE actively determines both when to retrieve new information and which queries will best supplement partial, low-confidence generations. FLARE iteratively predicts the next sentence, identifies uncertainty, and conditionally retrieves targeted documents to regenerate content, producing output with improved factual accuracy and reduced hallucination risk (Jiang et al., 2023).

1. Motivation and Problem Statement

Existing retrieval-augmented LMs typically retrieve supporting documents just once, based on the initial user input $x$, before generating the complete sequence $y$. While sufficient for short-form tasks, this method is inadequate for long-form, multi-aspect outputs such as essays, summaries, or multi-step question answering. The initial retrieval may overlook facts required by later segments, and unforeseen aspects frequently emerge during generation, necessitating new context as each part is written. Human writers routinely seek new information “just in time,” adapting retrieval to emergent needs to limit hallucination and omissions. FLARE introduces a systematic framework for LMs that similarly adapts retrieval dynamically, matching the evolving informational demands of generation (Jiang et al., 2023).

2. Algorithmic Framework

FLARE operates by anticipating future content and using this anticipation to drive targeted, confidence-triggered retrievals. The core procedure at each sentence step $t$ consists of:

  1. Generating a tentative next-sentence hypothesis $\hat{s}_t = LM([x, y_{<t}])$ without consulting new evidence.
  2. Assessing token-level confidence via $c(w) = P_{LM}(w \mid \text{context})$. If all tokens in $\hat{s}_t$ have confidence at or above a threshold $\theta$, the sentence is accepted. If not, the algorithm:
    • Constructs a retrieval query $q_t$ (either by masking low-confidence tokens or by generating questions targeting low-confidence spans).
    • Retrieves the top-$K$ support documents $D_{q_t} = ret(q_t)$ using IR techniques (BM25 or Bing API).
    • Regenerates the sentence with this new context: $s_t = LM([D_{q_t}, x, y_{<t}])$.
  3. Repeating the process sentence by sentence until generation completes.

Pseudocode formalizing the process is as follows:

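initialize y = []; t = 1
while not LM signals end-of-generation:
    s_hat = LM([x, y])                               # tentative next sentence, no retrieval
    p = [P_LM(w | [x, y]) for each token w in s_hat] # token-level confidences
    if min(p) >= theta:
        s_t = s_hat                                  # confident: accept as is
    else:
        q_t = f(s_hat)                               # masked or question-based query
        D_qt = ret(q_t, top=K)                       # top-K retrieved documents
        s_t = LM([D_qt, x, y])                       # regenerate with retrieved context
    y = y + [s_t]
    t = t + 1
return y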
This stepwise approach allows FLARE to precisely supplement missing facts as they become relevant, increasing resilience against both hallucination and omission.
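
The loop can also be written as a small runnable Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `generate`, `retrieve`, and `make_query`, the `max_sentences` cap, and the convention that an empty sentence signals end of generation are all hypothetical choices made here so the control flow can be exercised without a real LM or search backend.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions for illustration, not from the source):
#   generate(docs, x, y) -> (sentence, token_probs); "" means end of generation
#   retrieve(query, k)   -> list of up to k documents
#   make_query(sentence, token_probs) -> query string
Generate = Callable[[Sequence[str], str, Sequence[str]], Tuple[str, List[float]]]
Retrieve = Callable[[str, int], List[str]]
MakeQuery = Callable[[str, List[float]], str]


def flare_generate(
    x: str,
    generate: Generate,
    retrieve: Retrieve,
    make_query: MakeQuery,
    theta: float = 0.4,
    k: int = 2,
    max_sentences: int = 32,  # safety cap, an assumption of this sketch
) -> Tuple[List[str], int]:
    """Return the accepted sentences and the number of retrieval calls."""
    y: List[str] = []
    retrievals = 0
    for _ in range(max_sentences):
        # Step 1: tentative next sentence without any retrieved evidence.
        s_hat, probs = generate([], x, y)
        if not s_hat:
            break
        # Step 2: accept if every token is at or above the threshold.
        if min(probs, default=1.0) >= theta:
            s_t = s_hat
        else:
            # Otherwise build a query, retrieve, and regenerate the sentence.
            q_t = make_query(s_hat, probs)
            docs = retrieve(q_t, k)
            s_t, _ = generate(docs, x, y)
            retrievals += 1
        y.append(s_t)
    return y, retrievals
```

Returning the retrieval count alongside the text makes it easy to check how often the threshold actually fires for a given `theta`.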

3. Mathematical Formulation

Next-sentence generation is formalized by:

$$\hat{s}_t = LM([x, y_{<t}])$$

where $LM(\cdot)$ denotes autoregressive decoding up to the first sentence delimiter. Token-level confidence for each hypothesis word $w$ is $c(w) = P_{LM}(w \mid \text{context})$. Retrieval triggers if there exists any token index $i$ such that $c(\hat{s}_t[i]) < \theta$.
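
As a minimal sketch of this trigger, assume the LM exposes per-token log-probabilities for the tentative sentence (an assumption about the interface; the function names below are illustrative). Confidence is then the exponentiated log-probability, and retrieval fires when any token falls below $\theta$.

```python
import math
from typing import List, Sequence


def token_confidences(logprobs: Sequence[float]) -> List[float]:
    """Convert per-token log-probabilities into probabilities c(w)."""
    return [math.exp(lp) for lp in logprobs]


def should_retrieve(confidences: Sequence[float], theta: float) -> bool:
    """True if at least one token has confidence strictly below theta."""
    return any(c < theta for c in confidences)
```

With `theta = 0.4`, a sentence whose token confidences are `[0.9, 0.5]` is accepted, while `[0.9, 0.3]` triggers retrieval.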

The retrieval query $q_t$ is constructed either by masking tokens whose confidence falls below a threshold $\beta$ (implicit query) or by generating explicit natural-language questions for each low-confidence span. Retrieved documents $d$ are ranked by $\mathrm{BM25}(q_t, d)$ or by Bing API rank. Sentence regeneration then conditions on $[D_{q_t}, x, y_{<t}]$.
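
The implicit (masking) variant can be sketched as follows. This is an illustration only: the function name, the optional `mask_token` placeholder, and the choice to collapse a run of masked tokens into a single placeholder are assumptions of this sketch, and whitespace-separated words stand in for real LM tokens.

```python
from typing import Optional, Sequence


def masked_query(
    tokens: Sequence[str],
    confidences: Sequence[float],
    beta: float,
    mask_token: Optional[str] = None,
) -> str:
    """Build a query by masking tokens whose confidence is below beta.

    With mask_token=None the low-confidence tokens are dropped; otherwise
    each run of low-confidence tokens is replaced by one placeholder.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    out = []
    for tok, c in zip(tokens, confidences):
        if c >= beta:
            out.append(tok)
        elif mask_token is not None and (not out or out[-1] != mask_token):
            out.append(mask_token)
    return " ".join(out)


# Confidence values below are invented purely for illustration.
tokens = ["Joe", "Biden", "attended", "the", "University", "of", "Pennsylvania"]
confs = [0.95, 0.97, 0.90, 0.90, 0.20, 0.30, 0.10]
query = masked_query(tokens, confs, beta=0.4, mask_token="[MASK]")
```

Here `query` keeps the confident prefix and hides the uncertain institution name, so the retriever is not steered by a possibly wrong entity.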

4. Implementation Details

FLARE uses the text-davinci-003 LM accessed via the OpenAI API. Retrieval for Wikipedia-based tasks uses BM25 over a DPR Wikipedia dump; for open-domain summarization, the Bing Web Search API is used, with Wikipedia domains excluded. The number of retrieved documents $K$ is set per task as follows:

| Task | Retriever | K (docs) |
|---|---|---|
| 2WikiMultihopQA | BM25 on Wikipedia | 2 |
| StrategyQA, ASQA | BM25 on Wikipedia | 3 |
| WikiAsp | Bing Web Search | 5 |
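
To make the retrieval step concrete, here is a small self-contained sketch of top-$K$ ranking with the Okapi BM25 scoring function. It is not the retrieval stack used in the paper: the whitespace tokenization, the `k1` and `b` defaults (common textbook values), the smoothed IDF variant, and the `TOP_K` dictionary name are assumptions of this sketch. Only the per-task $K$ values come from the table above.

```python
import math
from collections import Counter
from typing import List

# Per-task K from the table above (WikiAsp uses Bing rather than BM25;
# it is listed here only for its K value).
TOP_K = {"2WikiMultihopQA": 2, "StrategyQA": 3, "ASQA": 3, "WikiAsp": 5}


def bm25_top_k(query: str, docs: List[str], k: int,
               k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Rank docs against query with Okapi BM25 and return the best k."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in tokenized:
        df.update(set(d))
    scored = []
    for idx, d in enumerate(tokenized):
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed, non-negative IDF variant.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scored.append((score, idx))
    # Highest score first; ties broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [docs[i] for _, i in scored[:k]]
```

A call such as `bm25_top_k(q_t, corpus, TOP_K["2WikiMultihopQA"])` would return the two best-matching passages for a query.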

Thresholds $\theta$ (for triggering retrieval) and $\beta$ (for masking) are tuned on development sets, with typical values $\theta \in \{0.4, 0.8\}$ and $\beta \approx 0.4$. Sentence boundaries are determined by generating up to 64 tokens and extracting the first complete sentence using the Punkt tokenizer. FLARE initiates retrieval for approximately 30–60% of sentences, markedly less often than every-sentence or fixed-interval approaches (Jiang et al., 2023).
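
The sentence-extraction step can be illustrated with a deliberately simplified stand-in. The source uses the Punkt tokenizer; the regular-expression splitter below is a swapped-in substitute chosen so the example runs without external data, and it will mis-split on abbreviations such as "Dr." that Punkt is designed to handle. The function name is hypothetical.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace.
# Simplified stand-in for Punkt, not equivalent to it.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def first_sentence(generated: str) -> str:
    """Return the first sentence of a (possibly truncated) generation."""
    text = generated.strip()
    return _SENTENCE_END.split(text, maxsplit=1)[0]
```

Given a 64-token chunk that runs into a second or third sentence, only the first sentence is kept as the tentative hypothesis, and the remainder is discarded.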

5. Empirical Evaluation and Results

FLARE is benchmarked against no retrieval, single-time retrieval, and passive multi-retrieval schemes (windowed, per-sentence, and question-decomposition strategies). Experiments are conducted on 500-example test sets with few-shot in-context learning, spanning the following tasks: 2WikiMultihopQA (multihop QA: EM, F1), StrategyQA (commonsense QA: EM), ASQA and ASQA-hint (long-form QA: EM, Disambig-F1, ROUGE-L), and WikiAsp (open-domain summarization: UniEval factuality, entity-F1, ROUGE).
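
For readers unfamiliar with the EM and F1 metrics named above, the following sketch shows the common question-answering formulation: exact match after light normalization, and token-overlap F1. The normalization steps (lowercasing, stripping punctuation and English articles) follow widespread QA evaluation practice and are an assumption here; the source does not specify the exact scoring scripts.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, predicting "University of Pennsylvania" against the gold answer "University of Delaware" scores zero on exact match but two thirds on F1, since two of three tokens overlap.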

Selected results:

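| Task | Baseline | Baseline EM / Factuality | FLARE |
|---|---|---|---|
| 2WikiMultihopQA | No retrieval | 28.2 | 51.0 |
| 2WikiMultihopQA | Single-time | 39.4 | 51.0 |
| 2WikiMultihopQA | Q-decomp | 47.8 | 51.0 |
| StrategyQA | Single-time | 68.6 | 77.3 |
| ASQA | Single-time | 40.0 | 41.3 |
| ASQA-hint | Single-time | 43.2 | 46.2 |
| WikiAsp (UniEval) | Single-time | 52.4 | 53.4 |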

In these results, FLARE outperforms every listed passive baseline, with the largest margins on 2WikiMultihopQA and StrategyQA and smaller margins on ASQA, ASQA-hint, and WikiAsp. This supports the claim that fine-grained, confidence-driven, forward-looking retrieval introduces factual support “just when needed” and “fetches what is needed next” for more accurate and less hallucinatory long-form output (Jiang et al., 2023).
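
The size of each margin follows directly from the selected results. The short sketch below recomputes the absolute gains in points; the scores are copied from the table above, while the dictionary layout and function name are illustrative.

```python
# Scores copied from the selected-results table; structure is illustrative.
RESULTS = {
    "2WikiMultihopQA": {
        "flare": 51.0,
        "baselines": {"No retrieval": 28.2, "Single-time": 39.4, "Q-decomp": 47.8},
    },
    "StrategyQA": {"flare": 77.3, "baselines": {"Single-time": 68.6}},
    "ASQA": {"flare": 41.3, "baselines": {"Single-time": 40.0}},
    "ASQA-hint": {"flare": 46.2, "baselines": {"Single-time": 43.2}},
    "WikiAsp": {"flare": 53.4, "baselines": {"Single-time": 52.4}},
}


def absolute_gains(results):
    """FLARE score minus each baseline score, rounded to one decimal."""
    return {
        task: {
            name: round(entry["flare"] - score, 1)
            for name, score in entry["baselines"].items()
        }
        for task, entry in results.items()
    }


gains = absolute_gains(RESULTS)
```

On 2WikiMultihopQA the gain over single-time retrieval is 11.6 points and over question decomposition 3.2 points, whereas on ASQA and WikiAsp the gain over single-time retrieval is 1.3 and 1.0 points respectively.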

6. Qualitative Analysis and Hallucination Avoidance

Evaluation on knowledge-sensitive prompts demonstrates a reduction in hallucinations. For example, with a prompt regarding Joe Biden’s education, the next-sentence hypothesis $\hat{s}_t$ “Joe Biden attended the University of Pennsylvania” triggers retrieval due to low confidence in the institution’s name. The masked query “Joe Biden attended the [MASK]” returns the correct document about the University of Delaware, and regeneration produces the factually accurate sentence. Similarly, in open-ended summary tasks, drifts or uncertain details trigger retrieval, providing correct dates and entities and minimizing the risk of fabricated content. These instances illustrate targeted retrieval's role in factual correction and enrichment (Jiang et al., 2023).

7. Significance and Implications

FLARE introduces a lightweight, inference-time paradigm that leverages next-sentence anticipation for active, context-sensitive retrieval. It adapts retrieval not only in timing but in content, dynamically aligning knowledge access with emergent uncertainties in generation. This approach bridges the gap between passive retrieval routines and the adaptive behavior exhibited in human information seeking, yielding marked improvements in factuality across various knowledge-intensive, long-form NLP tasks. A plausible implication is that similar forward-looking, confidence-triggered retrieval architectures could further enhance factual robustness in next-generation generative systems (Jiang et al., 2023).
