Learning to Retrieve by Trying (LeReT)
- The paper introduces LeReT, a method that improves LLM retrieval through trial-and-error query generation guided by reinforcement signals, raising retrieval recall by up to 29 points and exact match accuracy by up to 17 points on multi-hop QA benchmarks.
- LeReT leverages reinforcement learning and self-supervised trial-based retrieval to iteratively refine query strategies and mitigate hallucination.
- Empirical results demonstrate improved multi-hop reasoning and answer accuracy, outperforming traditional retrievers like BM25 and SBERT.
Learning to Retrieve by Trying (LeReT) refers to a spectrum of methods in retrieval-augmented LLMs that leverage trial-based and feedback-driven mechanisms for learning or adapting retrieval policies. The unifying principle is that LLM retrieval proficiency is not static but can be improved through repeated querying and trial-and-error, guided by explicit rewards or self-supervised signals. This paradigm encompasses algorithms grounded in reinforcement learning (RL) with preference optimization, as well as reward-free, trial-based self-supervised frameworks. The aim is to mitigate hallucination, improve multi-hop reasoning, and generalize retrieval to novel or complex information needs.
1. Problem Setting and Motivations
Retrieval-augmented generation (RAG) pipelines extend LLMs by providing explicit access to external information sources. However, standard LLMs often hallucinate search queries, especially in multi-hop tasks where complex decompositions and indirect queries are required. The effectiveness of RAG depends crucially on the capacity of the retrieval module to select or generate queries that surface relevant knowledge, rather than merely recalling memorized facts.
LeReT approaches target these deficiencies by enabling LLMs to iteratively "try" different queries, absorb feedback from success or failure, and adapt their retrieval strategies. This "learning by trying" is operationalized either with explicit reinforcement signals or by constructing dynamic, self-curated retrieval corpora through repeated hypothesis and validation cycles (Hsu et al., 30 Oct 2024, Li et al., 2 May 2025, Wang et al., 2023). The core problem formulation involves:
- A multi-hop retrieval process: for a question $x$, at each hop $k$ generate a query $q_k$ based on $x$ and the previously retrieved context $C_{k-1}$ (see the sketch after this list).
- A black-box retriever that returns candidate documents.
- An objective of maximizing end-to-end answer correctness through improved intermediate retrievals.
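A minimal sketch of this multi-hop loop, with the LLM and retriever abstracted behind hypothetical `llm.generate_query` and `retriever.search` interfaces (illustrative names, not the papers' code):

```python
from typing import List

def multi_hop_retrieve(question: str, llm, retriever, num_hops: int = 2, k: int = 5) -> List[str]:
    """Iteratively generate queries and accumulate retrieved evidence.

    `llm.generate_query` and `retriever.search` are hypothetical interfaces:
    the former maps (question, context so far) to a search query string,
    the latter returns the top-k documents for a query.
    """
    context: List[str] = []
    for _ in range(num_hops):
        # Each query is conditioned on the question and all prior evidence.
        query = llm.generate_query(question=question, context=context)
        # The retriever is a black box; only its returned documents are observed.
        docs = retriever.search(query, k=k)
        context.extend(d for d in docs if d not in context)
    return context
```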
2. Reinforcement Learning-Enhanced Retrieval: The LeReT-IPO Framework
The principal LeReT algorithm (Hsu et al., 30 Oct 2024) frames query generation as a finite-horizon Markov decision process:
- State: $s_k = (x, d_1, \dots, d_{k-1})$, the question together with the documents retrieved at earlier hops.
- Action: $a_k = q_k$, the search query issued at hop $k$.
- Transition: $s_{k+1} = (s_k, d_k)$, appending the documents $d_k$ returned by the retriever for $q_k$.
- Reward: $r(s_k, a_k)$, typically measuring retrieval quality versus gold supports (a recall-based example is sketched below).
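As one concrete instantiation of the reward, the sketch below scores an episode by recall over the gold supporting documents; the paper's exact reward may differ in detail, so treat this as an illustrative assumption:

```python
def retrieval_recall(retrieved_docs: set, gold_supports: set) -> float:
    """Fraction of gold supporting documents recovered across all hops.

    A recall-style score is one natural choice for the reward r(s, a);
    the paper's exact definition may differ (e.g., per-hop weighting
    or answer correctness).
    """
    if not gold_supports:
        return 0.0
    return len(retrieved_docs & gold_supports) / len(gold_supports)
```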
Rather than using high-variance policy gradients, LeReT employs identity preference optimization (IPO), a preference-based RL method:
- Query pairs within an episode are compared (preference labeling).
- For a base (reference) policy $\pi_{\text{ref}}$ and current policy $\pi_\theta$, the implicit log-reward is $\beta \log \frac{\pi_\theta(a \mid s)}{\pi_{\text{ref}}(a \mid s)}$.
- The IPO loss over preferred/rejected query pairs $(q^+, q^-)$ is minimized (a PyTorch sketch follows this list):

$$\mathcal{L}_{\text{IPO}}(\theta) = \mathbb{E}_{(s,\, q^+,\, q^-)}\!\left[\left(\log \frac{\pi_\theta(q^+ \mid s)\,\pi_{\text{ref}}(q^- \mid s)}{\pi_\theta(q^- \mid s)\,\pi_{\text{ref}}(q^+ \mid s)} - \frac{1}{2\beta}\right)^{2}\right]$$
- This approach directly aligns policy log-probabilities to empirical reward gaps.
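A minimal PyTorch sketch of the IPO objective, assuming summed sequence log-probabilities under the current and frozen reference policies have already been computed (variable names are illustrative):

```python
import torch

def ipo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """IPO loss over batches of preferred (w) / rejected (l) query pairs.

    logp_* are summed token log-probabilities under the current policy;
    ref_logp_* are the same quantities under the frozen reference policy.
    """
    # Implicit reward margin between the preferred and rejected queries.
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # IPO regresses this margin toward 1/(2*beta) with a squared loss.
    return ((h - 1.0 / (2.0 * beta)) ** 2).mean()
```

Regressing the margin toward a fixed target, rather than maximizing it without bound, makes IPO robust to near-deterministic preference labels and avoids the high variance of on-policy policy gradients.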
Sampling for exploration leverages diverse few-shot prompts generated via DSPy, which induces coverage over both high- and low-reward trajectories.
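A sketch of how this exploration data could be turned into preference pairs: sample one query per diverse few-shot prompt, score each by a recall-style reward, and keep pairs with a sufficient reward gap. Prompt sampling is abstracted behind a hypothetical `llm.sample_fewshot_prompts` call; this is a simplified reading of the procedure, not the paper's exact pipeline.

```python
from itertools import combinations

def build_preference_pairs(question, context, llm, retriever, gold_supports,
                           num_prompts: int = 4, min_gap: float = 0.1, k: int = 5):
    """Sample one query per few-shot prompt, score each by a recall-style
    retrieval reward, and keep pairs whose reward gap exceeds a threshold.

    `llm.sample_fewshot_prompts` and `llm.generate_query` are hypothetical
    stand-ins for DSPy-style prompt sampling and prompted query generation.
    """
    scored = []
    for prompt in llm.sample_fewshot_prompts(num_prompts):
        query = llm.generate_query(question=question, context=context, prompt=prompt)
        docs = set(retriever.search(query, k=k))
        reward = len(docs & gold_supports) / max(len(gold_supports), 1)
        scored.append((query, reward))

    pairs = []
    for (q_a, r_a), (q_b, r_b) in combinations(scored, 2):
        if abs(r_a - r_b) >= min_gap:
            # The higher-reward query becomes the preferred element of the pair.
            preferred, rejected = (q_a, q_b) if r_a > r_b else (q_b, q_a)
            pairs.append((preferred, rejected))
    return pairs
```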
After RL training, a context distillation step (SFT) lets the model generate queries from a single standardized prompt, reducing deployment complexity.
3. Extensions: Reward-Free and Self-Supervised Trial-Based Retrieval
Retrial-Augmented Learning (RAL) (Li et al., 2 May 2025) generalizes "learning to retrieve by trying" to self-supervised, reward-free settings. RAL organizes LLM-driven knowledge acquisition into:
- Hypothesis proposal: Generate and inject new informational hypotheses into a vector store.
- Hypothesis validation: Evaluate these hypotheses retrospectively on new observations, producing validation records.
- Experience consolidation: Aggregate and summarize validated trials into actionable knowledge.
This is formalized as a three-stage pipeline, in which each component of the database (hypotheses, validations, experiences) is continually updated based on observed LLM behavior, without relying on external rewards. The resulting RAG module acts as an autonomous, expanding retrieval memory, enabling robust decision-making and significantly reducing hallucination in domains with sparse or highly specialized data.
The learning signal over a trajectory is entirely self-supervised: validation records accumulated across retrials stand in for an external reward. A schematic of this retrial loop is sketched below.
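In the following schematic, the vector store is reduced to in-memory lists and the LLM calls are hidden behind hypothetical `propose`/`validate`/`summarize` wrappers; it is a simplified reading of RAL, not its reference implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RetrialMemory:
    """Reward-free retrial memory: hypotheses, validation records, experiences."""
    hypotheses: List[str] = field(default_factory=list)
    validations: List[dict] = field(default_factory=list)
    experiences: List[str] = field(default_factory=list)

def retrial_step(observation: str, llm, store: RetrialMemory) -> None:
    """One hypothesis-proposal / validation / consolidation cycle.

    `llm.propose`, `llm.validate`, and `llm.summarize` are hypothetical
    wrappers around prompted LLM calls; no external reward is used.
    """
    # 1. Hypothesis proposal: inject a new candidate belief into the store.
    hypothesis = llm.propose(observation, context=store.experiences)
    store.hypotheses.append(hypothesis)

    # 2. Hypothesis validation: retrospectively check prior hypotheses
    #    against the new observation, producing validation records.
    for h in store.hypotheses:
        verdict = llm.validate(hypothesis=h, observation=observation)
        store.validations.append({"hypothesis": h, "verdict": verdict})

    # 3. Experience consolidation: summarize validated trials into
    #    actionable knowledge that future retrievals can surface.
    store.experiences.append(llm.summarize(store.validations))
```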
4. LeReT in Iterative Dense Retriever Training and In-Context Example Selection
LeReT also denotes an iterative dense retriever construction framework for selecting high-quality in-context examples for LLMs (Wang et al., 2023). The workflow comprises:
- A reward model: a cross-encoder (e.g., ELECTRA-base) predicts a relevance score for each candidate in-context example.
- A dense retriever: a bi-encoder (E5-base) trained by minimizing a joint distillation and contrastive loss (sketched after this list), matching its candidate ranking to the reward model's.
- Iterative bootstrapping: The retriever and reward model are alternately updated; hard negatives are resampled according to current retrieval.
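One plausible instantiation of the joint objective combines KL-based distillation toward the reward model's ranking with an InfoNCE-style contrastive term; the exact weighting and temperature in the paper may differ.

```python
import torch
import torch.nn.functional as F

def retriever_loss(sim_scores: torch.Tensor, reward_scores: torch.Tensor,
                   pos_index: torch.Tensor, alpha: float = 1.0, tau: float = 0.05):
    """Joint distillation + contrastive loss for the bi-encoder retriever.

    sim_scores:    [B, C] bi-encoder similarities for C candidates per query
    reward_scores: [B, C] cross-encoder reward-model scores for the same candidates
    pos_index:     [B]    index of the positive (highest-reward) candidate
    """
    # Distillation: match the retriever's candidate distribution to the
    # reward model's ranking via KL divergence.
    log_p_retriever = F.log_softmax(sim_scores / tau, dim=-1)
    p_reward = F.softmax(reward_scores / tau, dim=-1)
    kl = F.kl_div(log_p_retriever, p_reward, reduction="batchmean")

    # Contrastive term: the top-reward candidate is the positive; the rest
    # (including resampled hard negatives) serve as negatives.
    nce = F.cross_entropy(sim_scores / tau, pos_index)

    return kl + alpha * nce
```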
This procedure yields significant performance gains on a suite of 30 tasks spread over 9 categories, with LeReT (2 iterations) achieving an average score of 66.5% versus 61.3% for strong BM25 baselines. Gains generalize to held-out (unseen during training) tasks and various LLM backbones.
| Method | Avg. Score (%) |
|---|---|
| Zero-shot | 44.9 |
| Random | 57.9 |
| BM25 | 61.3 |
| SBERT | 62.1 |
| E5_base | 61.4 |
| EPR (re-impl.) | 63.5 |
| LeReT (1 iter) | 65.7 |
| LeReT (2 iter) | 66.5 |
Ablation studies confirm that reward modeling (with ground-truth inputs) is critical; raw LLM log-probabilities or contrastive-only losses are less effective.
5. Empirical Results and Practical Implications
LeReT achieves substantial improvement in retrieval and downstream answer accuracy over various strong baselines:
- On HotpotQA and HoVer, LeReT increases retrieval recall by up to 29 points and exact match accuracy by up to 17 points (e.g., Llama 3.1 70B with LeReT: 53.5/64.9 EM/F1 vs. base 38.1/47.7) (Hsu et al., 30 Oct 2024).
- Performance gains are observed across different retrievers (ColBERTv2, Azure AI Search) and LLM families (Llama, Gemma, GPT-4).
- In dense retriever tasks, LeReT-trained retrievers outperform BM25, SBERT, and E5_base, and the benefits generalize to new tasks and model backbones (Wang et al., 2023).
- In reward-free RL environments (LLM-PySC2), retrial-based retrievals reduce hallucination, accelerate convergence, and transfer across OOD tasks and LLMs (Li et al., 2 May 2025).
6. Limitations, Failure Modes, and Open Problems
- LeReT with RL and explicit reward feedback assumes high-quality supervision (gold support docs or answer correctness). Performance with only weak reward signals degrades considerably (Hsu et al., 30 Oct 2024).
- The cost of sampling multiple prompt trajectories and querying retrievers is nontrivial, though parallelizable.
- Preference-based optimization and greedy hop-wise training can propagate credit only locally and may miss global optima when multi-hop dependencies are strong.
- Self-supervised LeReT (RAL) relies on the LLM's ability to self-assess and validate, which introduces bias toward model-internal heuristics.
- The RL framework does not co-adapt the retriever representation; only the query-generation policy is optimized (Hsu et al., 30 Oct 2024).
7. Broader Impact and Future Directions
The LeReT paradigm demonstrates that "learning to retrieve by trying" is effective in both supervised RL and reward-free self-supervision regimes. Several extensions and open research directions are noted:
- Indirect supervision: Integrating answer verification, user preferences, or other weak signals in lieu of gold supports.
- Joint retriever and query-tuner adaptation: Allowing co-training of both components to optimize the full retrieval-answering pipeline.
- Application to non-retrieval LLM tools: Extending trial-based RL to code execution, database interaction, and multi-modal toolchains with clear per-action feedback (Hsu et al., 30 Oct 2024).
- Enhanced sample efficiency and memory regularization in dynamic retrieval corpora (Li et al., 2 May 2025).
These developments underscore the importance of adaptive retrieval in LLM-centric systems, where trial-driven learning mitigates hallucination and enables transfer across both knowledge-intensive and decision-theoretic tasks. The "by trying" principle establishes a foundation for continual, feedback-responsive information access in future large-scale autonomous agents.