First-Occurrence Latent Reward (FOLR)
- The paper introduces FOLR, a mechanism assigning turn-level rewards at the precise moment the ground-truth answer is first retrieved, thereby crediting partial progress in multi-turn reasoning.
- FOLR mitigates process and intra-group homogenization by providing nonzero rewards for intermediate retrievals, which enhances the variance needed for effective advantage estimation.
- Empirical results on datasets like NaturalQuestions and TriviaQA show that TSPO with FOLR improves exact match scores by 13–24% over baselines, leading to more stable and effective policy updates.
The First-Occurrence Latent Reward (FOLR) is a turn-level reward assignment mechanism introduced within Turn-level Stage-aware Policy Optimization (TSPO) to address inefficiencies and optimization barriers encountered in multi-turn tool-augmented reasoning. FOLR’s central principle is the allocation of explicit reward at the process step where the ground-truth answer is first retrieved during iterative search, providing partial credit for demonstrated progress. This turn-level reward signal increases intra-group reward variance and resolves the homogenization issues endemic to conventional outcome-only reinforcement learning frameworks in this domain (Ma et al., 30 Jan 2026).
1. Problem Setting and Rationale
In multi-turn search-augmented generation tasks, an agent (often a LLM) interacts with external tools (e.g., search engines) to iteratively retrieve evidence and synthesize answers. The standard RL formalism defines the environment as a Markov Decision Process (MDP) with:
- State $s_t$ capturing the dialogue history and accumulated tool outputs up to turn $t$.
- Action $a_t$, either a tool query or a generative response.
- Transition dynamics updating the state to $s_{t+1}$ based on the tool feedback $o_t$.
- Reward usually given sparsely: only the final output is compared to the ground truth, assigning $r = 1$ if the model’s answer is correct and $r = 0$ otherwise; all intermediate steps are rewardless.
Such outcome-level rewards entirely ignore intermediate reasoning and retrieval progress, leading to two homogenization phenomena:
- Process homogenization: All forms of procedural progress are erased when the final answer is incorrect, even if the model successfully retrieved relevant evidence.
- Intra-group homogenization: Group-based RL methods (e.g., Group Relative Policy Optimization, GRPO) compute normalized advantages within rollout groups. If every rollout returns zero reward, as under uniform failure with outcome-level rewards, the normalized advantages all vanish and gradient updates stall, halting policy improvement.
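The intra-group degeneracy can be seen in a small numerical sketch (values are illustrative; `group_advantages` is a hypothetical helper implementing the standard GRPO-style normalization, not code from the paper):

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages as used in GRPO-style methods:
    A_i = (r_i - mean) / (std + eps)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Outcome-only rewards for a rollout group where every final answer is wrong:
all_wrong = [0.0, 0.0, 0.0, 0.0]
print(group_advantages(all_wrong))  # [0.0, 0.0, 0.0, 0.0] -> no gradient signal
```

With a uniformly failing group, every advantage is exactly zero, so the policy gradient contributed by that group is zero regardless of how much useful retrieval occurred along the way.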
2. FOLR Mechanism: Definition and Implementation
FOLR is built on the insight that the retrieval of the ground-truth answer within the search context is a critical latent signal for progress, even if not ultimately synthesized in the final response. The mechanism operates as follows:
- For each trajectory $\tau$, define the first-occurrence turn $t^* = \min\{t : a^* \in o_t\}$, where $o_t$ is the tool feedback at turn $t$ and $a^*$ is the ground-truth answer (with $t^* = \infty$ if the answer never appears).
- Assign turn-level rewards by
$$r_t = \begin{cases} \alpha, & t = t^*, \\ 0, & \text{otherwise}, \end{cases}$$
with $\alpha$ typically set to $1$.
This assigns nonzero reward to intermediate steps demonstrating evidence acquisition for the ground-truth answer, even when synthesis is incorrect or incomplete.
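A minimal sketch of this assignment rule follows; the exact matching criterion (substring containment here) and the helper name `folr_rewards` are illustrative assumptions, not the paper's implementation:

```python
def folr_rewards(feedbacks, ground_truth, num_turns, alpha=1.0):
    """First-Occurrence Latent Reward: give reward alpha at the first turn
    whose tool feedback contains the ground-truth answer; all other turns
    get zero (the final outcome reward is handled separately)."""
    rewards = [0.0] * num_turns
    for t, obs in enumerate(feedbacks):
        if ground_truth in obs:          # illustrative match criterion
            rewards[t] = alpha           # first occurrence: credit progress
            break
    return rewards

# Toy trajectory: "Paris" shows up in the second turn's search results,
# even if the final synthesized answer later turns out wrong.
obs = ["no results", "capital of France is Paris", "unrelated page"]
print(folr_rewards(obs, "Paris", num_turns=3))  # [0.0, 1.0, 0.0]
```

Because the scan stops at the first match, later re-retrievals of the same evidence earn no additional reward, which keeps the signal tied to the first-occurrence turn $t^*$.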
3. Resolution of Reward Homogenization
Process-level disambiguation: Under FOLR, trajectories where the agent successfully retrieves the correct answer (i.e., $t^* < \infty$, near misses) are distinguished from complete failures ($t^* = \infty$). Nonzero reward is assigned at the first-retrieval turn, ensuring process credit.
Intra-group advantage variance: Within a rollout group where all final answers are incorrect (“all-wrong”), some trajectories may have retrieved relevant evidence (partial progress) while others have not. The standard deviation among the group’s returns becomes nonzero, producing nontrivial advantages and enabling effective gradient computation and learning.
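The restored variance can be checked directly with hypothetical returns for an all-wrong group in which two of four rollouts earned the FOLR partial reward ($\alpha = 1$):

```python
import statistics

# Returns for four rollouts whose final answers are all incorrect; rollouts
# 1 and 3 retrieved the ground truth mid-search and earned the FOLR reward.
returns = [0.0, 1.0, 0.0, 1.0]

mu = statistics.fmean(returns)
sigma = statistics.pstdev(returns)
advantages = [(r - mu) / (sigma + 1e-8) for r in returns]

print(sigma)       # 0.5 -> FOLR breaks the all-zero degeneracy
print(advantages)  # nonzero advantages -> usable policy gradient
```

Under outcome-only rewards this group would yield `returns = [0, 0, 0, 0]` and a zero gradient; with FOLR, the partially successful rollouts receive positive advantages and the fully unsuccessful ones negative, so the update pushes the policy toward evidence-acquiring behavior.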
4. Optimization Objective and Policy Training
TSPO, leveraging FOLR, employs a PPO-style surrogate loss with group- and turn-level normalization:
$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\min\Big(\rho_{i,t}\,\hat{A}_{i,t},\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{i,t}\Big)\right],$$
where
$$\rho_{i,t} = \frac{\pi_\theta(a_{i,t}\mid s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t}\mid s_{i,t})}$$
is the importance ratio, and $\hat{A}_{i,t}$ is the normalized advantage at turn $t$ for rollout $i$, computed from FOLR rewards. Training alternates between trajectory sampling, reward/advantage computation via FOLR, and clipped PPO updates. The group normalization can be targeted at “all-wrong” groups or at all groups; the former yields computational and convergence efficiency.
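A minimal sketch of the clipped turn-level surrogate, assuming precomputed log-probabilities and advantages flattened over (rollout, turn) pairs; the function name and interface are illustrative, not TSPO's actual API:

```python
import math

def tspo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO-style surrogate over flattened turn-level terms.
    Each argument is a flat list with one entry per (rollout, turn) pair.
    Returns the loss to minimize (negated surrogate mean)."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                     # importance ratio rho
        clipped = min(max(ratio, 1 - eps), 1 + eps)   # clip(rho, 1-eps, 1+eps)
        terms.append(min(ratio * adv, clipped * adv))  # pessimistic bound
    return -sum(terms) / len(terms)

# Identical policies (ratio = 1): loss reduces to the negative mean advantage.
print(tspo_surrogate([0.0, 0.0], [0.0, 0.0], [1.0, 1.0]))  # -1.0
```

The `min` of the unclipped and clipped terms implements PPO's pessimistic bound: a large ratio cannot amplify a positive advantage beyond the clip range, which stabilizes updates from the FOLR-derived advantages.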
5. Theoretical Insights
FOLR increases reward variance within rollout groups, breaking the degeneracy (zero within-group reward standard deviation, hence uniformly zero normalized advantages) that nullifies gradient signals in outcome-only regimes. By crediting retrieval milestones, FOLR injects process-aligned signals, yielding empirically:
- Higher policy entropy (delayed collapse)
- Lower KL drift from initialization
- Smoother, less erratic gradient norms
Formal convergence follows from standard PPO guarantees, contingent on nondegeneracy of the advantage estimates (Ma et al., 30 Jan 2026).
6. Empirical Performance and Analysis
Extensive benchmarking across seven QA datasets (including NaturalQuestions, TriviaQA, HotpotQA) on Qwen2.5-3B and Qwen2.5-7B-Instruct LLMs demonstrates that TSPO, with FOLR, achieves marked performance improvements over search-augmented RL baselines:
| Model | FOLR/TSPO EM | Best Baseline EM | Relative Gain |
|---|---|---|---|
| Qwen2.5-3B | 0.403 | 0.325 | +24.0% |
| Qwen2.5-7B | 0.444 | 0.385 | +13.6% |
These gains persist across in-domain and out-of-domain splits. Additional analyses show that FOLR resolves the stagnation of reward learning in all-wrong groups, facilitates faster and more stable training convergence, and enables more concise generated explanations without sacrificing answer fidelity.
7. Limitations and Future Directions
FOLR presumes that the ground-truth answer appears verbatim in the retrieved evidence, a requirement that does not generalize to settings solvable by pure synthesis or where correct retrieval is infeasible. Its application is thus principally suited to multi-turn search-augmented reasoning. Notably, FOLR requires neither reward models nor human annotations.
Potential areas for extension include dynamic selection or weighting of latent process signals, adaptation to synthetic/code-based RLHF pipelines, and integration with multi-agent or curriculum learning schedules. Adaptive tuning of the partial reward coefficient or generalization to scenarios where process-level progress signals are domain-specific, rather than strictly tied to retrieval, are plausible future research avenues (Ma et al., 30 Jan 2026).