
First-Occurrence Latent Reward (FOLR)

Updated 6 February 2026
  • The paper introduces FOLR, a mechanism assigning turn-level rewards at the precise moment the ground-truth is first retrieved, thereby crediting partial progress in multi-turn reasoning.
  • FOLR mitigates process and intra-group homogenization by providing nonzero rewards for intermediate retrievals, which enhances the variance needed for effective advantage estimation.
  • Empirical results on datasets like NaturalQuestions and TriviaQA show that TSPO with FOLR improves exact match scores by 13–24% over baselines, leading to more stable and effective policy updates.

The First-Occurrence Latent Reward (FOLR) is a turn-level reward assignment mechanism introduced within Turn-level Stage-aware Policy Optimization (TSPO) to address inefficiencies and optimization barriers in multi-turn tool-augmented reasoning. FOLR’s central principle is to allocate explicit reward at the process step where the ground-truth answer is first retrieved during iterative search, providing partial credit for demonstrated progress. This turn-level reward signal increases intra-group reward variance and resolves the homogenization problems endemic to conventional outcome-only reinforcement learning frameworks in this domain (Ma et al., 30 Jan 2026).

1. Problem Setting and Rationale

In multi-turn search-augmented generation tasks, an agent (often an LLM) interacts with external tools (e.g., search engines) to iteratively retrieve evidence and synthesize answers. The standard RL formalism defines the environment as a Markov Decision Process (MDP) with:

  • State $s_k$ capturing dialogue history and accumulated tool outputs up to turn $k$.
  • Action $a_k$ as either a tool query or a generative response.
  • Transition dynamics updating states based on tool feedback $f_k$.
  • Reward $\mathcal{R}$, usually given sparsely: only the final output is compared to the ground truth, assigning $r = 1$ if the model’s answer is correct and $r = 0$ otherwise; all intermediate steps are rewardless.
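The sparse outcome-level reward scheme above can be sketched as follows. This is an illustrative sketch only; the string-matching check and the trajectory layout are assumptions, not the paper's implementation:

```python
def outcome_reward(final_answer: str, gold: str) -> float:
    """Sparse outcome-level reward: 1 only if the final answer matches
    the ground truth; every intermediate turn receives 0."""
    return 1.0 if final_answer.strip().lower() == gold.strip().lower() else 0.0

# A trajectory with three turns receives reward only at the final turn:
turn_rewards = [0.0, 0.0, outcome_reward("Paris", "paris")]
# turn_rewards == [0.0, 0.0, 1.0]
```

Under this scheme, a rollout that retrieves strong evidence at turn 1 but answers incorrectly at turn 3 receives exactly the same (zero) signal as one that retrieves nothing at all.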

Such outcome-level rewards entirely ignore intermediate reasoning and retrieval progress, leading to two homogenization phenomena:

  • Process homogenization: All forms of procedural progress are erased when the final answer is incorrect, even if the model successfully retrieved relevant evidence.
  • Intra-group homogenization: Group-based RL methods (e.g., Group Relative Policy Optimization, GRPO) compute normalized advantages within rollout groups. If all rollouts return zero reward, as under uniform failure with outcome-level rewards, the advantage gradients vanish and policy improvement stalls.
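The intra-group degeneracy can be seen with a minimal numeric sketch of GRPO-style normalization (the exact normalization constants are an assumption; the paper may differ in details such as the epsilon term):

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization: advantage = (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# An all-wrong group under outcome-only rewards: every reward is 0,
# so every normalized advantage is 0 and the group contributes no gradient.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

Any reward scheme that differentiates the rollouts within such a group restores a nonzero standard deviation and therefore nonzero advantages, which is exactly the lever FOLR exploits.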

2. FOLR Mechanism: Definition and Implementation

FOLR is built on the insight that the retrieval of the ground-truth answer within the search context is a critical latent signal for progress, even if not ultimately synthesized in the final response. The mechanism operates as follows:

  • For each trajectory $i$, define the first-occurrence turn $t^*_i = \min\{k \leq T_i : a_{\text{gold}} \in f_{i,k}\}$, where $f_{i,k}$ is the tool feedback at turn $k$.
  • Assign turn-level rewards $r_{i,k}$ by:

$$r_{i,k} = \begin{cases} 1, & \text{if } a_i = a_{\text{gold}} \quad \text{(full success)} \\ \alpha, & \text{if } a_i \neq a_{\text{gold}} \wedge k \le t_i^* \quad \text{(partial credit up to first occurrence)} \\ 0, & \text{otherwise,} \end{cases}$$

with $\alpha \in (0, 1]$ typically set to $1$.

This assigns nonzero reward to intermediate steps demonstrating evidence acquisition for the ground-truth answer, even when synthesis is incorrect or incomplete.
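The piecewise definition above can be sketched in code. This is a hedged reconstruction from the stated formula, not the authors' implementation: the substring membership test for $a_{\text{gold}} \in f_{i,k}$ and the function signature are assumptions:

```python
def folr_rewards(feedbacks, gold, final_correct, alpha=1.0):
    """First-Occurrence Latent Reward per the piecewise definition.

    feedbacks: list of tool-feedback strings f_{i,k}, one per turn.
    Returns one reward per turn: 1 everywhere on fully correct rollouts,
    alpha for turns up to the first occurrence of the gold answer on
    incorrect rollouts, 0 otherwise.
    """
    num_turns = len(feedbacks)
    if final_correct:
        return [1.0] * num_turns
    # First-occurrence turn t* (0-indexed here; the paper indexes from 1).
    t_star = next((k for k, f in enumerate(feedbacks) if gold in f), None)
    if t_star is None:
        return [0.0] * num_turns  # gold never retrieved: no credit
    return [alpha if k <= t_star else 0.0 for k in range(num_turns)]

# Incorrect final answer, but the gold string appears in turn 1's feedback:
print(folr_rewards(["no hit", "... Paris is the capital ...", "noise"],
                   "Paris", final_correct=False))
# -> [1.0, 1.0, 0.0]
```

Note how the trajectory earns credit for the turns leading up to and including the first retrieval, even though the final synthesis failed.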

3. Resolution of Reward Homogenization

Process-level disambiguation: Under FOLR, trajectories where the agent successfully retrieves the correct answer but fails to synthesize it (near misses, $O^{-}/P^{+}$) are distinguished from complete failures ($O^{-}/P^{-}$). Reward is present for the turns up to the retrieval, ensuring process credit.

Intra-group advantage variance: Within a rollout group where all final answers are incorrect (“all-wrong”), some trajectories may have retrieved relevant evidence (partial progress) while others have not. The per-turn standard deviation $\sigma_k$ among the $r_{i,k}$ becomes nonzero, producing nontrivial advantages $\hat{A}_{i,k}$ and enabling effective gradient computation and learning.
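A minimal sketch of this per-turn group normalization, assuming a mean/std normalization at each turn (the epsilon stabilizer and the exact normalization form are assumptions):

```python
def per_turn_advantages(group_rewards, eps=1e-8):
    """Per-turn group normalization (illustrative sketch):
    A_{i,k} = (r_{i,k} - mean_k) / (sigma_k + eps), where mean_k and
    sigma_k are computed across the rollout group at each turn k."""
    num_turns = len(group_rewards[0])
    advs = [[] for _ in group_rewards]
    for k in range(num_turns):
        col = [traj[k] for traj in group_rewards]
        mean = sum(col) / len(col)
        sigma = (sum((r - mean) ** 2 for r in col) / len(col)) ** 0.5
        for i, traj in enumerate(group_rewards):
            advs[i].append((traj[k] - mean) / (sigma + eps))
    return advs

# An all-wrong group of two rollouts over two turns: rollout 0 earned
# FOLR partial credit at turn 0 (it retrieved the gold answer); rollout 1
# never did. Turn 0 now has nonzero sigma_k, so the advantages are
# nonzero despite every final answer being wrong.
advs = per_turn_advantages([[1.0, 0.0], [0.0, 0.0]])
```

Here rollout 0 receives a positive advantage at turn 0 and rollout 1 a negative one, so the policy gradient again distinguishes progress from failure within the group.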

4. Optimization Objective and Policy Training

TSPO, leveraging FOLR, employs a PPO-style surrogate loss with group and turn-level normalization:

$$J_{\mathrm{TSPO}}(\theta) = \mathbb{E}_{x,\, \tau_{1 \ldots G} \sim \pi_\theta} \Biggl[\frac{1}{G} \sum_{i=1}^{G} \sum_{k=1}^{T_i} \mathcal{L}_{i,k} - \beta\, D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)\Biggr]$$

where

$$\mathcal{L}_{i,k} = \min\Bigl( w_{i,k}\,\hat{A}_{i,k},\; \mathrm{clip}\bigl(w_{i,k},\, 1-\epsilon,\, 1+\epsilon\bigr)\, \hat{A}_{i,k} \Bigr),$$

where $w_{i,k}$ is the importance ratio and $\hat{A}_{i,k}$ is the normalized advantage at turn $k$ for rollout $i$, computed from FOLR rewards. Training alternates between trajectory sampling, reward/advantage computation via FOLR, and clipped PPO updates. The group normalization can be applied to “all-wrong” groups only or to all groups; restricting it to “all-wrong” groups improves computational and convergence efficiency.
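The per-turn surrogate term can be sketched as follows; the clipping range $\epsilon = 0.2$ is an assumed illustrative value, not one reported in the source:

```python
import math

def tspo_turn_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for one (rollout, turn) pair:
    L_{i,k} = min(w * A, clip(w, 1 - eps, 1 + eps) * A),
    with importance ratio w = exp(logp_new - logp_old)."""
    w = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, w))
    return min(w * advantage, clipped * advantage)

# A positive-advantage turn whose ratio has grown large is clipped
# at 1 + eps, limiting the size of the policy update:
loss = tspo_turn_loss(logp_new=0.0, logp_old=-1.0, advantage=1.0)
# loss == 1.2  (ratio e^1 ~ 2.72 is clipped to 1.2)
```

As in standard PPO, the clip prevents any single turn with a large importance ratio from dominating the update, while the FOLR-derived advantage supplies the sign and magnitude of the signal.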

5. Theoretical Insights

FOLR increases reward variance within rollout groups, breaking the degeneracy ($\sigma = 0$) that nullifies gradient signals in outcome-only regimes. By crediting retrieval milestones, FOLR injects process-aligned signals, yielding empirically:

  • Higher policy entropy (delayed collapse)
  • Lower KL drift from initialization
  • Smoother, less erratic gradient norms

Formal convergence follows from standard PPO guarantees, contingent on nondegeneracy of the advantage estimates (Ma et al., 30 Jan 2026).

6. Empirical Performance and Analysis

Extensive benchmarking across seven QA datasets (including NaturalQuestions, TriviaQA, HotpotQA) on Qwen2.5-3B and Qwen2.5-7B-Instruct LLMs demonstrates that TSPO, with FOLR, achieves marked performance improvements over search-augmented RL baselines:

| Model | FOLR/TSPO EM | Best Baseline EM | Relative Gain |
|---|---|---|---|
| Qwen2.5-3B | 0.403 | 0.325 | +24.0% |
| Qwen2.5-7B | 0.444 | 0.385 | +13.6% |

These gains persist across in-domain and out-of-domain splits. Additional analyses show that FOLR resolves the stagnation of reward learning in all-wrong groups, facilitates faster and more stable training convergence, and enables more concise generated explanations without sacrificing answer fidelity.

7. Limitations and Future Directions

FOLR presumes the ground-truth answer must appear in retrieved evidence—a requirement that does not generalize to settings solvable by pure synthesis or when correct retrieval is not feasible. Its application is thus principally suited to multi-turn search-augmented reasoning. Notably, FOLR requires neither reward models nor human annotations.

Potential areas for extension include dynamic selection or weighting of latent process signals, adaptation to synthetic/code-based RLHF pipelines, and integration with multi-agent or curriculum learning schedules. Plausible future research avenues include adaptive tuning of the partial reward coefficient $\alpha$ and generalization to scenarios where process-level progress signals are domain-specific rather than strictly tied to retrieval (Ma et al., 30 Jan 2026).
