
Multi-Round Scalar Reward Prompting

Updated 25 January 2026
  • Multi-round scalar reward prompting iteratively refines language model outputs using numerical feedback delivered after each attempt.
  • Recent work uses both explicit RL (GRPO) and in-context learning strategies to optimize response correctness, coherence, and reasoning accuracy.
  • Empirical results show substantial gains in multi-hop QA, math reasoning, creative writing, and other tasks compared to traditional methods.

Multi-round scalar reward prompting refers to an inference- or training-time protocol wherein a large language model (LLM) or a prompting agent engages in a multi-turn interaction, iteratively generating responses or prompts and receiving a scalar (numerical) reward after each attempt or sequence. This feedback mechanism enables learning or improvement, either via explicit reinforcement learning (RL) or through context-driven adaptation, such that subsequent prompts or generations increasingly maximize a predefined objective, typically response correctness, coherence, generation quality, or reasoning accuracy. Recent research advances notably encompass (1) in-context RL paradigms for static, fixed LLMs and (2) agentic prompting schemes with trainable policies and explicit RL objectives, each leveraging scalar rewards to optimize complex prompting strategies (Liu et al., 2 Nov 2025, Song et al., 21 May 2025).

1. Formal Foundation and Problem Definition

In multi-round scalar reward prompting, each episode consists of a sequence of rounds, indexed by $t$. At round $t$, the system state $S_t$ encodes the interaction so far; the agent (which may be a trainable LM or a fixed LM) selects an action $A_t$ (a response, sub-prompt, or sequence), and an external process or environment returns a scalar reward $R_{t+1}$ reflecting the quality or utility of the response.

  • In "Prompt-R1," the process is formulated as a Markov decision process (MDP) with composite state representation $H_t = [(a_1^{\text{prompt}}, r_1^{\text{prompt}}), \ldots, (a_t^{\text{prompt}}, r_t^{\text{prompt}})]$ and an agent internal state $F_t = S_t(F_{t-1}, a_t^{\text{think}}, a_t^{\text{prompt}}, r_t^{\text{prompt}})$. Actions comprise reasoning decisions and prompt generations, with transitions mediated by large-scale LLM responses (Liu et al., 2 Nov 2025).
  • In "Reward Is Enough," the state $S_t$ comprises the core task specification, all past attempts, and associated rewards; actions $A_t$ correspond to new candidate outputs; rewards $r_{t+1}$ are externally supplied scalars (Song et al., 21 May 2025).

The joint objective across both settings is to maximize the cumulative scalar reward, either explicitly via policy optimization (end-to-end RL) or implicitly through context-driven selection.
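As a minimal sketch of this episode structure (not taken from either paper; the policy and reward function below are toy stand-ins for the LM agent and the environment), the round-by-round loop can be written as:

```python
def run_episode(policy, reward_fn, max_rounds=8):
    """Generic multi-round scalar-reward episode: the state S_t is the
    accumulated history of (action, reward) pairs, the policy selects the
    next action A_t from that history, and reward_fn supplies R_{t+1}."""
    state = []            # S_t: interaction history
    cumulative = 0.0
    for t in range(max_rounds):
        action = policy(state)        # A_t: response, sub-prompt, or sequence
        reward = reward_fn(action)    # scalar R_{t+1}
        state.append((action, reward))
        cumulative += reward
    return state, cumulative

# Toy instantiation: reward is output length; the policy extends its
# best-scoring past attempt, so cumulative reward grows round over round.
def toy_policy(state):
    if not state:
        return "draft"
    best_action, _ = max(state, key=lambda pair: pair[1])
    return best_action + "!"

history, total = run_episode(toy_policy, lambda a: float(len(a)), max_rounds=3)
```

Whether `policy` is a trainable network or a frozen LLM conditioning on the growing `state` is exactly the split between the two paradigms discussed below.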

2. Reward Mechanisms and Design

Central to the effectiveness of multi-round scalar reward prompting is the reward signal. The reward can be structured to reflect properties such as answer correctness, response format compliance, or fine-grained stepwise progress:

  • "Prompt-R1" uses a dual-constrained scalar reward composed of a format reward $R_\text{fmt}(\tau)$, which enforces well-formed reasoning and completion, and an answer correctness reward $R_\text{ans}(\tau)$, based on maximal token-level F1-score against references. Rewards are combined via a gated mechanism to ensure format satisfaction is prioritized before correctness (Liu et al., 2 Nov 2025).

$$R(\tau) = \begin{cases} -\epsilon + R_\text{fmt}(\tau) + R_\text{ans}(\tau), & \text{if } R_\text{fmt}(\tau) = \epsilon \\ -\epsilon + R_\text{fmt}(\tau), & \text{otherwise} \end{cases}$$

  • In "Reward Is Enough," rewards can be episode-level binary/categorical values (e.g., success/failure), model-generated judgment scores, or dense signals from interactive environments. For instance, arithmetic step rewards in Game of 24 are scored as $\{0, 1, 3\}$ by a GPT-4.1 judge, creative writing rewards are GPT-assigned coherence scores (1–10), and ScienceWorld rewards are subgoal completions, all made explicit in the prompt as "Reward: $\cdot$" tokens (Song et al., 21 May 2025).

Proper reward design ensures that the prompting loop meaningfully drives the agent (or LM) toward improved reasoning and generation.
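The gated combination can be sketched directly from the case equation; the cap value $\epsilon$ and the component rewards below are illustrative values, not the paper's exact hyperparameters:

```python
def gated_reward(r_fmt, r_ans, eps=0.1):
    """Dual-constrained gating in the spirit of Prompt-R1: the answer
    reward contributes only once the format reward reaches its cap eps;
    otherwise the trajectory keeps a net format penalty."""
    if r_fmt == eps:                  # format fully satisfied
        return -eps + r_fmt + r_ans  # net reward equals r_ans
    return -eps + r_fmt              # correctness is ignored

# Well-formed trajectory: correctness drives the reward.
well_formed = gated_reward(r_fmt=0.1, r_ans=0.8)   # 0.8
# Malformed trajectory: non-positive regardless of answer quality.
malformed = gated_reward(r_fmt=0.0, r_ans=0.8)     # -0.1
```

The gate makes format compliance a hard prerequisite: no answer quality can compensate for a malformed trajectory, which matches the stated prioritization of format over correctness.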

3. Multi-Round Prompting Protocols

The multi-round protocol entails iterative generation, feedback, and prompt accumulation:

  • In the RL-agent paradigm ("Prompt-R1"), a small-scale LLM agent (e.g., Qwen3-4B with LoRA adapters) observes the full prompt history, generates a reasoning trace and prompt, receives a structured response from a large-scale LLM (e.g., GPT-4o-mini or GPT-OSS-20B), and the loop continues until maximal turns or early termination. The final output is judged and the trajectory is used for policy update (Liu et al., 2 Nov 2025).
  • In the in-context RL (ICRL) paradigm ("Reward Is Enough"), a frozen LLM receives, at each round, a prompt concatenating the static task description, all past attempts and rewards, and a meta-instruction (e.g., "Try again and improve your answer"). The model generates a new attempt, is given the numerical reward, and the process repeats for $K$ rounds (typically up to the context window limit) (Song et al., 21 May 2025).

The guiding principle is a monotonic context expansion, treating the prompt as an explicit experience buffer for the model.
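A minimal sketch of this buffer-as-prompt construction follows; the exact formatting is an assumption, since the paper specifies only that rewards appear explicitly as "Reward:" tokens:

```python
def build_icrl_prompt(task, attempts_and_rewards,
                      meta_instruction="Try again and improve your answer."):
    """Monotonic context expansion: concatenate the static task, every
    past (attempt, reward) pair in order, and a closing meta-instruction.
    Nothing is ever dropped, so the prompt acts as an experience buffer."""
    lines = [f"Task: {task}"]
    for k, (attempt, reward) in enumerate(attempts_and_rewards, start=1):
        lines.append(f"Attempt {k}: {attempt}")
        lines.append(f"Reward: {reward}")
    lines.append(meta_instruction)
    return "\n".join(lines)

# Hypothetical Game of 24 round with one failed prior attempt.
prompt = build_icrl_prompt(
    "Use 4, 7, 8, 8 to make 24.",
    [("(8 - 4) * 7 - 8", 0)],
)
```

Each new round simply appends one more (attempt, reward) pair and re-sends the whole prompt, so round $K$ conditions on all $K-1$ earlier attempts.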

4. Learning and Optimization

Learning mechanisms in multi-round scalar reward prompting bifurcate into explicit RL (with parameter updates) and implicit in-context learning (fixed parameters):

  • "Prompt-R1" employs Group-Relative Policy Optimization (GRPO), a PPO variant that normalizes trajectory-level rewards within disjoint groups ($M=16$). The policy $\pi$ is updated by minimizing:

$$L_\text{GRPO} = -\frac{1}{M}\sum_{i=1}^{M}\left[ \sum_{t=1}^{T_i} \hat A_i \log \pi(a_{i,t} \mid H_{i,t-1}, q) \right] + \zeta\, D_{KL}(\pi_\text{old} \,\|\, \pi_\text{new})$$

where the advantage $\hat A_i$ is standardized per group, and $\zeta$ modulates policy divergence (Liu et al., 2 Nov 2025).

  • In the ICRL prompting setting, improvement arises not from parameter updates but from the LLM’s emergent mechanism to attend and adapt to past high-reward attempts; the model uses scalar rewards as in-context signals for policy improvement, without gradient-based updates (Song et al., 21 May 2025).

A striking empirical finding is that numerical scalar rewards—even when noisy or LLM-derived—enable robust improvement, consistent with the reward-hypothesis of RL (Song et al., 21 May 2025).
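The per-group standardization that produces $\hat A_i$ can be sketched in pure Python (group contents here are illustrative; a batch would apply this independently to each of the $M$ groups):

```python
import statistics

def group_relative_advantages(group_rewards):
    """Standardize trajectory-level rewards within one group: subtract
    the group mean and divide by the group standard deviation, so each
    advantage measures how a trajectory fares relative to its group
    rather than on an absolute reward scale."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)
    if sigma == 0.0:                 # degenerate group: all rewards equal
        return [0.0 for _ in group_rewards]
    return [(r - mu) / sigma for r in group_rewards]

# Two successes and two failures in one group of trajectories:
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# mean 0.5, std 0.5: successes get +1.0, failures get -1.0
```

Because advantages are zero-mean within each group, the update pushes probability mass toward above-average trajectories without needing a learned value baseline.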

5. Architectural and Implementation Considerations

Key implementation choices impact the feasibility and effectiveness of multi-round scalar reward prompting:

| Component | Prompt-R1 (Liu et al., 2 Nov 2025) | ICRL Prompting (Song et al., 21 May 2025) |
|---|---|---|
| Agent | Qwen3-4B (LoRA adapters) | Frozen pretrained LLM (e.g., GPT-4.1) |
| Environment | GPT-4o-mini, GPT-OSS-20B | Task context + buffer |
| Reward integration | Dual-constrained (format, correctness) | Scalar numerical, per round |
| Update mechanism | Explicit RL (GRPO, PPO-like, 3 epochs) | In-context RL, no parameter updates |
| Prompt structure | (Think, Prompt, Response, Reward) tuples | [Task description; attempts + rewards; meta-instruction] |
| Buffer size | Up to 2,048 tokens, $T=8$ rounds | Up to LLM context window, $K=50$ rounds |

Distinctive practices include capping format rewards, structured prompt composition, and the option for locally deployed LLMs (for "zero-cost" cycles).

6. Empirical Performance and Evaluation

Empirical benchmarks demonstrate the effectiveness of multi-round scalar reward prompting on a variety of tasks spanning multi-hop QA, math reasoning, creative writing, science-based interactive environments, and summarization:

  • "Prompt-R1" consistently yields large gains over SFT, CoT, and state-of-the-art automatic prompt optimization methods:
    • Multi-hop QA F1: 17.8% → 54.4% (HotpotQA, +8.1)
    • GSM8K math EM/F1: 97.7% (vs. 92.97% CoT, +4.7)
    • Creative writing SSim: 22.1% (vs. 12.2%, +9.9)
    • Out-of-distribution F1 improvement: +4.55 points (Liu et al., 2 Nov 2025).
  • ICRL prompting ("Reward Is Enough") achieves:
    • Game of 24 (ICRL Preset): 90% success (vs. 49% for Best-of-N)
    • Creative writing: 86% win rate vs. Self-Refine, 93% vs. Best-of-N
    • ScienceWorld (mean return): 88 ± 0.7 (vs. 83 ± 0.9 for Self-Refine), robust to LLM-generated scalar rewards (Song et al., 21 May 2025).

Performance degrades substantially if rewards are omitted, the episode buffer is truncated, or rewards are replaced with natural-language critiques, confirming the critical role of explicit scalar feedback.

7. Practical Recommendations and Theoretical Implications

Best practices for multi-round scalar reward prompting emphasize minimal yet explicit encoding of numerical feedback, full retention of all responses and rewards (learning from failure), buffer expansion up to the context limit, and the use of concise meta-instructions to balance exploration and exploitation.

A principal theoretical implication is the demonstration that scalar reward signals suffice for forward-pass meta-RL in large LMs; numerical rewards are parsimoniously interpreted, and even self-generated reward signals can induce non-trivial policy improvement in fixed-parameter LMs (Song et al., 21 May 2025). The strong gains obtained by scalar-reward-driven multi-round prompting further substantiate the RL reward hypothesis and highlight the utility of plug-and-play prompting agents for complex reasoning in large-scale LM inference pipelines (Liu et al., 2 Nov 2025, Song et al., 21 May 2025).

