
Numerical reward feedback vs. verbal feedback for inference-time LLM self-improvement

Establish whether numerical scalar reward feedback, used within the proposed in-context reinforcement learning (ICRL) prompting framework for large language models, is a competitive alternative to verbal textual feedback for inference-time self-improvement across tasks. Concretely, this requires demonstrating comparable or superior performance when only scalar rewards are provided in the growing interaction context.


Background

The paper introduces an in-context reinforcement learning (ICRL) prompting framework where an LLM generates responses over multiple episodes and receives only numerical scalar rewards, which are appended to the context to facilitate inference-time learning. This contrasts with multi-round self-improvement methods such as Self-Refine and Reflexion that use verbal textual feedback (reflections or critiques) rather than scalar rewards.
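The loop below is a minimal sketch of this kind of scalar-feedback procedure, not the paper's exact implementation. The `generate` and `reward_fn` callables are hypothetical stand-ins for an LLM call conditioned on the accumulated context and a task-specific scoring function, respectively.

```python
from typing import Callable


def icrl_loop(
    task_prompt: str,
    generate: Callable[[str], str],
    reward_fn: Callable[[str], float],
    num_episodes: int = 5,
) -> str:
    """Run several episodes, appending only scalar rewards to the growing context."""
    context = task_prompt
    best_response, best_reward = "", float("-inf")

    for episode in range(num_episodes):
        response = generate(context)   # sample a new attempt conditioned on past attempts and rewards
        reward = reward_fn(response)   # numerical feedback only; no verbal critique is produced

        # The interaction history grows with (response, scalar reward) pairs.
        context += (
            f"\n\nEpisode {episode + 1} response:\n{response}"
            f"\nReward: {reward:.2f}"
            "\nGenerate an improved response that achieves a higher reward."
        )

        if reward > best_reward:
            best_response, best_reward = response, reward

    return best_response
```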

Across benchmarks (Game of 24, creative writing, and ScienceWorld), the authors observe strong improvements using scalar rewards appended to the context. They argue that textual self-verification can suffer from hallucinations and instability, motivating the question of whether numerical feedback alone can serve as an effective alternative for guiding inference-time improvement in LLMs.
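For contrast, a Reflexion- or Self-Refine-style loop replaces the scalar reward with a verbal critique produced by the model itself. The sketch below assumes hypothetical `generate` and `critique` LLM calls and is not taken from the cited methods' code.

```python
from typing import Callable


def verbal_feedback_loop(
    task_prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    num_rounds: int = 5,
) -> str:
    """Iteratively refine a response using textual self-feedback instead of scalar rewards."""
    context = task_prompt
    response = generate(context)

    for _ in range(num_rounds):
        feedback = critique(task_prompt, response)  # textual reflection; may hallucinate or be unstable
        context += (
            f"\n\nPrevious response:\n{response}"
            f"\nFeedback:\n{feedback}"
            "\nRevise the response to address the feedback."
        )
        response = generate(context)

    return response
```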

References

Given the proven success of numerical feedbacks in RL and the demonstrated performance improvement over baselines in our experiments, we conjecture that the numerical feedback might be a competing alternative for the verbal feedback.

Reward Is Enough: LLMs Are In-Context Reinforcement Learners (2506.06303 - Song et al., 21 May 2025) in Conclusion