Numerical reward feedback vs. verbal feedback for inference-time LLM self-improvement
Establish whether numerical scalar reward feedback, used within the proposed in-context reinforcement learning (ICRL) prompting framework, is a competitive alternative to verbal textual feedback for inference-time self-improvement of Large Language Models. Concretely, demonstrate comparable or superior performance across tasks when only scalar rewards, rather than textual critiques, are appended to the model's growing interaction context.
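To make the setup concrete, the following is a minimal sketch of an ICRL loop in which the model conditions only on scalar rewards, never on verbal critiques. The prompt template, the `llm` callable, the `reward_fn` scorer, and the episode count are all illustrative assumptions, not the paper's exact protocol.

```python
from typing import Callable, List, Tuple

def icrl_scalar_loop(
    llm: Callable[[str], str],               # hypothetical LLM interface: prompt -> completion
    reward_fn: Callable[[str, str], float],  # task-specific scorer: (task, answer) -> scalar reward
    task: str,
    n_episodes: int = 5,
) -> str:
    """Sketch of inference-time self-improvement from numerical feedback only.

    Each episode's (answer, reward) pair is appended to the prompt, so the
    model sees a growing interaction context of numerically scored attempts
    and must infer how to improve without any textual critique.
    """
    history: List[Tuple[str, float]] = []
    best_answer, best_reward = "", float("-inf")
    for _ in range(n_episodes):
        # Rebuild the prompt from the task plus all scored attempts so far.
        prompt = task + "\n\nPrevious attempts (scalar rewards only):\n"
        for i, (answer, reward) in enumerate(history, start=1):
            prompt += f"Attempt {i}: {answer}\nReward: {reward:.2f}\n"
        prompt += "\nProduce a new attempt that earns a higher reward.\nAttempt:"
        answer = llm(prompt)
        reward = reward_fn(task, answer)
        history.append((answer, reward))
        if reward > best_reward:
            best_answer, best_reward = answer, reward
    return best_answer
```

In verifiable domains, `reward_fn` could be as simple as an exact-match check against a reference answer or a unit-test pass rate; the question the objective poses is whether such bare scalars drive improvement as effectively as verbal feedback does.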
References
Given the proven success of numerical feedbacks in RL and the demonstrated performance improvement over baselines in our experiments, we conjecture that the numerical feedback might be a competing alternative for the verbal feedback.
— Reward Is Enough: LLMs Are In-Context Reinforcement Learners
(Song et al., 2025, arXiv:2506.06303), Conclusion