Exploring Multi-Attempt Reinforcement Learning for LLMs
The paper "Learning from Failures in Multi-Attempt Reinforcement Learning" by Stephen Chung, Wenyu Du, and Jie Fu presents an innovative approach to augmenting reinforcement learning (RL) for LLMs by introducing a multi-attempt framework. This paper builds upon existing methodologies aimed at enhancing reasoning capabilities of LLMs, such as those demonstrated by DeepSeek R1, by allowing models to iteratively refine their responses based on feedback from incorrect attempts.
Context and Motivation
Traditional RL applications to LLMs typically rely on single-turn tasks, where a model receives a sparse reward based on the correctness of its initial output. While effective to some extent, this approach limits a model's ability to improve and adapt its outputs through interaction with an environment or user feedback. The authors propose a shift to multi-attempt tasks, in which a model can generate multiple responses to a prompt, receiving feedback after each attempt and refining its answer accordingly. This framework provides a richer learning signal, encouraging iterative reasoning and responsiveness to feedback.
Methodology
The proposed multi-attempt framework modifies the standard RL environment into a multi-turn task setting:
- Task Design: LLMs are given multiple attempts to answer a question. If a response is incorrect, the model receives feedback and can try again until the allotted number of attempts is exhausted. The feedback indicates whether the response was wrong or improperly formatted, encouraging the model to learn not merely the solution but the process of refining its reasoning.
- Reward System: Rewards are +1 for a correct answer, -0.5 for an incorrect answer in the correct format, and -1 otherwise. This configuration penalizes unproductive attempts and encourages the model to use earlier feedback to correct itself (a minimal sketch of this loop and reward scheme appears after this list).
- Training Algorithm: Standard Proximal Policy Optimization (PPO) is used to optimize the model on this revised task structure (a bare-bones clipped-objective sketch also follows below).
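To make the interaction concrete, here is a minimal sketch of how one multi-attempt episode with this reward scheme might look. It is an illustration under stated assumptions, not the authors' implementation: the helpers `generate_response`, `is_well_formatted`, and `check_answer`, the feedback string, and the default of two attempts are hypothetical stand-ins.

```python
# Sketch of a single multi-attempt episode with the reward scheme described above.
# `generate_response`, `is_well_formatted`, and `check_answer` are hypothetical
# helpers standing in for the model call and the answer verifier.

def run_multi_attempt_episode(question, generate_response, is_well_formatted,
                              check_answer, max_attempts=2):
    """Collect (prompt, response, reward) triples for one question."""
    dialogue = [question]
    transitions = []
    for _ in range(max_attempts):
        prompt = "\n".join(dialogue)
        response = generate_response(prompt)            # model's next attempt
        if check_answer(response):
            reward = 1.0                                 # correct answer
        elif is_well_formatted(response):
            reward = -0.5                                # wrong answer, valid format
        else:
            reward = -1.0                                # malformed response
        transitions.append((prompt, response, reward))
        if reward == 1.0:
            break                                        # stop once correct
        # Feedback tells the model its attempt failed and invites another try.
        dialogue += [response, "Your answer is incorrect. Please try again."]
    return transitions
```

The collected transitions would then feed the RL update; the stopping rule and feedback wording here are illustrative choices rather than details taken from the paper.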
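Since the paper relies on standard PPO rather than a bespoke optimizer, the update itself is the familiar clipped surrogate objective. The sketch below is generic PPO over per-token log-probabilities and advantages, written with PyTorch; it is not code from the paper, and the tensor shapes and clipping value are assumptions.

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss over a batch of token log-probabilities."""
    ratio = torch.exp(new_logprobs - old_logprobs)        # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Negative sign because optimizers minimize, while PPO maximizes the surrogate.
    return -torch.mean(torch.min(unclipped, clipped))
```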
Results
The experiments show clear gains for the multi-attempt LLM over its single-turn counterpart, particularly in refining incorrect answers. On math benchmarks, accuracy improved from 45.6% with one attempt to 52.5% with two, suggesting the model learns to correct its answers based on prior failed attempts. The single-turn baseline showed only marginal gains when given additional attempts. Moreover, even under single-attempt evaluation, the LLM trained on multi-attempt tasks slightly outperformed the baseline trained on single-turn tasks.
Implications and Future Directions
These findings suggest that multi-attempt tasks can meaningfully improve reasoning and adaptive learning in LLMs. Practically, this could benefit applications that require complex, interactive decision-making, such as advanced tutoring systems or interactive instructional design.
Theoretically, the multi-attempt framework supports the emergence of self-refinement capabilities similar to the Aha Moment identified in DeepSeek R1 Zero. This capability can be pivotal in evolving towards more autonomous AI systems capable of complex problem-solving.
Future work in this area may involve integrating varied feedback mechanisms and further exploring auxiliary learning tasks that nurture emergent capabilities useful to LLM applications. The paper points to multi-turn settings as fertile ground for advancing AI beyond static task contexts, making iterative refinement a native and efficient behavior of LLMs.
In summary, the paper demonstrates a promising direction: using multi-attempt reinforcement learning to foster more nuanced reasoning in LLMs, with tangible benefits for adaptive learning and problem-solving. It invites deeper exploration of multi-attempt scenarios and their potential to further advance AI capabilities and applications.