A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning

Published 18 Jul 2025 in cs.LG and cs.AI | (2507.14295v2)

Abstract: Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs) to reflect on their reasoning and revise from feedback. Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., "Let's try again") after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving. It can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO keeps single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling LLMs to better react to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback


Authors (8)

Summary

The paper introduces UFO, a framework that transforms single-turn RL into effective multi-turn reasoning by leveraging minimal unary feedback.
It employs Markov Decision Processes and Proximal Policy Optimization, integrating reward decay and repetition penalties to improve response diversity.
Experiments demonstrate up to a 14% accuracy improvement and enhanced generalization across diverse tasks using multi-turn training.

Introduction

The paper "A Simple 'Try Again' Can Elicit Multi-Turn LLM Reasoning" addresses the challenge of enabling large reasoning models (LRMs) to engage in multi-turn reasoning and adapt their responses based on feedback. Single-turn reinforcement learning (RL) methods have demonstrated efficacy in enhancing the reasoning capabilities of LLMs but often falter in multi-turn interactive tasks due to repetitive and static response behaviors (Figure 1).

Figure 1: Single-turn RL causes LLMs to repeat the same answer across turns instead of revising based on feedback.

To address this gap, the authors introduce Unary Feedback as Observation (UFO), a framework that trains models for multi-turn reasoning using minimal feedback signals. UFO relies on simple unary feedback, such as the generic prompt "Let's try again," to encourage iterative exploration and adaptation within existing RL setups.

UFO Framework and Implementation

The UFO framework models multi-turn interaction as a Markov Decision Process (MDP) in which the state at each step is the interaction history so far. Unary feedback is what turns single-turn datasets into multi-turn sessions. The multi-turn policy is optimized with Proximal Policy Optimization (PPO), and reward decay and repetition penalties are added to improve reasoning efficiency and answer diversity (Figure 2).

Figure 2: The UFO framework for multi-turn training. At each step $t$, the model observes the full interaction history and generates a response. Correct responses receive discounted rewards $\gamma^t$, while incorrect ones receive none.
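
As a worked illustration of the discounting described in the Figure 2 caption, the per-turn reward can be written as below. The indicator notation, the symbols $a_t$ (answer at step $t$) and $a^{*}$ (reference answer), and the value $\gamma = 0.5$ are assumptions made for this example, not values taken from the paper.

```latex
r_t = \gamma^{t}\,\mathbb{1}\!\left[a_t = a^{*}\right], \qquad 0 < \gamma < 1
% Example with \gamma = 0.5 (assumed):
%   correct at t = 1 gives r_1 = 0.5
%   correct at t = 3 gives r_3 = 0.125
%   any incorrect answer gives r_t = 0
```

Because $\gamma^{t}$ shrinks as $t$ grows, an earlier correct answer always earns more than a later one, which is what pushes the policy toward solving the problem in fewer turns.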

To implement this framework:

State Construction: Concatenate the interaction history into a single prompt, appending the unary feedback after each incorrect answer so that it serves as the next observation.
Policy Optimization: Adopt PPO, which uses a learned critic for fine-grained value assessments over multi-turn episodes.
Reward Structuring: Define rewards that decay exponentially over turns, encouraging minimal turn count, and apply a penalty for repeated answers to foster diversity.
Training: Train in batches with multiple rollouts per prompt, evaluating success on the multi-turn episodes built from the single-turn datasets.
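
The steps above can be sketched as a small rollout loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt template, the function names `build_prompt` and `run_episode`, the exact-match correctness check, the fixed repetition penalty of 0.1, the discount of 0.5, and the zero-based turn index are all hypothetical choices. The PPO update itself is omitted; the sketch only shows how states and rewards for one episode might be produced.

```python
from typing import Callable, List, Tuple

# Unary feedback string; the wording follows the example quoted in the abstract.
FEEDBACK = "Let's try again."


def build_prompt(question: str, history: List[str]) -> str:
    """State construction: the question plus every earlier attempt,
    each followed by the same unary feedback (hypothetical template)."""
    parts = [f"Question: {question}"]
    for attempt in history:
        parts.append(f"Answer: {attempt}")
        parts.append(f"Feedback: {FEEDBACK}")
    parts.append("Answer:")
    return "\n".join(parts)


def run_episode(
    question: str,
    truth: str,
    policy: Callable[[str], str],
    max_turns: int = 5,
    gamma: float = 0.5,           # assumed discount factor
    repeat_penalty: float = 0.1,  # assumed penalty for repeating a wrong answer
) -> Tuple[List[str], List[float]]:
    """Roll out one multi-turn episode and return (answers, per-turn rewards)."""
    history: List[str] = []
    rewards: List[float] = []
    for t in range(max_turns):
        answer = policy(build_prompt(question, history))
        if answer == truth:
            # Correct answers earn a reward that decays with the turn index.
            rewards.append(gamma ** t)
            history.append(answer)
            break
        # Wrong answers earn nothing; repeating an earlier wrong answer is penalized.
        rewards.append(-repeat_penalty if answer in history else 0.0)
        history.append(answer)
    return history, rewards


# Toy policy for demonstration: picks its guess by counting feedback lines seen so far.
guesses = ["12", "12", "15"]
toy_policy = lambda prompt: guesses[prompt.count("Feedback:")]

answers, rewards = run_episode("What is 7 + 8?", "15", toy_policy)
print(answers)  # ['12', '12', '15']
print(rewards)  # [0.0, -0.1, 0.25]
```

In the toy run, the second turn repeats the first wrong answer and is penalized, while the third turn is correct and receives the discounted reward for turn index 2.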

Experimental Results

The experiments show that multi-turn RL with UFO improves interactive reasoning, with models achieving up to a 14% gain in multi-turn reasoning accuracy over single-turn RL. Models trained with multiple turns also generalize across diverse task domains and adapt to different evaluation setups, including single-turn scenarios (Figure 3).

Figure 3: Validation performance (Succ@k) of models trained with different roll-out turns under varying inference-time turn budgets. Multi-turn training (5 or 10 turns) consistently yields higher success rates across all inference turn budgets.
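
For readers unfamiliar with the metric, the sketch below shows one way Succ@k could be computed. It assumes Succ@k denotes the fraction of problems answered correctly within the first k turns; the function name `succ_at_k`, the one-based turn numbering, and the sample data are hypothetical and not taken from the paper.

```python
from typing import List, Optional


def succ_at_k(first_correct_turn: List[Optional[int]], k: int) -> float:
    """Fraction of problems solved within the first k turns.

    Each entry is the 1-indexed turn at which the problem was first
    answered correctly, or None if it was never solved.
    """
    if not first_correct_turn:
        raise ValueError("need at least one problem")
    solved = sum(1 for turn in first_correct_turn if turn is not None and turn <= k)
    return solved / len(first_correct_turn)


# Made-up outcomes for five problems (not results from the paper).
turns = [1, 3, None, 2, 5]
print(succ_at_k(turns, 1))  # 0.2
print(succ_at_k(turns, 3))  # 0.6
print(succ_at_k(turns, 5))  # 0.8
```

Under this definition the metric never decreases as the turn budget k grows, so differences between training settings show up as gaps between curves at a given k.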

Limitations and Future Directions

While the UFO framework advances multi-turn reasoning capabilities, it has been validated primarily on smaller models, so its behavior at larger scales remains untested and is left for future research. In addition, reward structures that align more closely with complex reasoning processes could help address the drift in reasoning quality observed over multi-turn trajectories (Figure 4).

Figure 4: Performance across different evaluation round settings. Each subplot shows the success rate evaluated at r rounds.

Conclusions

The paper presents a significant step toward bridging the gap between single-turn and multi-turn reasoning capabilities in LRMs. By leveraging minimal feedback within familiar RL paradigms, UFO offers a practical and lightweight approach to improve iterative reasoning without requiring extensive structural changes. This work points to a promising direction for practical applications where limited feedback can still yield deep and adaptive model reasoning.
