
Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use (2504.04736v2)

Published 7 Apr 2025 in cs.AI, cs.CL, and cs.LG

Abstract: Reinforcement learning has been shown to improve the performance of LLMs. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, LLMs must take multiple steps of text generation, reasoning and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluated SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question-answering) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.


Summary

  • The paper introduces SWiRL, a method that generates synthetic multi-step trajectories for improved complex reasoning and tool use.
  • It demonstrates performance improvements of 11.1% to 21.5% on tasks like GSM8K and HotPotQA, highlighting strong cross-task generalization.
  • The study emphasizes the importance of process-based data filtering and scalability in model training, enhancing robustness even without tool access.

Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use

Introduction

Step-Wise Reinforcement Learning (SWiRL) introduces a methodology for multi-step optimization, targeting complex reasoning tasks that require text generation and environment interaction. Traditional RL approaches such as RLHF and RLAIF treat these problems as single-step, which is poorly suited to tasks that require multiple rounds of reasoning and tool use. SWiRL addresses this by generating multi-step synthetic trajectories and applying step-wise reinforcement learning to them, enabling effective decomposition of complex problems and improved performance across a variety of tasks.

Methodology

The SWiRL strategy comprises two stages. In Stage 1, a model, optionally given access to tools such as search engines or calculators, generates multi-step synthetic reasoning trajectories (Figure 1).

Figure 1: In SWiRL Stage 1, multi-step synthetic trajectories are generated, enabling the use of a chain of thought, tools, and end-answer synthesis.
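Stage 1's rollout loop can be sketched roughly as follows. This is a toy illustration under assumed interfaces, not the paper's implementation: `toy_model` and `run_tool` are hypothetical stand-ins for the LLM policy and its tool environment.

```python
# Illustrative sketch of SWiRL Stage 1: a model alternates chain-of-thought
# actions and tool calls until it emits a final answer; every intermediate
# (state, action) pair is recorded as part of the trajectory.

def run_tool(tool: str, query: str) -> str:
    """Toy tool: a calculator standing in for search engines, etc."""
    if tool == "calculator":
        # eval on a fixed arithmetic string, for illustration only
        return str(eval(query, {"__builtins__": {}}))
    raise ValueError(f"unknown tool: {tool}")

def toy_model(context: str) -> dict:
    """Scripted stand-in for the LLM policy: it first requests a tool
    call, then answers once the tool result appears in its context."""
    if "=" not in context:
        return {"type": "tool_call", "tool": "calculator", "query": "3 * 7"}
    return {"type": "final_answer", "answer": context.rsplit("= ", 1)[-1]}

def generate_trajectory(question: str, model, max_steps: int = 5) -> list:
    context, trajectory = question, []
    for _ in range(max_steps):
        action = model(context)
        trajectory.append((context, action))  # one record per model action
        if action["type"] == "final_answer":
            break
        # Otherwise execute the requested tool and feed the result back in.
        result = run_tool(action["tool"], action["query"])
        context += f"\n{action['query']} = {result}"
    return trajectory

traj = generate_trajectory("What is 3 * 7?", toy_model)
# The trajectory holds one record per action: a tool call, then the answer.
```

In the paper's setting the scripted policy would be a real LLM and the tool results would come from live search or calculation; the loop structure, however, is the essence of Stage 1.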

Each step in a trajectory consists of an action and any associated tool calls, which a model judge evaluates for reasoning soundness. In Stage 2, these trajectories are used to fine-tune the base model via step-wise reinforcement learning, optimizing each action individually with model-based feedback (Figure 2).

Figure 2: SWiRL Stage 2 uses step-wise RL for each synthetic trajectory, improving multi-step learning through granular feedback on actions.
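The step-wise decomposition described above can be sketched in a few lines. This is a minimal sketch assuming a trajectory is stored as a list of `(context, action)` pairs; the field names are illustrative, not the paper's schema.

```python
# Sketch of SWiRL's step-wise decomposition: each action in an N-step
# trajectory becomes its own training example, so one trajectory yields
# N sub-trajectories that are filtered and optimized independently.

def decompose(trajectory: list) -> list:
    """Split a multi-step trajectory into per-step sub-trajectories."""
    return [{"input": context, "target_action": action}
            for context, action in trajectory]

trajectory = [
    ("Q: capital of France?", "SEARCH[capital of France]"),
    ("Q: capital of France?\nResult: Paris", "ANSWER[Paris]"),
]
subs = decompose(trajectory)
# Two sub-trajectories, one per action in the original rollout.
```

Because each sub-trajectory carries the full context up to its step, the RL objective can reward each action on its own merits rather than only through the final answer.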

Experiments

Experiments demonstrate SWiRL's superior performance over baseline models in several tasks, such as GSM8K, HotPotQA, and others, with improvements ranging from 11.1% to 21.5% relative accuracy. Particularly noteworthy is the transfer ability across different tasks: training exclusively on HotPotQA led to a 16.9% improvement in zero-shot performance on GSM8K.

Impact of Data Filtering

SWiRL's efficacy relies critically on process-based data filtering, which keeps only trajectories in which every step is judged reasonable by a model-based process reward model. This filtering outperforms both unfiltered and outcome-filtered setups, highlighting the importance of step-wise soundness in training data (Figure 3).

Figure 3: Comparison of filtering strategies shows that process-filtered data significantly enhances model performance.
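Process-based filtering can be sketched as follows. The interface is assumed: the paper uses a model-based process reward model as the judge, which is stubbed here with a trivial keyword check purely for illustration.

```python
# Minimal sketch of process-based filtering: keep only trajectories in
# which *every* step clears the judge's threshold, as opposed to
# outcome filtering, which looks only at final-answer correctness.

def process_filter(trajectories: list, judge, threshold: float = 0.5) -> list:
    """Keep trajectories whose every (context, action) step the judge accepts."""
    return [t for t in trajectories
            if all(judge(ctx, act) >= threshold for ctx, act in t)]

def toy_judge(context: str, action: str) -> float:
    # Stand-in for a model judge: reject empty/degenerate steps.
    return 1.0 if action.strip() else 0.0

good = [("q", "search x"), ("q + result", "answer y")]
bad = [("q", "search x"), ("q + result", "")]
kept = process_filter([good, bad], toy_judge)
# Only the trajectory whose every step passes the judge survives.
```

The key design choice, per the paper's findings, is that the filter is applied step by step rather than only to the final outcome.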

Generalization Across Tasks

SWiRL generalizes markedly across tasks: training on HotPotQA alone improves performance on disparate datasets such as GSM8K, indicating a transferable ability to manage multi-step reasoning and tool use.

Comparison with Supervised Fine-Tuning

SWiRL demonstrates significant advantages over supervised fine-tuning (SFT), achieving higher accuracy and robustness: it generalizes better, adapts to a range of data filtering strategies, and does not depend heavily on final-outcome correctness (Figure 4).

Figure 4: Compared with SFT, SWiRL excels owing to its learning from process-filtered data.

Impact of Tool Use

Under SWiRL, tool use at inference time yields marked improvements on complex queries, and the trained model remains strong even without tool access (Figure 5).

Figure 5: SWiRL improves performance both with and without tool use, highlighting its capacity to decompose complex problems.

Dataset and Model Size Scalability

Performance increases with dataset size, suggesting that SWiRL scales to larger synthetic datasets for effective learning. Model-size experiments show that larger models benefit more significantly from SWiRL's multi-step optimization (Figure 6).

Figure 6: Successive dataset scaling results in consistent performance improvements.

Conclusion

SWiRL's approach to multi-step reasoning and tool use demonstrates significant advantages in complex task optimization. Through synthetic data generation and reward-model-guided optimization, it surpasses traditional RL techniques in accuracy and generalization (Figure 7).

Figure 7: SWiRL's effectiveness scales with model size, with larger models showing the most consistent improvements.
