- The paper introduces a novel RL framework that trains LLMs to employ parallel thinking through a progressive curriculum and dual-stage rewards.
- The methodology integrates cold-start SFT on simple math problems with RL on challenging benchmarks, achieving an 83.7% success rate in parallel-thinking traces and notable accuracy improvements.
- The study demonstrates that alternating reward strategies guide LLMs from exploratory multi-threading to verified reasoning, and that using parallel thinking as a mid-training exploration scaffold yields up to a 42.9% accuracy gain over the baseline on AIME25.
Reinforcement Learning for Parallel Thinking in LLMs: The Parallel-R1 Framework
Introduction
Parallel thinking, the ability to concurrently explore multiple reasoning paths, has recently been recognized as a critical capability for advanced mathematical reasoning in LLMs. The Parallel-R1 framework introduces a reinforcement learning (RL) approach to instill parallel thinking in LLMs, moving beyond the limitations of supervised fine-tuning (SFT) on synthetic data. The framework leverages a progressive curriculum and novel reward designs to address the cold-start problem and optimize both reasoning accuracy and parallel thinking behaviors.
Figure 1: Overview of the Parallel-R1 framework, illustrating the inference workflow and progressive multi-stage training for parallel thinking.
Parallel thinking in LLMs is operationalized via two stages: (1) exploration, where the model detects critical steps and launches multiple independent reasoning threads, and (2) summary, where the model aggregates and synthesizes insights from these threads before resuming the main reasoning chain. This is implemented using control tags: <Parallel>, <Path>, and <Summary>, which structure the generation process and enable adaptive invocation of parallel reasoning.
During inference, the model generates responses auto-regressively until a <Parallel> tag is emitted, at which point multiple threads are spawned. Each thread explores a distinct solution path, and their outputs are summarized and merged back into the main context. This cycle can repeat as needed, allowing the model to flexibly balance exploration and verification.
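The paper does not ship a reference implementation of this loop, so the following is only a minimal sketch under assumed names: `model.generate` is a hypothetical decode call that stops on (and returns) a given stop string, and `num_paths` and `max_rounds` are illustrative parameters.

```python
def parallel_inference(model, prompt, num_paths=3, max_rounds=4):
    """Sketch of the tag-driven inference cycle: decode until <Parallel>,
    spawn independent <Path> threads, condense them into a <Summary>,
    then resume decoding on the merged context."""
    context = prompt
    for _ in range(max_rounds):
        chunk = model.generate(context, stop="<Parallel>")
        context += chunk
        if not chunk.endswith("<Parallel>"):
            break  # model finished without requesting another parallel block

        # Spawn independent reasoning threads from the shared prefix.
        paths = [
            model.generate(context + "<Path>", stop="</Path>")
            for _ in range(num_paths)
        ]

        # Aggregate the threads, then let the model write the summary.
        merged = context + "".join(f"<Path>{p}" for p in paths) + "<Summary>"
        summary = model.generate(merged, stop="</Summary>")
        context = merged + summary + "</Parallel>"
    return context
```

In this reading, the per-path decodes are what make the scheme parallel: they share a common prefix and can be batched, while the summary step is the only serialization point.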
Data Pipeline and Cold-Start Strategy
A key challenge in training parallel thinking is the scarcity of high-quality parallel reasoning traces in natural text, especially for complex problems. The framework addresses this by generating cold-start data on simple math problems (GSM8K) using detailed prompting, achieving an 83.7% success rate in valid parallel-thinking traces. This data is used to teach the model the format and basic behavior of parallel thinking before transitioning to RL on more difficult tasks.
A format-check algorithm ensures strict adherence to the parallel thinking structure, which is essential for structured model variants that rely on architectural modifications (e.g., path-window attention masks and multiverse position encodings).
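The paper does not reproduce the format-check algorithm in full; a plausible stack-based version over the three control tags looks like the sketch below. The requirement of at least two paths and exactly one summary per block is an assumption about what counts as "valid", not a rule quoted from the paper.

```python
import re

TAG_RE = re.compile(r"</?(Parallel|Path|Summary)>")

def check_parallel_format(trace: str) -> bool:
    """Return True if control tags are well-nested and each <Parallel> block
    contains at least two <Path> threads and exactly one <Summary>."""
    stack, path_counts, summary_counts = [], [], []
    for m in TAG_RE.finditer(trace):
        tag, closing = m.group(1), m.group(0).startswith("</")
        if not closing:
            if tag == "Parallel":
                stack.append("Parallel")
                path_counts.append(0)
                summary_counts.append(0)
            else:  # Path or Summary must open directly inside a Parallel block
                if not stack or stack[-1] != "Parallel":
                    return False
                stack.append(tag)
                if tag == "Path":
                    path_counts[-1] += 1
                else:
                    summary_counts[-1] += 1
        else:
            if not stack or stack[-1] != tag:
                return False  # mismatched or stray closing tag
            stack.pop()
            if tag == "Parallel":
                if path_counts.pop() < 2 or summary_counts.pop() != 1:
                    return False
    return not stack  # every opened tag must be closed
```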
Reinforcement Learning Algorithms and Training Recipes
The RL phase employs Group Relative Policy Optimization (GRPO), which optimizes the policy using group-wise advantage estimation and KL regularization; a minimal sketch of the group-relative advantage computation follows the stage list below. The training recipe consists of three stages:
- Cold-Start SFT: Fine-tuning on Parallel-GSM8K to learn the parallel thinking format.
- RL on Easy Math: Reinforcement learning on GSM8K with a binary reward that requires both correct answers and the use of parallel thinking.
- RL on General Math: RL on challenging datasets (DAPO, AIME, AMC, MATH) with accuracy-based rewards to generalize parallel thinking to complex problems.
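As a rough illustration of the group-relative advantage at the heart of GRPO (the normalization below is the standard mean/std form; the clipping, KL term, and exact constants are omitted):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each rollout sampled for the same prompt is
    scored against the mean and standard deviation of its own group."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 rollouts for one GSM8K prompt under the Stage-2 binary reward
# (1 only if the answer is correct AND the trace uses parallel thinking).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [ 1. -1. -1.  1.]
```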
For structured models, architectural modifications enforce strict isolation between reasoning paths, and reward schedules are alternated between accuracy and parallel thinking incentives to avoid overfitting to superficial patterns.
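One simple way to realize such an alternating schedule is to switch the reward on a fixed step cycle. The window length, bonus weight, and the two boolean inputs below are illustrative assumptions (an answer checker and the format check sketched earlier would supply them), not the paper's exact recipe.

```python
def alternating_reward(correct: bool, has_parallel: bool,
                       step: int, window: int = 20) -> float:
    """Alternate between an accuracy-only reward and a reward that also
    credits well-formed parallel thinking, switching every `window` steps."""
    acc = 1.0 if correct else 0.0
    parallel_window = (step // window) % 2 == 1
    if parallel_window and has_parallel:
        return acc + 0.5  # hypothetical bonus weight for parallel structure
    return acc
```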
Empirical Results and Ablation Studies
Parallel-R1 achieves substantial improvements over baselines. On AIME25, the causal variant (Parallel-R1-Seen) reaches 19.2% mean@16 accuracy, compared to 14.8% for the GRPO baseline. The framework yields an average accuracy improvement of 8.4% over sequential RL models and up to 42.9% improvement on AIME25 when using parallel thinking as a mid-training exploration scaffold.
Ablation studies reveal:
- Stage-wise RL: Cold-start SFT followed by RL on easy math is essential for causal models, while structured models require direct RL on hard tasks to avoid overfitting.
- Reward Design: Direct accuracy rewards fail to stimulate parallel thinking, while direct parallel rewards harm performance. Alternating rewards achieve a balance, increasing parallel ratio without sacrificing accuracy.
- Prompting: Detailed parallel thinking prompts improve model understanding and generalization.
Evolution of Parallel Thinking Behavior
Analysis of training dynamics shows a strategic shift in the use of parallel thinking. Early in RL training, the model employs parallel paths for computational exploration, increasing the likelihood of discovering correct solutions. As training progresses, the model transitions to using parallel thinking for late-stage multi-perspective verification, minimizing risk and maximizing reward.
Figure 2: Dynamics of the relative position of the <Parallel> block during RL training, indicating a shift from early exploration to late verification.
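The quantity tracked in Figure 2 can be approximated as the offset of the first <Parallel> tag divided by the total trace length; the character-level version below is a simplifying assumption (the paper may measure it in tokens).

```python
def parallel_relative_position(trace: str) -> float | None:
    """Relative position of the first <Parallel> block in a trace:
    values near 0.0 indicate early exploration, values near 1.0 indicate
    late-stage verification. Returns None if no parallel block is present."""
    idx = trace.find("<Parallel>")
    if idx == -1 or not trace:
        return None
    return idx / len(trace)
```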
Parallel Thinking as a Mid-Training Exploration Scaffold
Parallel thinking serves as an effective structured exploration mechanism during RL training. By enforcing parallel reasoning in the early stages, the model explores a broader policy space, avoiding local optima. Transitioning to accuracy-only rewards in later stages allows the model to exploit the most effective strategies discovered during exploration, resulting in higher final performance.
Figure 3: Two-stage training with parallel reasoning as a mid-training exploration scaffold, showing accuracy and parallel ratio trends.
Case Studies
Qualitative analysis demonstrates the model's ability to adaptively invoke parallel thinking for both exploration and verification. Early-stage models use parallel blocks to explore distinct algebraic methods, while late-stage models employ parallel thinking for verification after a primary solution is found.
Implications and Future Directions
The Parallel-R1 framework provides a scalable and effective approach to instilling parallel thinking in LLMs, with strong empirical gains in mathematical reasoning. The strategic evolution of parallel thinking behaviors and the utility of mid-training exploration scaffolds have broad implications for RL-based reasoning in LLMs. Future work may extend these techniques to other domains, investigate more adaptive reward schedules, and explore architectural innovations for more efficient parallel reasoning.
Conclusion
Parallel-R1 demonstrates that reinforcement learning, combined with a progressive curriculum and carefully designed rewards, can successfully teach LLMs to perform parallel thinking on complex reasoning tasks. The framework achieves consistent improvements over sequential baselines, reveals nuanced learning dynamics, and validates parallel thinking as a powerful exploration strategy in RL training. These findings advance the state of the art in reasoning with LLMs and open new avenues for research in adaptive, multi-path reasoning.