- The paper introduces Prompt-R1, which employs a dual-constrained RL framework for iterative prompt optimization to achieve improved reasoning outputs.
- Its plug-and-play architecture enables a small-scale LLM to collaboratively generate and refine prompts through multi-turn interactions with a larger LLM.
- Experimental results demonstrate superior performance over traditional methods, boosting prompt correctness and reasoning accuracy on diverse tasks.
Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
Introduction
The paper introduces Prompt-R1, an automatic prompting framework based on end-to-end reinforcement learning (RL) designed to improve the interaction between small- and large-scale LLMs. Prompt-R1 addresses a key limitation faced by users of LLMs: writing accurate and effective prompts for complex reasoning tasks. By leveraging collaborative multi-turn prompt interaction, Prompt-R1 enables a small LLM to intelligently craft prompts that guide the larger LLM through complex reasoning steps. A dual-constrained reward mechanism jointly optimizes prompt correctness, prompt quality, and reasoning accuracy.
Framework Overview
Prompt-R1 adopts a plug-and-play architecture, supporting effortless integration with various large-scale LLMs. It features a small-scale LLM operating as an agent that generates prompts in a multi-turn interaction framework. The agent refines its prompts by reasoning iteratively over continuous interactions with the large-scale LLM (acting as the environment), which executes each prompt and returns a response.
The reward system in Prompt-R1 is dual-constrained, scoring both format compliance and answer correctness. Through RL, the small-scale LLM learns a policy for generating effective prompts, improving the quality of the large-scale LLM's output over multiple turns.
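The sketch below illustrates how a dual-constrained reward of this kind can be computed. The exact weighting, tag convention, and correctness metric are not specified in this summary, so the `<think>`/`<prompt>` tags, weights, and exact-match check are illustrative assumptions rather than the paper's implementation.

```python
import re

def dual_constrained_reward(agent_output: str, final_answer: str, gold_answer: str,
                            format_weight: float = 0.2, answer_weight: float = 0.8) -> float:
    """Combine a format-compliance check with an answer-correctness check.

    The weights and the <think>/<prompt> tag convention are assumptions;
    the paper only states that both constraints shape the reward.
    """
    # Constraint 1: format compliance -- the agent must wrap its reasoning
    # and its generated prompt in the expected tags.
    format_ok = bool(re.search(r"<think>.*?</think>", agent_output, re.S)) and \
                bool(re.search(r"<prompt>.*?</prompt>", agent_output, re.S))

    # Constraint 2: answer correctness -- the large LLM's final answer must
    # match the reference (exact match here; a softer metric could be used).
    answer_ok = final_answer.strip().lower() == gold_answer.strip().lower()

    return format_weight * float(format_ok) + answer_weight * float(answer_ok)
```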
Implementation Details
Multi-Turn Prompt Interaction
The interaction proceeds in a multi-round format: the agent (small-scale LLM) generates a prompt based on the current task and the historical interaction context, and the response from the large-scale LLM updates this history, informing subsequent rounds. At each step the agent reasons over the environment's feedback to produce a more suitable prompt. The process terminates once a predefined stopping criterion is met, and the final response is used to formulate the solution.
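A minimal sketch of this loop is shown below. It assumes `small_llm` and `large_llm` are callables mapping text to text; the context formatting, the `<answer>` stopping tag, and the turn budget are hypothetical placeholders for the paper's actual stopping criterion.

```python
def solve_with_prompt_agent(question: str, small_llm, large_llm,
                            max_turns: int = 4) -> str:
    """Multi-turn prompting loop: the small-LLM agent writes a prompt,
    the large LLM (environment) answers, and the exchange is appended to
    the history that conditions the next turn."""
    history = []  # (prompt, response) pairs observed so far

    for turn in range(max_turns):
        # Agent step: reason over the task and the interaction history,
        # then emit the next prompt for the large model.
        context = f"Task: {question}\nHistory: {history}"
        prompt = small_llm(context)

        # Environment step: the large LLM answers the generated prompt.
        response = large_llm(prompt)
        history.append((prompt, response))

        # Stop once the agent signals it can formulate the final answer.
        if "<answer>" in prompt:
            break

    # The final response informs the solution returned to the user.
    return history[-1][1]
```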
Reinforcement Learning Optimization
RL plays a pivotal role in optimizing the interaction process. The small-scale LLM's policy is trained with Group Relative Policy Optimization (GRPO). The dual-constrained reward structure steers prompt generation toward responses that are both semantically correct and structurally well-formed. The framework is trained and evaluated with both locally deployed large models and commercial online APIs, demonstrating its adaptability across deployment settings.
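The core of GRPO is a group-relative advantage: several rollouts are sampled per question and each is scored against the group's mean reward, removing the need for a learned value critic. The sketch below shows only this advantage computation under that standard formulation; the full objective also includes the clipped policy ratio and a KL penalty, which are omitted here.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and standard deviation of its sampling group."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean_r) / std_r for r in rewards]

# Example: dual-constrained rewards of 4 rollouts for the same question
advantages = group_relative_advantages([1.0, 0.2, 0.2, 1.0])  # [1.0, -1.0, -1.0, 1.0]
```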
Experimental Results
Prompt-R1 shows significant performance improvements across diverse datasets spanning multi-hop reasoning, mathematical problem-solving, and text generation. It surpasses strong baselines, including Chain-of-Thought (CoT) prompting, supervised fine-tuning (SFT), and other prompt-optimization methods, with the largest gains on tasks that require multi-step reasoning.
Ablation and Generalization Studies
Ablation studies illustrate the critical contributions of the integrated RL mechanism and collaborative environment to Prompt-R1’s success. Prompt-R1 also exhibits strong generalization capabilities, achieving robust results on out-of-distribution datasets. Additionally, its ability to work seamlessly across different LLM environments underscores its versatility and efficiency.
Practical Implications
Prompt-R1's architecture offers several practical benefits. It improves LLM utility across AI-driven domains, particularly in contexts that demand nuanced reasoning and carefully crafted prompts. Its plug-and-play design supports task-specific customization without re-training the large LLM. The framework can also be extended to applications such as knowledge-based assistance systems, adaptive user interfaces, and complex decision-support systems.
Conclusion
Prompt-R1 contributes a robust framework for automating prompt formulation through collaborative multi-turn interactions between small- and large-scale LLMs. Its use of end-to-end reinforcement learning with dual-constrained rewards marks a substantial advance in automatic prompt optimization, enabling more efficient use of LLM capabilities across complex reasoning and generation tasks. Future work could focus on improving the framework's scalability, efficiency, and adaptability to even more diverse and dynamic task environments.