In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
(2510.05592v1)
Published 7 Oct 2025 in cs.AI, cs.CL, cs.LG, and cs.MA
Abstract: Outcome-driven reinforcement learning has advanced reasoning in LLMs, but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
Summary
The paper presents AgentFlow, which decomposes reasoning into planner, executor, verifier, and generator to enable robust multi-turn planning.
It introduces Flow-GRPO, an RL algorithm that transforms multi-turn credit assignment into tractable single-turn updates.
Empirical results demonstrate significant accuracy gains and improved tool-calling reliability across diverse benchmarks.
Introduction
The paper introduces AgentFlow, a trainable agentic system designed to address the limitations of monolithic tool-integrated reasoning models in LLMs. Traditional approaches interleave reasoning and tool calls within a single policy, which scales poorly with long-horizon tasks and diverse toolsets, and generalizes weakly to new scenarios. AgentFlow decomposes reasoning across four specialized modules—planner, executor, verifier, and generator—coordinated via an evolving memory. The planner is optimized on-policy inside the multi-turn loop, enabling adaptive, long-horizon planning and robust tool orchestration. The core innovation is Flow-based Group Refined Policy Optimization (Flow-GRPO), an RL algorithm that converts multi-turn optimization into tractable single-turn policy updates by broadcasting a single, verifiable trajectory-level outcome to every turn and stabilizing learning with group-normalized advantages.
AgentFlow Architecture and System Design
AgentFlow formalizes multi-turn, tool-integrated reasoning as a Markov Decision Process (MDP). Given a query q and a toolset K, at each turn the planner P proposes a sub-goal and selects a tool, the executor E invokes the selected tool, and the verifier V checks the result and decides whether the task is complete; once verification signals completion, the generator G produces the final solution. The evolving memory M records the reasoning process at every turn, supporting transparent state tracking and bounded context growth.
Figure 1: AgentFlow architecture: four modules interact via shared memory and toolset, enabling in-the-flow planning and tool use.
This modular decomposition enables explicit credit assignment and controllable behavior, overcoming the scaling and generalization issues of monolithic models. The planner is the only trainable component, allowing for efficient adaptation to dynamic tool outputs and recovery from early mistakes.
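To make the control flow concrete, here is a minimal Python sketch of one AgentFlow episode under stated assumptions: the module interfaces (planner, executor, verifier, and generator passed in as callables) and the list-backed Memory class are illustrative placeholders, not the authors' implementation; only the planner is treated as the trainable component, as in the paper.

```python
from dataclasses import dataclass, field

# Hypothetical module interfaces; names and signatures are illustrative,
# not taken from the AgentFlow codebase.
@dataclass
class Memory:
    """Evolving memory M: an append-only record of each turn's reasoning."""
    records: list = field(default_factory=list)

    def add(self, turn, subgoal, tool, result, verdict):
        self.records.append(
            {"turn": turn, "subgoal": subgoal, "tool": tool,
             "result": result, "verdict": verdict}
        )

def agentflow_episode(query, toolset, planner, executor, verifier, generator,
                      max_turns=10):
    """One rollout of the planner/executor/verifier/generator loop.

    Only `planner` is assumed trainable; the other modules are frozen LLM calls.
    """
    memory = Memory()
    for t in range(max_turns):
        # Planner reads the query and the evolving memory, then picks a
        # sub-goal and a tool from the toolset.
        subgoal, tool_name = planner(query, memory, toolset)

        # Executor invokes the selected tool on the sub-goal.
        result = executor(toolset[tool_name], subgoal)

        # Verifier judges whether the accumulated evidence answers the query.
        verdict = verifier(query, memory, result)
        memory.add(t, subgoal, tool_name, result, verdict)

        if verdict == "solved":
            break

    # Generator composes the final answer from the full memory.
    return generator(query, memory), memory
```

Because the memory stores a compact record per turn rather than the full transcript, the planner's context grows with the number of turns rather than with raw tool output length, which is the source of the bounded-context property noted above.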
Flow-GRPO: In-the-Flow Reinforcement Learning
Flow-GRPO addresses the long-horizon, sparse-reward credit assignment problem by broadcasting a single, verifiable trajectory-level reward to all turns in a rollout. This transforms multi-turn RL into a sequence of single-turn policy updates, aligning local planner decisions with global success. For a group of G rollouts sampled per query, the clipped surrogate objective is
$$
\mathcal{J}_{\text{Flow-GRPO}}(\theta) \;=\; \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\frac{1}{|a_i^t|}\sum_{j=1}^{|a_i^t|}\min\!\Big(\rho_{i,j}^{t}\,\hat{A}_i^{t},\;\operatorname{clip}\big(\rho_{i,j}^{t},\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i^{t}\Big)\right],
$$
where $\hat{A}_i^{t}$ is the group-normalized advantage, obtained by standardizing the trajectory-level reward $R(\tau_i)$ across the group and broadcasting it to every turn $t$, and $\rho_{i,j}^{t}$ is the token-level importance ratio between the current and rollout-time planner policies. This design ensures stable optimization and robust credit assignment.
Figure 2: Flow-GRPO optimization: multi-turn RL is reduced to single-turn updates via trajectory-level reward broadcasting.
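The PyTorch-style sketch below illustrates the two ingredients described above: the trajectory-level reward standardized within a rollout group and broadcast to every turn, and the clipped token-level surrogate. The flat token layout, tensor shapes, and the clipping constant clip_eps are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def flow_grpo_loss(logp_new, logp_old, rewards, rollout_ids,
                   clip_eps=0.2, eps=1e-8):
    """Clipped surrogate for a Flow-GRPO-style update (illustrative sketch).

    logp_new, logp_old: (N_tokens,) log-probs of planner tokens under the
        current and rollout policies, gathered over all turns of all rollouts.
    rewards: (G,) verifiable trajectory-level outcomes, one per rollout.
    rollout_ids: (N_tokens,) index of the rollout each token belongs to, so
        the trajectory reward is broadcast to every turn/token of that rollout.
    """
    # Group-normalized advantage: one scalar per rollout ...
    adv_per_rollout = (rewards - rewards.mean()) / (rewards.std() + eps)
    # ... broadcast to every turn and token of that rollout.
    adv = adv_per_rollout[rollout_ids]

    # Token-level importance ratio rho between current and rollout policies.
    ratio = torch.exp(logp_new - logp_old)

    # PPO-style clipped objective, maximized (so its negative is the loss).
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

Because every turn in a rollout shares the same broadcast advantage, each turn can be optimized as an independent single-turn update while still being pulled toward trajectories that ended in verified success.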
Empirical Results and Analysis
AgentFlow was evaluated on ten benchmarks spanning search-intensive, agentic, mathematical, and scientific reasoning tasks. All modules used Qwen2.5-7B-Instruct as the backbone, with only the planner trained via Flow-GRPO. AgentFlow achieved average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, outperforming top-performing baselines and even larger proprietary models like GPT-4o.
Figure 3: Left: AgentFlow performance before and after Flow-GRPO tuning. Right: Consistent gains over baselines, including larger proprietary models.
Ablation studies revealed that in-the-flow RL is crucial; offline supervised fine-tuning led to performance collapse, while Flow-GRPO yielded a 17.2% average gain over the frozen baseline. Scaling studies demonstrated consistent improvements with increased backbone size and turn budgets.
Figure 4: Tool scaling study: AgentFlow's performance improves when tools are upgraded from Qwen2.5-7B-Instruct to GPT-4o.
Tool Usage and Planning Quality
Flow-GRPO fine-tuning led to adaptive tool selection and enhanced tool-calling reliability. For example, on 2Wiki, Google Search usage increased by 42.0%, while on MedQA, the planner shifted towards specialized tools. Tool-calling error rates decreased by up to 28.4% on GAIA, indicating improved invocation accuracy.
Figure 5: Tool call ratio change by Flow-GRPO fine-tuning.
Qualitative case studies showed that AgentFlow, trained with Flow-GRPO, autonomously discovers new solution pathways and exhibits robust self-correction.
Figure 6: Case study: AgentFlow explores new solution pathways after failed attempts, demonstrating adaptive planning.
Training Efficiency and Scaling
AgentFlow's training dynamics show monotonically increasing reward and progressively more concise planner responses, indicating efficient policy learning. Compared to monolithic tool-integrated RL baselines, AgentFlow achieves sustained performance gains and avoids training instability.
Scaling experiments confirm that Flow-GRPO fine-tuning is effective across model capacities and that increased turn budgets enable deeper reasoning and improved outcomes.
Implementation Considerations
AgentFlow's modular design facilitates extensibility and transparent state management. The evolving memory enables deterministic tracking of reasoning steps, supporting reproducibility and debugging. Training requires on-policy rollouts with synchronous tool execution and LLM-based reward judging. Resource requirements include multi-GPU clusters for efficient RL fine-tuning, and inference benefits from bounded context growth due to memory structuring.
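As a rough illustration of the training-time plumbing described above, the sketch below collects a group of on-policy rollouts synchronously and scores each trajectory with an LLM judge that returns a binary outcome reward. The judge_llm callable, the prompt format, and the group size are hypothetical placeholders, not the authors' setup.

```python
def trajectory_reward(query, final_answer, reference, judge_llm):
    """Verifiable trajectory-level outcome via an LLM judge (sketch).

    `judge_llm` is a hypothetical callable returning the judge model's text
    response; the grading prompt is illustrative, not the paper's.
    """
    prompt = (
        "You are grading an answer.\n"
        f"Question: {query}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {final_answer}\n"
        "Reply with exactly 'CORRECT' or 'INCORRECT'."
    )
    verdict = judge_llm(prompt).strip().upper()
    return 1.0 if verdict.startswith("CORRECT") else 0.0


def collect_group(query, reference, rollout_fn, judge_llm, group_size=8):
    """Synchronously collect a group of on-policy rollouts and their rewards.

    `rollout_fn` runs one AgentFlow episode (tools executed synchronously)
    and returns the final answer plus the episode memory.
    """
    group = []
    for _ in range(group_size):
        answer, memory = rollout_fn(query)
        reward = trajectory_reward(query, answer, reference, judge_llm)
        group.append({"memory": memory, "reward": reward})
    return group
```

The resulting group of rewards is exactly what the Flow-GRPO loss sketch earlier consumes, which is why rollout collection and policy updates can be pipelined per query.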
Deployment strategies should consider tool latency, error handling, and dynamic adaptation to evolving toolsets. The framework is compatible with open-source and proprietary LLMs, and benefits from scaling both model size and tool quality.
Implications and Future Directions
AgentFlow demonstrates that trainable agentic systems with in-the-flow optimization can surpass monolithic models in complex, long-horizon reasoning tasks. The Flow-GRPO algorithm provides a principled solution to sparse-reward credit assignment, enabling robust planning and tool orchestration. The results suggest that modular agentic architectures, combined with outcome-driven RL, are a promising direction for scalable, generalizable reasoning in LLMs.
Future work should explore extending in-the-flow optimization to other modules, incorporating fine-grained reward signals, and scaling to more complex, open-ended tasks. Integrating richer toolsets, multi-agent collaboration, and hierarchical planning are potential avenues for further advancement.
Conclusion
AgentFlow represents a significant step towards scalable, adaptive agentic systems for effective planning and tool use in LLMs. By decomposing reasoning across specialized modules and optimizing the planner in-the-flow via Flow-GRPO, the system achieves strong cross-domain performance, improved planning quality, and reliable tool integration. The framework's modularity, training efficiency, and positive scaling trends position it as a robust foundation for future research in agentic LLMs and multi-tool reasoning.