
Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization (2512.24615v1)

Published 31 Dec 2025 in cs.AI

Abstract: Existing LLM agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose Youtu-Agent, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a Workflow mode for standard tasks and a Meta-Agent mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an Agent Practice module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an Agent RL module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agent in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities by up to 35% and 21% on math and general/multi-hop QA benchmarks respectively.

Summary

  • The paper demonstrates that decoupling agent construction from tool and environment configuration enables rapid automated synthesis and modular reusability.
  • It introduces a hybrid policy optimization approach that combines training-free GRPO with distributed reinforcement learning for scalable and efficient performance improvements.
  • Empirical results on web navigation and mathematical reasoning benchmarks confirm superior effectiveness and reproducibility using open-source models and tools.

Youtu-Agent: A Modular Framework for Automated Generation and Hybrid Policy Optimization of LLM Agents

Introduction

Youtu-Agent addresses major limitations in the current LLM agent paradigm by introducing a unified, modular framework for both automated agent generation and continuous policy optimization. The paper demonstrates how decoupling environments, tools, and context management facilitates not only programmatic agent construction but also rapid, stable optimization via hybrid reinforcement learning and training-free experiential approaches. The architecture is validated on demanding web navigation and mathematical reasoning tasks, establishing strong empirical results using only open-source models and tools.

Modular System Architecture

Youtu-Agent’s three-layered architecture—environment, tools, and agent—is parameterized through a structured YAML configuration system. This modularization supports both manual configuration and automated synthesis, enabling rapid iteration and flexible reuse of components across varied execution contexts.

  • Environment Layer: Abstracts execution backends (e.g., Playwright, shell, E2B sandbox), allowing agents to operate seamlessly across web, OS, and code environments.
  • Tools Layer: Standardizes atomic operations—including environment-specific, environment-independent, and MCP-integrated tools—enabling composability and reuse.
  • Agent Layer: Encapsulates an LLM planner/executor with integrated context management that episodically prunes and maintains a compact working context, supporting long-horizon, dynamic interactions.

Critically, this architecture enables the separation and recombination of capabilities, facilitating systematic management, automated construction, and scalable optimization, all within a unified control schema.
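
To make the decoupling concrete, the sketch below shows what such a layered configuration might look like when loaded in Python. The field names, values, and schema are illustrative assumptions, not Youtu-Agent's actual configuration format.

```python
# Illustrative sketch of a decoupled agent configuration (hypothetical schema,
# not the framework's actual field names).
import yaml

CONFIG = """
environment:
  backend: playwright        # execution backend: browser, shell, code sandbox, ...
  sandbox: e2b
toolkits:
  - name: web_search         # environment-independent tool
    type: builtin
  - name: browser_actions    # environment-specific tool bound to the backend
    type: env
  - name: docs_mcp           # external tool exposed via MCP
    type: mcp
    endpoint: http://localhost:8080
agent:
  model: open-weight-llm     # planner/executor LLM
  system_prompt: "You are a research assistant."
  context:
    max_tokens: 32768
    prune_policy: episodic   # keep a compact working context for long horizons
"""

config = yaml.safe_load(CONFIG)

# Because the layers are decoupled, a generator (or a human) can swap any block
# independently, e.g. reuse the same toolkits over a shell backend:
config["environment"] = {"backend": "shell"}
print(config["agent"]["model"], "->", config["environment"]["backend"])
```

The point of the sketch is the separation itself: each block can be retrieved, synthesized, or versioned on its own, which is what makes automated assembly tractable.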

Automated Agent Generation

The framework introduces two distinct paradigms for automated agent synthesis from high-level task descriptions:

  • Workflow Mode: Implements a deterministic four-stage pipeline (intent clarification, tool retrieval/synthesis, prompt engineering, configuration assembly) for routine, well-specified tasks; a minimal sketch of this pipeline follows Figure 1.
  • Meta-Agent Mode: Deploys an LLM-based architect agent, augmenting conventional workflow logic with multi-turn dialogue and adaptive tool synthesis, optimizing for ambiguous or complex user instructions (Figure 1).

    Figure 1: The automated agent generation mechanism, supporting both deterministic workflow and architect-driven meta-agent modes.
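
As referenced above, the following is a minimal Python sketch of how the four Workflow stages could compose into a generation pipeline. The stage functions, their signatures, and the placeholder retrieval logic are hypothetical, not the framework's actual API.

```python
# Hypothetical four-stage Workflow generation pipeline (illustrative only).
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    task: str
    tools: list = field(default_factory=list)
    prompt: str = ""
    config: dict = field(default_factory=dict)

def clarify_intent(task: str) -> str:
    # Stage 1: normalize the user's high-level description into a task spec.
    return task.strip()

def retrieve_or_synthesize_tools(task: str) -> list:
    # Stage 2: look up existing tools; synthesize missing ones as code.
    return ["web_search"]  # placeholder retrieval result

def engineer_prompt(task: str, tools: list) -> str:
    # Stage 3: draft a system prompt referencing the selected tools.
    return f"Solve: {task}. Available tools: {', '.join(tools)}."

def assemble_config(spec: AgentSpec) -> dict:
    # Stage 4: emit a structured (YAML-serializable) agent configuration.
    return {"agent": {"system_prompt": spec.prompt}, "toolkits": spec.tools}

def workflow_generate(task: str) -> AgentSpec:
    spec = AgentSpec(task=clarify_intent(task))
    spec.tools = retrieve_or_synthesize_tools(spec.task)
    spec.prompt = engineer_prompt(spec.task, spec.tools)
    spec.config = assemble_config(spec)
    return spec

print(workflow_generate("Summarize recent arXiv papers on LLM agents").config)
```

In the Meta-Agent mode, an architect LLM would replace the fixed stage order with multi-turn dialogue, but the same stages remain the underlying building blocks.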

Empirical evaluation on the AgentGen-80 suite reveals a configuration validity rate exceeding 98%, and tool synthesis executability exceeding 81% across both paradigms. Meta-Agent mode demonstrates marginally higher end-to-end task completion while maintaining the flexibility necessary for under-specified or edge-case problem domains.

Hybrid Policy Optimization Approaches

Youtu-Agent innovates on policy optimization by supporting both non-gradient, experience-based in-context improvements and full-scale, distributed reinforcement learning. This duality ensures agents remain both adaptive and scalable.

Training-Free Group Relative Policy Optimization

The Agent Practice module integrates a training-free, group-relative policy optimization (GRPO) approach. Instead of backpropagation, it uses multi-rollout, LLM-based trajectory evaluations to synthesize semantic advantage signals. Experiential knowledge, distilled into a textual experience library (a "textual LoRA"), is then injected into the agent's context at deployment time to guide reasoning (Figure 2).

Figure 2: Training-free GRPO accumulates and distills experience from trajectory samples without model weight updates.
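
A minimal sketch of this training-free loop is given below, assuming a callable LLM and a stub in place of the paper's LLM-based trajectory evaluator; the function names, the dummy scoring, and the distillation heuristic are illustrative assumptions only.

```python
# Sketch of training-free, group-relative experience accumulation:
# no gradients, no weight updates, only a growing textual experience library.

def rollout(llm, task, experience):
    # Run the agent once with the current experience text prepended to context.
    return llm(f"{experience}\n\nTask: {task}")

def judge(trajectories):
    # Score each trajectory (stand-in for an LLM-based evaluator); the
    # group-relative "advantage" is each score minus the group mean.
    scores = [len(t) % 5 for t in trajectories]  # dummy scores for illustration
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

def distill(trajectories, advantages, experience):
    # Turn the highest-advantage rollout into a short textual lesson and append
    # it to the experience library.
    best = trajectories[max(range(len(advantages)), key=advantages.__getitem__)]
    return experience + f"\n- Prefer strategies like: {best[:80]}..."

def practice(llm, tasks, group_size=4):
    experience = ""
    for task in tasks:
        group = [rollout(llm, task, experience) for _ in range(group_size)]
        advantages = judge(group)
        experience = distill(group, advantages, experience)
    return experience  # injected into the deployed agent's context
```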

Evaluation on the AIME 2024/2025 mathematical reasoning benchmarks demonstrates absolute mean@32 gains of +2.7% and +5.4% on DeepSeek-V3.1-Terminus, rivaling RL methods that require orders of magnitude more data and compute. Notably, this improvement is obtained with only 100 examples and no model parameter updates, underscoring the practical accessibility of this approach (Figure 3).

Figure 3: During Training-free GRPO, both solution quality and tool utilization efficiency improve with accumulated experience.

Scalable and Stable Reinforcement Learning

For end-to-end agent policy training, the Agent RL module integrates tightly with distributed RL frameworks through RESTful API connectors, Ray-based parallelism, and multi-level timeout handling to ensure both scalability (128-GPU scaling, 40% iteration speedup) and resilience to training pathologies such as entropy explosion (Figure 4).

Figure 4: End-to-end RL training pipeline in Youtu-Agent, showing data flow through distributed RL and agent-inference systems.
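
The snippet below sketches one way Ray-parallel rollouts with a batch-level timeout could be organized so that a single hung environment cannot stall a PPO iteration. The RESTful connector details are omitted, and the function names and timeout values are assumptions rather than the module's actual interface.

```python
# Hedged sketch: parallel rollouts via Ray with a batch-level timeout.
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def run_episode(task_id: int) -> dict:
    # One agent rollout against the inference service (stubbed here).
    return {"task_id": task_id, "reward": 1.0}

def collect_rollouts(task_ids, batch_timeout_s=300.0):
    refs = [run_episode.remote(t) for t in task_ids]
    # Batch-level timeout: take whatever finished in time and cancel stragglers
    # so slow or hung environments cannot block the training iteration.
    done, pending = ray.wait(refs, num_returns=len(refs), timeout=batch_timeout_s)
    for ref in pending:
        ray.cancel(ref, force=True)
    return ray.get(done)

print(collect_rollouts(list(range(8))))
```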

Algorithmic adjustments, including the removal of batch shuffling, filtering of tool-call data, and corrected advantage estimation, stabilize PPO dynamics for long-horizon, tool-based reasoning tasks (Figure 5).
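
To illustrate the flavor of these corrections, the sketch below computes group-relative advantages and applies a loss mask over tool-returned tokens so that observation tokens receive no policy gradient; the exact masking, normalization, and clipping used by Youtu-Agent may differ.

```python
# Illustrative sketch of group-relative advantages with tool-token masking.
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # rewards: outcome rewards for a group of rollouts of the same prompt.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def masked_policy_loss(logprob_ratio, advantage, tool_mask, clip=0.2):
    # logprob_ratio, tool_mask: per-token arrays for one trajectory;
    # tool_mask is 0 on tokens emitted by tools/environment, 1 on model tokens.
    adv = advantage * tool_mask
    unclipped = logprob_ratio * adv
    clipped = np.clip(logprob_ratio, 1 - clip, 1 + clip) * adv
    return -np.sum(np.minimum(unclipped, clipped)) / max(tool_mask.sum(), 1)

# Example: four rollouts of one prompt, then a per-trajectory masked loss.
adv = group_advantages(np.array([1.0, 0.0, 1.0, 0.0]))
loss = masked_policy_loss(np.array([1.1, 0.9, 1.0]), adv[0], np.array([1, 0, 1]))
print(adv, loss)
```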

Figure 5: Agent RL module achieves significant per-iteration and rollout generation speedup compared to baseline RL frameworks.

Figure 6: Stable KL divergence during PPO, indicating robust convergence characteristics following the framework's optimizations.

On AIME and general/multi-hop QA benchmarks, Agent RL training of Qwen2.5-7B-Instruct yields accuracy gains of +35% and +22% on AIME and up to +21% on NaturalQuestions, confirming both the effectiveness and generality of the reinforcement learning integration.

Empirical Performance and Applications

Youtu-Agent delivers robust empirical performance on two prominent LLM-agent benchmarks: 71.47% pass@1 on WebWalkerQA and 72.8% pass@1 on the GAIA text-only subset. These results are achieved with strictly open-source models and tools, validating the accessibility and scalability of the system (Figure 7).

Figure 7: Youtu-Agent outperforms baseline and prior approaches on WebWalkerQA, with both training-free and RL-enhanced agent variants.

Beyond research, Youtu-Agent is packaged within Tip, a desktop multimodal agent, enabling local, secure, GUI-automating agents that load YAML-based configurations and support proactive, intent-driven user interaction (Figure 8).

Figure 8: The Tip application demonstrates on-device, practical deployment leveraging Youtu-Agent’s modular config system and skills automation.

Theoretical and Practical Implications

The Youtu-Agent framework makes several substantive contributions:

  • Separation of specification and instantiation: By decoupling agent logic from tool/environment implementation, agents can be programmatically generated, versioned, and optimized with minimal human intervention.
  • Bridging inference-time and policy-optimization improvement: Experience-driven, parameter-free optimization (training-free GRPO) can be integrated with and augment full RL pipelines, opening the door to data-efficient, adaptive agent evolution without the requisite computational overhead.
  • Scalability and reproducibility: Systematic configuration and support for distributed training facilitate transparent experimentation and real-world deployment.

These advances imply that both the cost and complexity barriers to high-quality, adaptive agent design and training can be dramatically reduced. The hybrid approach to optimization unifies the advantages of LLMs’ contextual learning with conventional policy gradient methods, offering new avenues for sample-efficient, robust agent policy learning.

Future Directions

Future developments include extending Youtu-Agent to support additional execution environments, further integrating multi-agent collaboration, and exploring advanced strategies for experience accumulation—potentially incorporating memory-augmented LLMs and agentic meta-learning frameworks. The practical deployment of on-device personal agents, as realized in Tip, aligns with trends toward user-controlled, privacy-preserving AI.

Conclusion

Youtu-Agent unifies modular agent construction, automated synthesis, and scalable hybrid policy optimization in a single open-source framework (2512.24615). The system demonstrates strong empirical results across demanding web navigation and mathematical reasoning tasks, achieves superior training efficiency, and supports both programmatic agent generation and inference-time experiential improvement. This architecture establishes an adaptable, efficient foundation for future LLM-agent research and deployment, bridging the gap between static, manually-configured agents and continuously-evolving, scalable agentic systems.
