Youtu-Agent: A Modular Framework for Automated Generation and Hybrid Policy Optimization of LLM Agents
- The paper demonstrates that decoupling agent construction from tool and environment configuration enables rapid automated synthesis and modular reusability.
- It introduces a hybrid policy optimization approach that combines training-free GRPO with distributed reinforcement learning for scalable and efficient performance improvements.
- Empirical results on web navigation and mathematical reasoning benchmarks confirm superior effectiveness and reproducibility using open-source models and tools.
Introduction
Youtu-Agent addresses major limitations in the current LLM agent paradigm by introducing a unified, modular framework for both automated agent generation and continuous policy optimization. The paper demonstrates how decoupling environments, tools, and context management enables not only programmatic agent construction but also rapid, stable optimization via hybrid reinforcement learning and training-free experiential approaches. The architecture is validated on demanding web navigation and mathematical reasoning tasks, establishing strong empirical results using only open-source models and tools.
Modular System Architecture
Youtu-Agent’s three-layered architecture—environment, tools, and agent—is parameterized through a structured YAML configuration system. This modularization supports both manual configuration and automated synthesis, enabling rapid iteration and flexible reuse of components across varied execution contexts.
- Environment Layer: Abstracts execution backends (e.g., Playwright, shell, E2B sandbox), allowing agents to operate seamlessly across web, OS, and code environments.
- Tools Layer: Standardizes atomic operations—including environment-specific, environment-independent, and MCP-integrated tools—enabling composability and reuse.
- Agent Layer: Encapsulates an LLM planner/executor with integrated context management that episodically prunes and maintains a compact working context, supporting long-horizon, dynamic interactions.
Critically, this architecture enables the separation and recombination of capabilities, facilitating systematic management, automated construction, and scalable optimization, all within a unified control schema.
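To make the configuration-driven design concrete, here is a minimal sketch of how a single YAML file might declare all three layers and be assembled into an agent description. The schema, field names, and the `build_agent` helper are hypothetical illustrations, not Youtu-Agent’s actual API.

```python
# Minimal sketch of configuration-driven agent assembly.
# The YAML schema and all field names below are hypothetical.
import yaml

CONFIG = """
environment:
  backend: playwright        # e.g. playwright | shell | e2b_sandbox
tools:
  - name: web_search
    kind: environment_independent
  - name: click_element
    kind: environment_specific
agent:
  model: qwen2.5-7b-instruct
  context:
    max_turns_kept: 8        # episodic pruning of the working context
"""

def build_agent(config: dict) -> dict:
    """Assemble an agent description from the decoupled config sections."""
    env = {"backend": config["environment"]["backend"]}
    tools = [t["name"] for t in config["tools"]]
    return {
        "model": config["agent"]["model"],
        "tools": tools,
        "environment": env,
        "context_policy": config["agent"]["context"],
    }

if __name__ == "__main__":
    cfg = yaml.safe_load(CONFIG)
    print(build_agent(cfg))
```

Because the environment, tool, and agent sections are independent, any one of them can be swapped or regenerated without touching the others, which is what enables the automated synthesis described next.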
Automated Agent Generation
The framework introduces two distinct paradigms for automated agent synthesis from high-level task descriptions; both are evaluated below.
Empirical evaluation on the AgentGen-80 suite reveals a configuration validity rate exceeding 98%, and tool synthesis executability exceeding 81% across both paradigms. Meta-Agent mode demonstrates marginally higher end-to-end task completion while maintaining the flexibility necessary for under-specified or edge-case problem domains.
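As a rough illustration of what automated synthesis from a task description can look like, the sketch below has an LLM draft a YAML config and accepts it only after a structural validity check, retrying on failure. The generate-and-validate loop, the `llm` stub, and all function names are assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch of an agent-synthesis loop: an LLM drafts a YAML agent
# config from a task description, and the draft is accepted only if it parses
# and declares the three layers. None of these names are Youtu-Agent's API.
import yaml

def llm(prompt: str) -> str:
    """Stub for a chat-completion call that returns YAML text."""
    raise NotImplementedError

def validate_config(text: str) -> bool:
    """Cheap structural check: parses as YAML and contains the three layers."""
    try:
        cfg = yaml.safe_load(text)
    except yaml.YAMLError:
        return False
    return isinstance(cfg, dict) and {"environment", "tools", "agent"} <= cfg.keys()

def synthesize_agent(task: str, max_attempts: int = 3):
    prompt = f"Write a YAML agent config (environment/tools/agent) for: {task}"
    for _ in range(max_attempts):
        draft = llm(prompt)
        if validate_config(draft):
            return yaml.safe_load(draft)
        prompt += "\nThe previous draft was invalid; fix it."
    return None
```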
Hybrid Policy Optimization Approaches
Youtu-Agent innovates on policy optimization by supporting both non-gradient, experience-based in-context improvements and full-scale, distributed reinforcement learning. This duality ensures agents remain both adaptive and scalable.
Training-Free Group Relative Policy Optimization
The Agent Practice module integrates a training-free Group Relative Policy Optimization (GRPO) approach. Instead of backpropagation, it uses multi-rollout, LLM-based trajectory evaluation to synthesize semantic advantage signals. The resulting experiential knowledge, distilled as a “textual LoRA,” is injected into the context at deployment time to guide agent reasoning.
Figure 2: Training-free GRPO accumulates and distills experience from trajectory samples without model weight updates.
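The sketch below illustrates the training-free GRPO loop described above under stated assumptions: a group of rollouts per problem, an LLM judge producing scores, group-relative advantages computed against the group mean, and a best/worst contrast distilled into a textual lesson that is prepended to later prompts. The `llm` and `judge` stubs and all names are illustrative, not the paper’s interface.

```python
# Hedged sketch of training-free GRPO: group-relative advantages are turned
# into a textual experience entry ("textual LoRA") instead of a gradient step.
from statistics import mean

def llm(prompt: str) -> str:
    raise NotImplementedError   # stand-in for a chat-completion call

def judge(problem: str, trajectory: str) -> float:
    raise NotImplementedError   # stand-in for an LLM judge, score in [0, 1]

def practice(problems: list[str], experience: list[str], group_size: int = 4) -> list[str]:
    for problem in problems:
        context = "\n".join(experience)
        group = [llm(f"{context}\nSolve step by step, using tools:\n{problem}")
                 for _ in range(group_size)]
        scores = [judge(problem, t) for t in group]
        baseline = mean(scores)                       # group-relative baseline
        advantages = [s - baseline for s in scores]
        best = group[advantages.index(max(advantages))]
        worst = group[advantages.index(min(advantages))]
        # Distill a reusable lesson from the contrast between rollouts.
        lesson = llm(
            "Compare a strong and a weak solution attempt and state one short, "
            f"general lesson.\nSTRONG:\n{best}\nWEAK:\n{worst}"
        )
        experience.append(lesson)
    return experience
```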
Evaluation on the AIME 2024/2025 mathematical reasoning benchmarks demonstrates absolute mean@32 gains of +2.7% and +5.4% on DeepSeek-V3.1-Terminus, rivaling RL methods that require orders of magnitude more data and compute. Notably, this improvement is obtained with only 100 examples and no model parameter updates, underscoring the practical accessibility of this approach.
Figure 3: During Training-free GRPO, both solution quality and tool utilization efficiency improve with accumulated experience.
Scalable and Stable Reinforcement Learning
For end-to-end agent policy training, the Agent RL module integrates tightly with distributed RL frameworks through RESTful API connectors, Ray-based parallelism, and multi-level timeout handling, ensuring both scalability (scaling to 128 GPUs with a 40% iteration speedup) and resilience to training pathologies such as entropy explosion.
Figure 4: End-to-end RL training pipeline in Youtu-Agent, showing data flow through distributed RL and agent-inference systems.
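A plausible shape for the distributed rollout path, assuming a RESTful agent-inference endpoint and Ray task parallelism, is sketched below with two timeout levels: a per-step HTTP timeout inside each worker and a per-rollout timeout enforced by the driver. The endpoint URL, payload format, and function names are hypothetical, not the paper’s implementation.

```python
# Sketch of distributed rollout generation with multi-level timeouts.
import ray
import requests

AGENT_ENDPOINT = "http://localhost:8000/agent/step"   # hypothetical RESTful connector

@ray.remote
def run_rollout(task: dict, max_steps: int = 20, step_timeout: float = 30.0) -> dict:
    trajectory = []
    for _ in range(max_steps):
        # Level 1: per-step timeout on the HTTP call to the agent service.
        resp = requests.post(AGENT_ENDPOINT,
                             json={"task": task, "history": trajectory},
                             timeout=step_timeout)
        step = resp.json()
        trajectory.append(step)
        if step.get("done"):
            break
    return {"task": task, "trajectory": trajectory}

def collect_rollouts(tasks: list[dict], rollout_timeout: float = 600.0) -> list[dict]:
    refs = [run_rollout.remote(t) for t in tasks]
    results = []
    for ref in refs:
        try:
            # Level 2: per-rollout timeout enforced when fetching the result.
            results.append(ray.get(ref, timeout=rollout_timeout))
        except ray.exceptions.GetTimeoutError:
            ray.cancel(ref, force=True)               # drop stuck rollouts
    return results
```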
Algorithmic adjustments—ablation of batch shuffling, filtered tool call data, corrected advantage estimation—stabilize PPO dynamics for long-horizon, tool-based reasoning tasks.
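One common reading of “filtered tool call data” in agentic PPO is to mask tokens emitted by tools or the environment out of the policy loss so the model is optimized only on tokens it actually generated. The snippet below sketches that masking together with a clipped PPO objective and a sequence-level advantage broadcast; it is an assumption-laden illustration of the kind of adjustment described above, not the paper’s code.

```python
# Hedged sketch: mask environment/tool tokens out of a clipped PPO loss.
import torch

def masked_policy_loss(logprobs: torch.Tensor,       # [B, T] new log-probs
                       old_logprobs: torch.Tensor,   # [B, T] rollout log-probs
                       advantages: torch.Tensor,     # [B] sequence-level advantages
                       action_mask: torch.Tensor,    # [B, T] 1 = model token, 0 = tool output
                       clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)                    # broadcast advantage over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)
    # Average only over tokens the policy actually generated.
    return (per_token * action_mask).sum() / action_mask.sum().clamp(min=1)
```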

Figure 5: Agent RL module achieves significant per-iteration and rollout generation speedup compared to baseline RL frameworks.
Figure 6: Stable KL divergence during PPO, indicating robust convergence characteristics following the framework’s optimizations.
On mathematical reasoning and multi-hop QA benchmarks, Qwen2.5-7B-Instruct demonstrates accuracy gains of +35% and +22% on AIME and up to +21% on NaturalQuestions, confirming both the effectiveness and generality of the reinforcement learning integration.
Benchmark Results and Real-World Deployment
Youtu-Agent delivers robust empirical performance on two prominent LLM-agent benchmarks: 71.47% pass@1 on WebWalkerQA and 72.8% pass@1 on the GAIA text-only subset. These results are achieved with strictly open-source models and tools, validating the accessibility and scalability of the system.
Figure 7: Youtu-Agent outperforms baseline and prior approaches on WebWalkerQA, with both training-free and RL-enhanced agent variants.
Beyond research, Youtu-Agent is packaged within Tip, a desktop multimodal agent, enabling local, secure, GUI-automating agents that load YAML-based configurations and support proactive, intent-driven user interaction.
Figure 8: The Tip application demonstrates on-device, practical deployment leveraging Youtu-Agent’s modular config system and skills automation.
Theoretical and Practical Implications
The Youtu-Agent framework makes several substantive contributions:
- Separation of specification and instantiation: By decoupling agent logic from tool/environment implementation, agents can be programmatically generated, versioned, and optimized with minimal human intervention.
- Bridging inference-time and policy-optimization improvement: Experience-driven, parameter-free optimization (training-free GRPO) can be integrated with and augment full RL pipelines, opening the door to data-efficient, adaptive agent evolution without the full computational overhead of gradient-based training.
- Scalability and reproducibility: Systematic configuration and support for distributed training facilitate transparent experimentation and real-world deployment.
These advances imply that both the cost and complexity barriers to high-quality, adaptive agent design and training can be dramatically reduced. The hybrid approach to optimization unifies the advantages of LLMs’ contextual learning with conventional policy gradient methods, offering new avenues for sample-efficient, robust agent policy learning.
Future Directions
Future developments include extending Youtu-Agent to support additional execution environments, further integrating multi-agent collaboration, and exploring advanced strategies for experience accumulation—potentially incorporating memory-augmented LLMs and agentic meta-learning frameworks. The practical deployment of on-device personal agents, as realized in Tip, aligns with trends toward user-controlled, privacy-preserving AI.
Conclusion
Youtu-Agent unifies modular agent construction, automated synthesis, and scalable hybrid policy optimization in a single open-source framework (2512.24615). The system demonstrates strong empirical results across demanding web navigation and mathematical reasoning tasks, achieves superior training efficiency, and supports both programmatic agent generation and inference-time experiential improvement. This architecture establishes an adaptable, efficient foundation for future LLM-agent research and deployment, bridging the gap between static, manually-configured agents and continuously-evolving, scalable agentic systems.