AgentFly: RL-Driven LLM Agent Framework

Updated 25 August 2025
  • AgentFly is a reinforcement learning framework that integrates token-level RL masking, decorator extensibility, and memory-augmented MDPs to enable continual, multi-turn skill acquisition.
  • It decouples agent-environment interactions from RL optimization, supporting asynchronous execution and efficient resource management for high-throughput tasks.
  • AgentFly’s open-source design offers modular tools and environments for real-world applications such as webshop navigation and scientific problem solving.

AgentFly refers to a family of reinforcement learning (RL) frameworks and agent models designed to equip large language model (LLM) agents with adaptive, scalable, and extensible capabilities for complex tasks. These frameworks address the limitations of prompt engineering and supervised fine-tuning by systematically integrating RL, particularly in multi-turn, tool-using, and open-ended task settings, while offering efficient mechanisms for continual learning and adaptation. Key architectural contributions include token-level RL masking, decorator-based extensibility for tool and reward definitions, asynchronous resource management for scalable environment interaction, and a memory-augmented Markov Decision Process (M-MDP) formalism for continual skill acquisition without LLM fine-tuning.

1. Architectural Overview

AgentFly was introduced to support extensible and scalable RL for LM agents operating in diverse environments and interfaces (Wang et al., 20 Jul 2025). The framework builds upon Verl (a baseline RL platform), introducing a separate agent module responsible for managing rollouts, multi-turn action generation, and seamless integration with tool APIs. The design decouples agent-environment interactions from RL optimization logic, enabling state-of-the-art RL algorithms—including PPO, REINFORCE++, GRPO, and RLOO—to be applied with minimal modifications.

AgentFly formalizes multi-turn agent–environment interaction as token-level trajectories, where language outputs, tool calls, and environment observations may all be interleaved:

T = (\text{prompt}, (r_1, o_1), (r_2, o_2), \ldots)

AgentFly’s masking strategy ensures that only LM-generated tokens (the responses r_1, r_2, ...) contribute to the RL objective, while tokens injected by tools or the environment are ignored in loss computation.

2. Token-Level Reinforcement Learning and Multi-Turn Support

Standard RL for LLMs requires careful handling of which segments of an episode should influence learning. AgentFly introduces a token-level masking technique:

L_{\text{PPO}}(\theta) = \sum_t f(a_t, s_t, \hat{A}_t) \cdot M_t

where f(a_t, s_t, \hat{A}_t) is the standard per-token PPO surrogate term, M_t = 1 if a_t is a token generated by the LM, and M_t = 0 otherwise. In multi-turn settings, this mask aggregates over all LM-generated responses, ensuring that reward signals backpropagate only through the agent's own output, not through tokens produced by the environment or returned by tools.
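
As an illustration of how such a mask can be applied, the sketch below computes a clipped PPO surrogate over per-token log-probabilities and zeroes out non-agent tokens before averaging. It is a minimal, simplified stand-in (tensor shapes and the surrogate form are assumptions), not AgentFly's internal implementation.

import torch

def masked_ppo_loss(logprobs_new, logprobs_old, advantages, action_mask, clip_eps=0.2):
    """Clipped PPO surrogate restricted to LM-generated tokens.

    action_mask is 1.0 for tokens the agent produced and 0.0 for tokens
    injected by tools or the environment (the M_t of the formula above).
    """
    ratio = torch.exp(logprobs_new - logprobs_old)             # per-token importance ratios
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)                 # standard clipped objective
    # Zero out tool/environment tokens and average over agent tokens only.
    return (per_token * action_mask).sum() / action_mask.sum().clamp(min=1)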

This approach enables AgentFly to support rich, interactive task structures where agents may need to reason, plan, execute multi-step tool use, and interact with simulated or real-world APIs.

3. Extensibility: Decorator-Based Tool and Reward Definition

AgentFly employs a decorator-based programming interface, facilitating modular addition of tools, environments, and reward functions in Python. This design allows users to annotate ordinary Python functions as tools or rewards:

@tool(name="calculator")
def calculate(expression: str):
    # Evaluate a math expression; note that AgentFly's bundled code
    # interpreter executes such code inside a secure container.
    return eval(expression)

For stateful or more complex tools (e.g., code interpreters, WebShop, ALFWorld), additional metadata (such as environment class and resource pool size) is specified in decorators. The uniform interface schema ensures consistent access and manipulation by the agent, while greatly reducing developer burden for extensibility and rapid prototyping.
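
As a rough, self-contained illustration of this registration pattern (not AgentFly's actual implementation; the real decorator signatures and metadata fields should be taken from the repository), a decorator-based registry can be written in a few lines of plain Python:

# Illustrative sketch only: AgentFly's real decorators and metadata fields may differ.
TOOL_REGISTRY = {}
REWARD_REGISTRY = {}

def tool(name, **metadata):
    """Register a plain function as a tool, with optional metadata
    (e.g., an environment class or resource pool size for stateful tools)."""
    def wrap(fn):
        TOOL_REGISTRY[name] = {"fn": fn, "meta": metadata}
        return fn
    return wrap

def reward(name):
    """Register a scalar-valued reward function under a given name."""
    def wrap(fn):
        REWARD_REGISTRY[name] = fn
        return fn
    return wrap

@reward(name="exact_match")
def exact_match(prediction: str, answer: str) -> float:
    return 1.0 if prediction.strip() == answer.strip() else 0.0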

4. Scalability: Asynchronous Execution and Resource Management

To address bottlenecks in high-throughput RL, particularly during I/O-heavy tool invocations and complex simulators, AgentFly provides:

  • Asynchronous execution chains: Each query may initiate multiple rollout chains, allowing the pipeline to make full use of available compute and I/O resources. This mechanism maintains non-blocking execution even with thousands of parallel tool calls.
  • Centralized resource management: Stateful tools and environments are managed in resource pools. When a rollout chain invokes a tool, AgentFly allocates an instance from the pool, binds it to the chain, and recycles it upon task completion. This permits elastic scaling across heterogeneous compute and simulation backends (a minimal sketch of this acquire/bind/recycle pattern follows the list).
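
The asyncio sketch below illustrates the acquire/bind/recycle pattern described above. It assumes a generic environment object and an externally supplied run_episode coroutine; it shows the mechanism rather than AgentFly's actual resource manager.

import asyncio

class ResourcePool:
    """Minimal pool of stateful environment instances (illustrative sketch)."""

    def __init__(self, instances):
        self._queue = asyncio.Queue()
        for env in instances:
            self._queue.put_nowait(env)

    async def acquire(self):
        # Waits (without blocking the event loop) until an instance is free.
        return await self._queue.get()

    def release(self, env):
        self._queue.put_nowait(env)

async def rollout_chain(pool, run_episode):
    env = await pool.acquire()            # bind a pooled environment to this chain
    try:
        return await run_episode(env)     # multi-turn tool calls proceed concurrently
    finally:
        pool.release(env)                 # recycle the instance for other chains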

5. Prebuilt Tools, Environments, and Task Diversity

AgentFly includes a suite of prebuilt tools and environments demonstrating the flexibility of the framework:

Tool/Environment | Description | Key Use Case
Code Interpreter | Secure Python execution in a container | Mathematical problem solving
Search & Retrieve | API wrappers for internet search (e.g., Google, Wiki) | Factual question answering
WebShop | Simulated e-commerce site with a DOM interaction API | Sequential navigation, planning
ALFWorld | Text-based embodied environment interface | Multi-step action planning
ScienceWorld | Grade-school science sandbox requiring experimentation | Reasoning, multi-turn tool usage

These environments allow systematic evaluation of agent reasoning, planning, exploration, and tool-use competencies under RL algorithms.

6. Memory-Augmented Markov Decision Process (M-MDP) and Continual Adaptation

The "Fine-tuning LLM Agents without Fine-tuning LLMs" paradigm (Zhou et al., 22 Aug 2025) extends AgentFly to memory-based online RL, formalizing the agent's operation as an M-MDP. In this setting, the agent augments the state s and action a spaces with a case memory M, which stores episodic experiences (s, a, r).

At each decision point, a neural case-selection policy μ(c|s, M) retrieves salient past cases to condition the LLM's output:

\pi(a|s, M) = \sum_{c \in M} \mu(c|s, M) \cdot p_{\text{LLM}}(a|s, c)

A soft Q-learning criterion is adopted for updating the case-selection policy:

\mu^*(c|s, M) = \frac{\exp(Q^*(s, M, c)/\alpha)}{\sum_{c' \in M} \exp(Q^*(s, M, c')/\alpha)}

where α is an entropy temperature hyperparameter.

Case selection can be non-parametric (e.g., cosine similarity over embeddings) or parametric (a learned neural Q-function), enabling efficient retrieval and adaptation without gradient-based fine-tuning of the base LLM. This mechanism supports continual, low-cost agent adaptation in open-ended settings (e.g., DeepResearcher, GAIA).
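
To make these formulas concrete, the following NumPy sketch implements the softmax case-selection rule, using cosine similarity to stored cases as a stand-in Q-function for the non-parametric variant. It is an illustrative reading of the equations above (array sizes and the embedding model are assumptions), not the paper's code.

import numpy as np

def case_selection_probs(q_values, alpha=0.1):
    """mu*(c|s, M): softmax over Q(s, M, c) / alpha (soft Q-learning rule)."""
    logits = np.asarray(q_values, dtype=float) / alpha
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def nonparametric_q(state_emb, case_embs):
    """Non-parametric variant: cosine similarity to stored cases as Q-values."""
    state = state_emb / np.linalg.norm(state_emb)
    cases = case_embs / np.linalg.norm(case_embs, axis=1, keepdims=True)
    return cases @ state

# Toy usage: select a stored case to condition the LLM's next action on.
rng = np.random.default_rng(0)
memory_embs = rng.normal(size=(5, 8))           # embeddings of 5 stored cases
state_emb = rng.normal(size=8)                  # embedding of the current state
mu = case_selection_probs(nonparametric_q(state_emb, memory_embs), alpha=0.1)
chosen = int(rng.choice(len(mu), p=mu))         # sample a case c ~ mu(c|s, M)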

7. Empirical Performance and Open-Source Implementation

AgentFly achieves strong empirical results on demanding benchmarks:

  • On GAIA validation, top-1 performance of 87.88% Pass@3; 79.40% on the test split.
  • On DeepResearcher: 66.6% F1 and 80.4% Partial Match (PM).
  • Case-based memory confers a 4.7% to 9.6% absolute improvement on out-of-distribution tasks.

AgentFly is distributed as open-source software with complete documentation and integration recipes:

https://github.com/Agent-on-the-Fly/AgentFly

The codebase provides templates for tool/environment definitions, RL method customization, and distributed training.

8. Significance and Outlook

AgentFly establishes a rigorous, modular, and efficient standard for RL-based LLM agents. By decoupling skill acquisition from direct LLM parameter updates and leveraging extensible interfaces, asynchronous scaling, and memory-augmented decision-theoretic policies, AgentFly represents a step toward generalist, continuously improving AI agents. The framework enables rapid experimentation on reasoning, planning, and open-ended research tasks, providing both a testbed and a deployment mechanism for next-generation agent capabilities.

References (2)
1. Wang et al., 20 Jul 2025.
2. Zhou et al., 22 Aug 2025 ("Fine-tuning LLM Agents without Fine-tuning LLMs").
