AgentFly: RL-Driven LLM Agent Framework
- AgentFly is a reinforcement learning framework that integrates token-level RL masking, decorator extensibility, and memory-augmented MDPs to enable continual, multi-turn skill acquisition.
- It decouples agent-environment interactions from RL optimization, supporting asynchronous execution and efficient resource management for high-throughput tasks.
- AgentFly’s open-source design offers modular tools and environments for real-world applications such as webshop navigation and scientific problem solving.
AgentFly refers to a family of reinforcement learning (RL) frameworks and agent models designed to empower large language model (LLM) agents with adaptive, scalable, and extensible capabilities for complex tasks. These frameworks address the limitations of prompt engineering and supervised fine-tuning by systematically integrating RL, particularly in multi-turn, tool-using, and open-ended task settings, while offering efficient mechanisms for continual learning and adaptation. Key architectural contributions include token-level RL masking, decorator-based extensibility for tool and reward definitions, asynchronous resource management for scalable environment interaction, and a memory-augmented Markov Decision Process (M-MDP) formalism for continual skill acquisition without LLM fine-tuning.
1. Architectural Overview
AgentFly was introduced to support extensible and scalable RL for LM agents operating in diverse environments and interfaces (Wang et al., 20 Jul 2025). The framework builds upon Verl (a baseline RL platform), introducing a separate agent module responsible for managing rollouts, multi-turn action generation, and seamless integration with tool APIs. The design decouples agent-environment interactions from RL optimization logic, enabling state-of-the-art RL algorithms—including PPO, REINFORCE++, GRPO, and RLOO—to be applied with minimal modifications.
AgentFly formalizes multi-turn agent–environment interaction as a token-level trajectory in which language outputs, tool calls, and environment observations are interleaved:

$$\tau = (q,\, r_1,\, o_1,\, r_2,\, o_2,\, \ldots,\, r_T)$$

where $q$ is the input query, $r_t$ is the LM-generated response (which may contain tool calls) at turn $t$, and $o_t$ is the resulting tool or environment observation. AgentFly’s masking strategy ensures that only LM-generated tokens (the responses $r_1, r_2, \ldots, r_T$) contribute to the RL objective, while tool- or environment-injected tokens are ignored in loss computation.
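The masking strategy can be made concrete with a short sketch (not the framework's actual API) that flattens an interleaved trajectory into token IDs with a parallel loss mask; the tokenizer choice and tool-call markup below are illustrative assumptions:

```python
# Sketch: flatten an interleaved trajectory into token IDs plus a loss mask.
# The tokenizer and tool-call markup are illustrative, not AgentFly's format.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# (text, is_lm_generated) pairs: the query q and observations o_t are masked out,
# while LM responses r_t (including emitted tool calls) are kept for the RL loss.
trajectory = [
    ("What is 37 * 41?", False),                          # query q
    ("<tool_call>calculator: 37*41</tool_call>", True),   # response r_1
    ("<tool_result>1517</tool_result>", False),           # observation o_1
    ("37 * 41 = 1517.", True),                            # response r_2
]

token_ids, loss_mask = [], []
for text, is_lm in trajectory:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    token_ids.extend(ids)
    loss_mask.extend([1 if is_lm else 0] * len(ids))

assert len(token_ids) == len(loss_mask)
```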
2. Token-Level Reinforcement Learning and Multi-Turn Support
Standard RL for LLMs requires careful handling of which segments of an episode should influence learning. AgentFly introduces a token-level masking technique; in its simplest (REINFORCE-style) form, the masked objective is

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau}\!\left[\sum_{i} m_i\, \hat{A}_i \log \pi_\theta(y_i \mid y_{<i})\right]$$

where $m_i = 1$ if $y_i$ is a token generated by the LM, $m_i = 0$ otherwise, and $\hat{A}_i$ is the advantage estimate for token $i$. In multi-turn settings, this mask aggregates over all LM-generated responses, ensuring that reward signals backpropagate only through the agent’s own output, not through tokens generated by the environment or by tool outputs.
This approach enables AgentFly to support rich, interactive task structures where agents may need to reason, plan, execute multi-step tool use, and interact with simulated or real-world APIs.
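As an illustration of how such a mask enters the loss (a minimal sketch in PyTorch, not AgentFly's internal implementation; the function name and tensor shapes are assumptions):

```python
# Sketch: apply a per-token loss mask in a REINFORCE-style objective (PyTorch).
import torch
import torch.nn.functional as F

def masked_policy_gradient_loss(logits, token_ids, loss_mask, advantages):
    """logits: [T, vocab]; token_ids (long), loss_mask, advantages: [T]."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    # Zero out tokens injected by tools or the environment (mask = 0).
    weighted = token_log_probs * advantages * loss_mask
    # Average over LM-generated tokens only.
    return -weighted.sum() / loss_mask.sum().clamp(min=1)
```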
3. Extensibility: Decorator-Based Tool and Reward Definition
AgentFly employs a decorator-based programming interface, facilitating modular addition of tools, environments, and reward functions in Python. This design allows users to annotate ordinary Python functions as tools or rewards:
```python
@tool(name="calculator")
def calculate(expression: str):
    return eval(expression)
```
For stateful or more complex tools (e.g., code interpreters, WebShop, ALFWorld), additional metadata (such as environment class and resource pool size) is specified in decorators. The uniform interface schema ensures consistent access and manipulation by the agent, while greatly reducing developer burden for extensibility and rapid prototyping.
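As a hypothetical illustration of this pattern (the `@reward` decorator name, the `env_cls`/`pool_size` arguments, and `PythonSandboxEnv` are assumptions chosen to mirror the metadata described above, not AgentFly's documented signature):

```python
# Hypothetical sketch of stateful-tool and reward registration.
# PythonSandboxEnv, env_cls, and pool_size are illustrative placeholders.

@tool(name="python_interpreter", env_cls=PythonSandboxEnv, pool_size=64)
def run_code(env, code: str):
    # `env` is a sandboxed interpreter instance drawn from the resource pool.
    return env.execute(code)

@reward(name="exact_match")
def exact_match_reward(prediction: str, ground_truth: str) -> float:
    # Terminal reward: 1.0 if the final answer matches the reference, else 0.0.
    return 1.0 if prediction.strip() == ground_truth.strip() else 0.0
```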
4. Scalability: Asynchronous Execution and Resource Management
To address bottlenecks in high-throughput RL, particularly during I/O-heavy tool invocations and complex simulators, AgentFly provides two mechanisms (illustrated in the sketch after this list):
- Asynchronous execution chains: Each query may initiate multiple rollout chains, enabling pipeline throughput to maximize available computational and I/O resources. This mechanism maintains non-blocking execution even with thousands of parallel tool calls.
- Centralized resource management: Stateful tools/environments are managed in resource pools. When a rollout chain invokes a tool, AgentFly allocates an instance from the pool, binds it to the chain, and recycles it upon task completion. This permits elastic scaling across heterogeneous compute and simulation backends.
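The following simplified asyncio sketch illustrates both ideas, a pool of stateful environments shared by many concurrent rollout chains; it shows the pattern only, not AgentFly's implementation:

```python
# Illustrative sketch (not AgentFly's implementation) of asynchronous rollout
# chains sharing a pool of stateful environment instances via asyncio.
import asyncio

class EnvPool:
    """Hands out environment instances to rollout chains and recycles them."""
    def __init__(self, env_factory, size: int):
        self._queue = asyncio.Queue()
        for _ in range(size):
            self._queue.put_nowait(env_factory())

    async def acquire(self):
        return await self._queue.get()

    def release(self, env):
        self._queue.put_nowait(env)

async def rollout_chain(chain_id: int, pool: EnvPool):
    env = await pool.acquire()          # bind an instance to this chain
    try:
        # Placeholder for multi-turn generation and tool calls against `env`.
        await asyncio.sleep(0.01)
        return f"chain-{chain_id} done"
    finally:
        pool.release(env)               # recycle the instance for reuse

async def main():
    pool = EnvPool(env_factory=dict, size=8)   # `dict` stands in for a real env
    results = await asyncio.gather(*(rollout_chain(i, pool) for i in range(100)))
    print(len(results), "rollouts completed")

asyncio.run(main())
```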
5. Prebuilt Tools, Environments, and Task Diversity
AgentFly includes a suite of prebuilt tools and environments demonstrating the flexibility of the framework:
Tool/Environment | Description | Key Use Case |
---|---|---|
Code Interpreter | Secure Python execution in a container | Mathematical problem solving |
Search & Retrieve | API wrappers for internet search (e.g., Google, Wiki) | Factual question answering |
WebShop | Simulated e-commerce site, DOM interaction API | Sequential navigation, planning |
ALFWorld | Text-based embodied environment interface | Multi-step action planning |
ScienceWorld | Grade-school science sandbox, requires experimentation | Reasoning, multi-turn tool usage |
These environments allow systematic evaluation of agent reasoning, planning, exploration, and tool-use competencies under RL algorithms.
6. Memory-Augmented Markov Decision Process (M-MDP) and Continual Adaptation
The "Fine-tuning LLM Agents without Fine-tuning LLMs" paradigm (Zhou et al., 22 Aug 2025) extends AgentFly to memory-based online RL, formalizing the agent’s operation as an M-MDP. In this setting, the agent augments state and action spaces with a case memory , storing episodic experiences .
At each decision point, a case-selection policy $\mu(c \mid s, M)$ retrieves salient past cases to condition the LLM’s output:

$$\pi(a \mid s, M) = \sum_{c \in M} p_{\mathrm{LLM}}(a \mid s, c)\,\mu(c \mid s, M)$$
A soft Q-learning criterion is adopted for updating the case-selection policy:

$$Q(s, c) \leftarrow r + \gamma\,\alpha \log \sum_{c' \in M} \exp\!\big(Q(s', c')/\alpha\big)$$

where $\alpha$ is an entropy temperature hyperparameter and $\gamma$ is the discount factor; the case-selection policy is then proportional to $\exp\!\big(Q(s, c)/\alpha\big)$.
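A worked numeric sketch of this update (the Q-values, reward, and discount below are made-up placeholders) computes the soft value target and the resulting Boltzmann case-selection distribution:

```python
# Sketch: soft Q-learning target and Boltzmann case-selection distribution.
# Q-values, reward, and discount are illustrative placeholders.
import numpy as np
from scipy.special import logsumexp

alpha, gamma = 0.5, 0.99             # entropy temperature, discount factor
r = 1.0                              # observed reward
q_next = np.array([0.2, 1.3, -0.4])  # Q(s', c') for cases currently in memory

# Soft value of the next state: V(s') = alpha * log sum_c' exp(Q(s', c') / alpha)
v_next = alpha * logsumexp(q_next / alpha)
target = r + gamma * v_next          # regression target for Q(s, c)

# Case-selection distribution: mu(c | s, M) proportional to exp(Q(s, c) / alpha)
q_curr = np.array([0.9, 0.1, 0.5])
mu = np.exp((q_curr - q_curr.max()) / alpha)
mu /= mu.sum()
print(target, mu)
```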
Case selection can be non-parametric (e.g., cosine similarity) or parametric (neural Q-function), enabling efficient retrieval, adaptation, and avoidance of gradient-based fine-tuning on the base LLM. This mechanism supports continual and low-cost agent adaptation in open-ended settings (e.g., DeepResearcher, GAIA).
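The non-parametric variant can be sketched as cosine-similarity retrieval over embedded past cases; the embedding model and case schema below are illustrative assumptions:

```python
# Sketch: non-parametric case retrieval by cosine similarity over embeddings.
# The embedding model and case format are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Each case pairs a past state (query/context) with the action taken and reward.
case_memory = [
    {"state": "Find the population of France in 2020", "action": "search(...)", "reward": 1.0},
    {"state": "Compute 12% of 3,450", "action": "calculator(...)", "reward": 1.0},
]
case_embeddings = encoder.encode([c["state"] for c in case_memory], normalize_embeddings=True)

def retrieve(state: str, k: int = 1):
    """Return the top-k most similar past cases to condition the LLM prompt."""
    query = encoder.encode([state], normalize_embeddings=True)[0]
    scores = case_embeddings @ query          # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [case_memory[i] for i in top]

print(retrieve("What fraction is 15% of 2,000?"))
```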
7. Empirical Performance and Open-Source Implementation
AgentFly achieves strong empirical results on demanding benchmarks:
- On the GAIA benchmark, AgentFly reports top-1 Pass@3 performance on the validation split, along with competitive results on the held-out test split.
- On DeepResearcher, it reports strong F1 and Partial Match (PM) scores.
- Case-based memory confers measurable absolute improvements on out-of-distribution tasks.
AgentFly is distributed as open-source software with complete documentation and integration recipes.
The codebase provides templates for tool/environment definitions, RL method customization, and distributed training.
8. Significance and Outlook
AgentFly establishes a rigorous, modular, and efficient standard for RL-based LLM agents. By decoupling skill acquisition from direct LLM parameter updates and leveraging extensible interfaces, asynchronous scaling, and memory-augmented decision-theoretic policies, AgentFly represents a step toward generalist, continuously improving AI agents. The framework enables rapid experimentation on reasoning, planning, and open-ended research tasks, establishing a testbed and deployment mechanism for next-generation agent capabilities.