
AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training

Published 2 Jul 2025 in cs.LG and cs.AI | arXiv:2507.01663v1

Abstract: Reinforcement learning (RL) has become a pivotal technology in the post-training phase of LLMs. Traditional task-colocated RL frameworks suffer from significant scalability bottlenecks, while task-separated RL frameworks face challenges in complex dataflows and the corresponding resource idling and workload imbalance. Moreover, most existing frameworks are tightly coupled with LLM training or inference engines, making it difficult to support custom-designed engines. To address these challenges, we propose AsyncFlow, an asynchronous streaming RL framework for efficient post-training. Specifically, we introduce a distributed data storage and transfer module that provides a unified data management and fine-grained scheduling capability in a fully streamed manner. This architecture inherently facilitates automated pipeline overlapping among RL tasks and dynamic load balancing. Moreover, we propose a producer-consumer-based asynchronous workflow engineered to minimize computational idleness by strategically deferring the parameter update process within staleness thresholds. Finally, the core capability of AsyncFlow is architecturally decoupled from underlying training and inference engines and encapsulated by service-oriented user interfaces, offering a modular and customizable user experience. Extensive experiments demonstrate an average 1.59× throughput improvement compared with state-of-the-art baselines. The architecture presented in this work provides actionable insights for next-generation RL training system designs.

Summary

  • The paper introduces a novel asynchronous streaming RL framework that decouples tasks, achieving up to 2.03× throughput improvement over traditional methods.
  • It presents the TransferQueue module that enables efficient concurrent data management and dynamic task scheduling in RL workflows.
  • The design supports engine-agnostic integration and dynamic load balancing, enhancing scalability across diverse computing environments.

The paper "AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training" presents an approach to the challenges of reinforcement learning (RL) in the post-training phase of LLMs. The framework aims to overcome the scalability limitations of traditional task-colocated RL systems and the inefficiencies of task-separated RL approaches.

Introduction

AsyncFlow introduces a novel asynchronous streaming RL framework to enhance the post-training efficiency of LLMs. Traditional RL frameworks face two major challenges: scalability bottlenecks due to task-colocation and complex dataflows with underutilized resources in task-separated setups. Existing frameworks are often tightly integrated with specific training or inference engines, limiting flexibility. AsyncFlow addresses these issues through a decoupled architecture that supports a wide range of backends, allowing for dynamic load balancing and pipeline overlapping.

System Architecture

AsyncFlow's architecture consists of several layers:

  1. Resource Layer: Utilizes Ray for computing resource management and optimized hardware allocation.
  2. Backend Layer: Provides modular adapters compatible with various training and inference engines, maintaining engine-agnostic RL task execution.
  3. Optimization Layer: Implements the TransferQueue for data management and an asynchronous workflow to maximize computational resource utilization.
  4. Interface Layer: Offers a unified algorithm entry point and service-oriented APIs for seamless integration into different infrastructures.

    Figure 1: System overview of AsyncFlow framework.
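The layering described above can be sketched as a minimal composition: a backend adapter wraps an arbitrary engine, a queue streams its outputs, and a runtime object serves as the unified entry point. All class and method names here are illustrative stand-ins, not AsyncFlow's actual API.

```python
class BackendAdapter:
    """Backend layer sketch: wraps an arbitrary training/inference engine
    behind a uniform call, keeping RL task execution engine-agnostic."""
    def __init__(self, engine_name):
        self.engine_name = engine_name

    def run(self, task):
        # A real adapter would dispatch into the wrapped engine here.
        return f"{self.engine_name}:{task}"


class StreamQueueStub:
    """Optimization layer sketch: streams results between RL tasks."""
    def __init__(self):
        self._items = []

    def put(self, item):
        self._items.append(item)

    def get(self):
        return self._items.pop(0)


class RuntimeSketch:
    """Interface layer sketch: one entry point composed over the layers."""
    def __init__(self, adapter, queue):
        self.adapter = adapter
        self.queue = queue

    def submit(self, task):
        self.queue.put(self.adapter.run(task))


runtime = RuntimeSketch(BackendAdapter("vllm"), StreamQueueStub())
runtime.submit("rollout")
print(runtime.queue.get())  # vllm:rollout
```

Because the runtime only sees the adapter interface, swapping the underlying engine requires no change to the algorithm-facing code, which is the decoupling the paper emphasizes.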

TransferQueue: High-Performance Data Management

TransferQueue, a centralized data management module, is a cornerstone of AsyncFlow. It supports asynchronous, distributed data storage and transfer, enabling efficient handling of data dependencies across RL tasks.

Architecture Design

TransferQueue separates the data plane from the control plane, managing task-specific data components in a 2D columnar format to support concurrent read/write operations.

Figure 2: Architecture design of TransferQueue, showing metadata communication and data handling processes.
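A hedged sketch of the idea, under the assumption that each sample is a row and each task-specific field (prompt, response, reward, etc.) is a column: producers write individual columns as they finish, the control plane tracks per-column readiness, and consumers block until the columns they depend on are present. This is an illustration of the concurrency pattern, not TransferQueue's real interface.

```python
import threading
from collections import defaultdict


class TransferQueueSketch:
    """Illustrative 2D columnar store: data plane holds row -> {column: value},
    control plane tracks which columns of each row are ready."""

    def __init__(self):
        self._data = defaultdict(dict)    # data plane
        self._ready = defaultdict(set)    # control plane: ready columns per row
        self._cv = threading.Condition()

    def put(self, row, column, value):
        """Producer writes one field of one sample; readiness is broadcast."""
        with self._cv:
            self._data[row][column] = value
            self._ready[row].add(column)
            self._cv.notify_all()         # metadata-style notification

    def get(self, row, columns):
        """Consumer blocks until all requested columns of the sample exist."""
        wanted = set(columns)
        with self._cv:
            self._cv.wait_for(lambda: wanted <= self._ready[row])
            return {c: self._data[row][c] for c in wanted}


q = TransferQueueSketch()
q.put(0, "prompt", "hello")
q.put(0, "response", "world")
print(q.get(0, ["prompt", "response"]))
```

Because readiness is tracked per (row, column) rather than per batch, a downstream task can start consuming a sample's fields as soon as they stream in, which is what enables the fully streamed, overlapped execution the paper describes.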

Metadata and Control Plane

The control plane provides a centralized view of data statuses, updated via metadata notifications when new data is available. This setup enables real-time dynamic allocation of tasks, reducing idle times and improving load balancing in RL workflows.

Figure 3: Metadata notification process ensures controllers are updated with new data availability.
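One way such a centralized view enables dynamic load balancing: as metadata marks samples ready, the controller can hand each one to the currently least-loaded worker. The function below is a hypothetical greedy allocation step, not the paper's actual scheduling algorithm.

```python
import heapq


def assign_ready_samples(ready_sample_ids, workers):
    """Hypothetical load-balancing step: route each newly ready sample
    to the least-loaded worker, tracked with a min-heap of (load, name).
    Returns {worker: [sample ids]}."""
    heap = [(0, w) for w in workers]
    heapq.heapify(heap)
    assignment = {w: [] for w in workers}
    for sid in ready_sample_ids:
        load, worker = heapq.heappop(heap)   # least-loaded worker
        assignment[worker].append(sid)
        heapq.heappush(heap, (load + 1, worker))
    return assignment


print(assign_ready_samples(range(5), ["w0", "w1"]))
```

Because allocation happens sample-by-sample as readiness notifications arrive, stragglers on one worker do not stall the whole batch, in contrast to static pre-partitioning.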

Asynchronous Workflow Optimization

AsyncFlow implements an asynchronous workflow to minimize pipeline bubbles and hardware idling by overlapping RL tasks.

Delayed Parameter Update

A delayed parameter update mechanism allows continuous actor rollout, even as updates are processed, avoiding synchronization overhead and extending the steady phase of RL pipelines.

Figure 4: Asynchronous off-policy RL workflow leveraging the delayed parameter update mechanism.
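The core of the mechanism can be sketched as a version-gap rule: the rollout actor keeps generating with slightly stale weights and only pays the synchronization pause when deferring further would exceed a staleness threshold. This is a minimal sketch of that rule under assumed semantics; names and the threshold convention are illustrative.

```python
class DelayedUpdateActor:
    """Illustrative sketch (not AsyncFlow's API): the rollout actor defers
    weight syncs so generation continues, but never lets its weights lag
    the trainer by more than `staleness_limit` versions."""

    def __init__(self, staleness_limit=1):
        self.staleness_limit = staleness_limit
        self.weights_version = 0

    def maybe_sync(self, trainer_version):
        """Sync only when deferring further would exceed the threshold."""
        if trainer_version - self.weights_version > self.staleness_limit:
            self.weights_version = trainer_version  # pull fresh weights
            return True                             # paid a sync pause
        return False                                # keep rolling out


actor = DelayedUpdateActor(staleness_limit=1)
syncs = [actor.maybe_sync(v) for v in [1, 2, 3, 4]]
print(syncs)  # [False, True, False, True]
```

With `staleness_limit=1` the actor skips every other sync, halving the number of pipeline stalls at the cost of mild off-policyness, which is the trade-off the staleness threshold controls.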

Scheduling and Resource Planning

Using a hybrid cost model combining analytical and profiling-based methods, AsyncFlow optimizes resource allocation, minimizing end-to-end execution times of RL workflows.
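A hybrid cost model of this kind might look as follows: an analytical roofline-style compute estimate is corrected by a ratio measured in short profiling runs, plus a synchronization term that grows with the number of workers. All parameter names and the concrete numbers are illustrative assumptions, not the paper's actual model.

```python
def hybrid_cost(tokens, workers, flops_per_token, device_flops,
                profiled_ratio, comm_overhead_s):
    """Hedged sketch of a hybrid (analytical + profiling) cost model."""
    analytical = (tokens / workers) * flops_per_token / device_flops
    compute = analytical * profiled_ratio      # profiling-based correction
    comm = comm_overhead_s * workers           # sync cost grows with workers
    return compute + comm


def best_worker_count(options, **kw):
    """Resource planning step: pick the allocation minimizing estimated time."""
    return min(options, key=lambda w: hybrid_cost(workers=w, **kw))


cfg = dict(tokens=1e6, flops_per_token=2e9, device_flops=1e14,
           profiled_ratio=1.5, comm_overhead_s=1.0)
print(best_worker_count([1, 2, 4, 8, 16], **cfg))  # 4
```

The profiling correction captures effects the analytical term misses (kernel launch overheads, memory-bound phases), while the analytical term lets the planner extrapolate to allocations that were never profiled directly.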

Service-Oriented User Interface

AsyncFlow's interface design separates algorithm-specific tasks from backend operations:

  • User-Level Interface: Provides RL algorithm controllers for easy integration and management.
  • Backend-Level Interface: Ensures compatibility with diverse backend engines through adapter classes.

    Figure 5: Architecture of service-oriented user interface, enabling easy integration with various backend systems.
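The two-level split can be illustrated with a small adapter hierarchy: each engine gets a thin backend-level adapter, and the user-level controller holds algorithm logic written once against the adapter interface. Class names and the tiny "engines" are hypothetical stand-ins for real backends such as vLLM or Megatron-style trainers.

```python
class EngineAdapter:
    """Backend-level interface sketch: a uniform surface over any engine."""
    def generate(self, prompts):
        raise NotImplementedError


class InferenceEngineAdapter(EngineAdapter):
    """Stand-in for an adapter around an inference engine."""
    def generate(self, prompts):
        return [f"[gen]{p}" for p in prompts]


class AltEngineAdapter(EngineAdapter):
    """A second adapter, showing that engines are interchangeable."""
    def generate(self, prompts):
        return [p.upper() for p in prompts]


class RLController:
    """User-level interface sketch: algorithm logic is engine-agnostic."""
    def __init__(self, rollout_engine: EngineAdapter):
        self.rollout_engine = rollout_engine

    def rollout_step(self, prompts):
        # Any algorithm-side pre/post-processing would live here,
        # independent of which adapter is plugged in.
        return self.rollout_engine.generate(prompts)


ctrl = RLController(InferenceEngineAdapter())
print(ctrl.rollout_step(["hi"]))  # ['[gen]hi']
```

Swapping `InferenceEngineAdapter` for `AltEngineAdapter` changes the backend without touching `RLController`, which is the modularity the service-oriented interface is designed to preserve.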

Evaluation

Extensive experiments demonstrate that AsyncFlow achieves up to 2.03× throughput improvement over state-of-the-art baselines in large-scale settings. The task-separated architecture exhibits superior scalability and resource utilization compared to task-colocated systems.

Figure 6: End-to-end throughput and scalability analysis underscore AsyncFlow's advantages in large clusters.

Conclusion

AsyncFlow effectively bridges the gap between research and industrial applications by providing a scalable and flexible framework for LLM post-training. Future developments may include further optimization of actor rollout and parameter updates to enhance the sub-step asynchronous workflow. The proposed innovations in data management and task scheduling pave the way for more efficient RL systems, particularly in large-scale AI deployments.
