AsyncFlow: Asynchronous Streaming RL Framework
- AsyncFlow is an asynchronous streaming reinforcement learning framework that decouples data management and engine operations for scalable LLM post-training.
- It utilizes a modular, hierarchical architecture with a distributed TransferQueue to optimize resource allocation and minimize idle pipeline time.
- Empirical results show up to 2.03× throughput improvement and near-linear scaling efficiency, ensuring stable performance with minimal staleness.
AsyncFlow is an asynchronous streaming reinforcement learning (RL) framework designed for scalable and efficient post-training of LLMs. The framework addresses the major bottlenecks of previous RL-based LLM post-training systems—specifically, scalability limitations in task-colocated frameworks, idling and imbalance in task-separated frameworks, and tight coupling with specific training or inference engines. AsyncFlow introduces a modular, hierarchical architecture featuring a distributed data management system, fine-grained pipeline scheduling, and a producer-consumer asynchronous workflow. Extensive experiments demonstrate substantial throughput improvements over state-of-the-art baselines, positioning AsyncFlow as a reference architecture for next-generation RL post-training in both research and industrial settings.
1. Architectural Overview
AsyncFlow is composed of four hierarchical layers, each providing specific functionality and collectively enabling engine-agnostic, scalable RL post-training:
- Resource Layer: Leverages Ray for distributed resource management, optimizing hardware resource allocation based on execution-time simulation.
- Backend Layer: Implements modular adapters for a variety of LLM training/inference backends (e.g., FSDP, DeepSpeed, vLLM, MindSpeed), ensuring decoupling from any particular engine.
- Optimization Layer: Contains the TransferQueue module—a distributed, streaming data storage/transfer system—and orchestrates the asynchronous producer-consumer workflow for all RL subtasks.
- Interface Layer: Provides service-oriented APIs for both RL algorithm users (e.g., the `Trainer` class) and backend integrators (e.g., the `Adapter` class).
A central architectural innovation is the TransferQueue, which serves as a unified, atomic, distributed queue for intermediate results of RL tasks. It exposes control-plane interfaces for metadata management and a data plane for scalable streaming data access. Each RL subtask (actor rollout, reward, critic, update, etc.) interacts with the TransferQueue's controller to exchange fine-grained data, supporting high concurrency and dynamic scheduling.
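The control-plane/data-plane split can be pictured with a minimal in-process sketch. Everything below (`TransferQueueController`, `StorageShard`, `ready_batch`, the column names) is an illustrative stand-in rather than AsyncFlow's actual API; the point is only that a subtask first asks the controller which sample indices have the columns it needs (metadata), then pulls the payloads from storage (data).

```python
# Illustrative sketch only: class and method names are hypothetical, not AsyncFlow's API.
from dataclasses import dataclass, field


@dataclass
class StorageShard:
    """Data plane: holds the actual payloads for a slice of global sample indices."""
    rows: dict = field(default_factory=dict)   # global_index -> {column_name: payload}

    def put(self, index, column, payload):
        self.rows.setdefault(index, {})[column] = payload

    def get(self, index, column):
        return self.rows[index][column]


class TransferQueueController:
    """Control plane: tracks only metadata (which columns are ready for which rows)."""

    def __init__(self, shard):
        self.shard = shard            # placement logic elided; one shard for the sketch
        self.ready = {}               # global_index -> set of ready columns

    def mark_ready(self, index, column):
        self.ready.setdefault(index, set()).add(column)

    def ready_batch(self, required_columns, batch_size):
        """Return indices whose required columns are all available."""
        hits = [i for i, cols in self.ready.items() if set(required_columns) <= cols]
        return hits[:batch_size]


# Example: a reward subtask consumes rollout responses produced by the actor.
shard = StorageShard()
ctrl = TransferQueueController(shard)

for i in range(2):                                   # actor rollout acts as producer
    shard.put(i, "response", f"generated text {i}")
    ctrl.mark_ready(i, "response")

batch = ctrl.ready_batch(required_columns=["response"], batch_size=2)
rewards = {i: len(shard.get(i, "response")) for i in batch}   # toy reward computation
print(rewards)
```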
2. Producer-Consumer-Based Asynchronous Workflow
AsyncFlow implements an asynchronous workflow across all RL tasks based on the producer-consumer model:
- Pipeline Overlap: All RL sub-tasks operate concurrently, continuously consuming and producing data through the TransferQueue, thereby achieving automated pipeline parallelism.
- Minimized Idleness (Pipeline Bubbles): In contrast to strict on-policy regimes, the actor rollout is allowed to proceed with slightly stale parameters (typically one step old), deferring synchronization with the actor update. This deferral, bounded by a staleness threshold, is sufficient in practice to maintain stable and effective RL learning (see the code sketch below).
- Sub-step Asynchrony: Actor rollout workers update their parameters in a pipelined, sequential (per-device) manner; updated parameters are phased in gradually, reducing hardware idle time to a minimum.
- Dynamic Load Balancing: Schedulers utilize real-time metadata to redirect workloads among subtasks, adapt to imbalances, and fully utilize heterogeneous hardware.
Figure 6 in the paper visually contrasts traditional on-policy RL (prominent idleness at pipeline boundaries) with AsyncFlow's asynchrony (pipelines always filled, minimal warm-up/cool-down bubbles).
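A minimal sketch of this bounded-staleness, producer-consumer pattern is shown below, using plain Python threads and a queue as stand-ins for AsyncFlow's Ray workers and TransferQueue; the variable names and synchronization scheme are illustrative assumptions, not the framework's implementation. The key point is that the rollout worker never blocks on a full update barrier; it only re-syncs weights when it would otherwise exceed the staleness bound.

```python
# Illustrative sketch of bounded-staleness pipelining; not AsyncFlow's actual implementation.
import queue
import threading

MAX_STALENESS = 1                    # rollout may run at most one parameter version behind
experience = queue.Queue(maxsize=2)  # stands in for the streaming TransferQueue
trainer_version = 0                  # latest parameters produced by the actor update
rollout_version = 0                  # parameters currently loaded in the rollout workers
lock = threading.Lock()


def rollout_worker(num_batches):
    """Producer: keeps generating with slightly stale weights instead of waiting."""
    global rollout_version
    for step in range(num_batches):
        with lock:
            if trainer_version - rollout_version > MAX_STALENESS:
                rollout_version = trainer_version    # deferred weight synchronization
            version_used = rollout_version
        experience.put({"step": step, "policy_version": version_used})


def trainer(num_batches):
    """Consumer: one actor update per streamed batch."""
    global trainer_version
    for _ in range(num_batches):
        experience.get()             # consume a rollout batch as soon as it arrives
        with lock:
            trainer_version += 1


producer = threading.Thread(target=rollout_worker, args=(8,))
consumer = threading.Thread(target=trainer, args=(8,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print("finished with rollout lag never exceeding", MAX_STALENESS, "step")
```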
3. Decoupling and Modular Design
A key principle in AsyncFlow is the decoupling of optimization logic from the underlying training and inference engines:
- The Backend Layer employs adapter modules to interface with diverse engine backends; these adapters mediate parameter updates, inference requests, and task-specific control signals.
- The RL workflow, along with data management and scheduling logic, is implemented independently in upper layers and communicates with backend engines through standardized interfaces.
- This architectural separation ensures that new engine/accelerator types (including custom research platforms or future industrial standards) can be adopted by implementing the appropriate adapter, with no changes required to the core RL workflow or TransferQueue logic.
Service-oriented APIs (see Section 6) encapsulate the full interaction surface: algorithm developers use the `Trainer` class for orchestration, while engine integrators implement the `Adapter` class for backend bridging.
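As a rough illustration of this adapter pattern, the sketch below defines a hypothetical `Adapter` base class with a toy engine behind it. The method names (`generate`, `update_weights`) are assumptions made for the example and do not reflect AsyncFlow's actual interface; a real adapter would wrap an engine such as vLLM or FSDP behind the same surface.

```python
# Illustrative adapter sketch; the interface shown here is hypothetical.
from abc import ABC, abstractmethod


class Adapter(ABC):
    """Bridges the engine-agnostic RL workflow to a concrete training/inference engine."""

    @abstractmethod
    def generate(self, prompts):
        """Run rollout inference for a list of prompts."""

    @abstractmethod
    def update_weights(self, state_dict):
        """Push fresh actor parameters into the engine."""


class ToyEngineAdapter(Adapter):
    """Adapter for a made-up in-process engine, standing in for vLLM, FSDP, etc."""

    def __init__(self):
        self.weights = {"bias": 0.0}

    def generate(self, prompts):
        return [f"{p} -> reply (bias={self.weights['bias']})" for p in prompts]

    def update_weights(self, state_dict):
        self.weights.update(state_dict)


# The upper layers only ever see the Adapter interface, so swapping engines
# means swapping this one class and nothing in the RL workflow.
engine = ToyEngineAdapter()
print(engine.generate(["hello"]))
engine.update_weights({"bias": 0.5})
print(engine.generate(["hello"]))
```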
4. Empirical Performance and Scaling Properties
AsyncFlow is evaluated on Qwen2.5 models (7B–32B) across NPU clusters ranging from 32 to 1024 devices:
- Throughput: AsyncFlow achieves an average throughput improvement of 1.59× compared to verl (the state-of-the-art task-colocated RL framework), with up to 2.03× at large scale (256 NPUs, 7B model).
- Scaling Efficiency: Sustains scaling efficiencies of 0.65 (32B LLM) and 0.88 (7B LLM) when increasing cluster size by 16×, indicating near-linear scaling behavior (see the arithmetic sketch below).
- Ablation Analysis: The TransferQueue module alone doubles throughput relative to baseline; the asynchronous workflow (1-step staleness) contributes an additional 36.3% throughput gain.
- Stability: No degradation of RL convergence or policy reward is observed at the 1-step staleness setting; degradation appears only when the staleness threshold is increased substantially, in line with previous reports on off-policy RL.
Table 1 in the paper provides normalized throughput figures for baseline, TransferQueue-only, and full asynchronous pipeline configurations.
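Reading scaling efficiency in the usual sense, i.e. achieved throughput gain divided by the increase in device count, the quoted figures can be sanity-checked with a one-line calculation; the absolute throughput values below are placeholders chosen to reproduce the reported ratio, not numbers from the paper.

```python
# Scaling efficiency = (throughput at N devices / throughput at N0 devices) / (N / N0).
def scaling_efficiency(throughput_small, throughput_large, n_small, n_large):
    return (throughput_large / throughput_small) / (n_large / n_small)


# A 16x larger cluster delivering ~14.1x the throughput yields ~0.88 efficiency,
# matching the 7B figure quoted above (throughput values are illustrative).
print(round(scaling_efficiency(1.0, 14.1, 32, 512), 2))   # -> 0.88
```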
5. Data Management and Scheduling
The TransferQueue is central to AsyncFlow's dataflow management:
- Decentralized but Unified: Each RL task maintains a controller for its data transfer needs, interacting with distributed storage units; data is organized as a two-dimensional structure (columns for tasks, rows for global sample indices).
- Atomic Operations: Atomic read/write guarantees are provided for streaming data access, ensuring no redundant reads or data races across concurrent consumers (sketched below).
- Fine-Grained Load Balancing: Controllers broadcast data state and requirements, allowing scheduling algorithms to assign data production/consumption actions optimally and reduce resource stranding.
The producer-consumer interaction, orchestrated by this mechanism, enables pipeline stages to operate at maximal concurrency.
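The atomic-claim behaviour can be sketched with a simple lock-protected set of unclaimed sample indices; the real TransferQueue is distributed and metadata-driven, and `AtomicSampleQueue` and its methods are hypothetical names used only for this example.

```python
# Illustrative sketch of race-free sample claiming by concurrent consumers;
# the class below is a stand-in, not the real TransferQueue.
import threading


class AtomicSampleQueue:
    """Rows are global sample indices; each row is handed to exactly one consumer."""

    def __init__(self, num_samples):
        self._unclaimed = set(range(num_samples))
        self._lock = threading.Lock()

    def claim(self, batch_size):
        """Atomically claim up to batch_size rows that no other consumer holds."""
        with self._lock:
            take = min(batch_size, len(self._unclaimed))
            return [self._unclaimed.pop() for _ in range(take)]


q = AtomicSampleQueue(num_samples=100)
results = {}                          # consumer_id -> rows claimed


def consumer(cid):
    mine = []
    while True:
        rows = q.claim(batch_size=8)
        if not rows:
            break
        mine.extend(rows)
    results[cid] = mine


threads = [threading.Thread(target=consumer, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_rows = [r for rows in results.values() for r in rows]
assert len(all_rows) == len(set(all_rows)) == 100   # each sample consumed exactly once
print("no duplicate and no dropped samples across concurrent consumers")
```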
6. User and Backend Interfaces
AsyncFlow delivers hierarchical, service-oriented interfaces for both algorithm developers and backend integrators:
- Algorithmic Interface (`Trainer`): Unified abstraction for RL algorithm orchestration; APIs include `init_engines`, `put_prompts_data`, `put_experience_data`, `get_experience_data`, and synchronization primitives (a minimal usage sketch follows this list).
- Backend Interface (`Adapter`): Defines the RL task interfaces that backend engines must implement; each adapter can specialize its implementation for its engine, enabling straightforward extension to new systems.
- Workflow Modularity: Algorithms, data pipelines, and engine wiring can be independently configured, tested, and replaced with minimal cross-impact.
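The sketch below wires the `Trainer` method names listed above into a toy orchestration loop. Only the method names come from the interface description; the `StubTrainer`, the argument shapes, and the call order are assumptions made purely for illustration.

```python
# Hypothetical orchestration sketch; signatures and semantics are assumed, not documented API.
class StubTrainer:
    def __init__(self):
        self._experience = []

    def init_engines(self):
        print("engines initialised (stub)")

    def put_prompts_data(self, prompts):
        # In AsyncFlow this would stream prompts into the TransferQueue; here we fake a rollout.
        self._experience = [{"prompt": p, "response": p[::-1], "reward": len(p)} for p in prompts]

    def get_experience_data(self):
        return self._experience

    def put_experience_data(self, experience):
        pass   # downstream consumers (e.g. the critic) would read this from the TransferQueue


def run_post_training(trainer, prompt_batches):
    trainer.init_engines()
    for step, prompts in enumerate(prompt_batches):
        trainer.put_prompts_data(prompts)              # producer side of the stream
        experience = trainer.get_experience_data()     # consumer side (streamed in reality)
        trainer.put_experience_data(experience)        # hand off to the next RL subtask
        mean_reward = sum(e["reward"] for e in experience) / len(experience)
        print(f"step {step}: mean reward = {mean_reward}")


run_post_training(StubTrainer(), [["hello world", "asyncflow"], ["rl post-training"]])
```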
This modular approach supports both rapid academic prototyping (new RL method/engine combinations) and industrial deployment (robust, customizable workflows).
7. Methodological Insights and Future Directions
Key design principles and outcomes from AsyncFlow include:
- Centralized Control Plane, Distributed Data Plane: Balances global coordination with scalable, parallel data access.
- Tolerant Asynchrony: Deferral of parameter updates within a staleness threshold yields substantial throughput gains, with no significant cost to RL performance for typical values (1 step).
- Modular, Decoupled Design: Ensures future extensibility to new engines and customizable workflows in both research and industrial settings.
- Productivity and Reproducibility: Service-oriented APIs and modular layers reduce code complexity, improve usability, and promote reproducibility.
- Scaling Laws: Empirical results suggest that task-separated, asynchronous streaming RL frameworks scale more efficiently than task-colocated designs as cluster and model size increase.
- Future Prospects: Sub-step asynchrony (parameter update below 1-step granularity), tighter autoscaling, and further adaptive scheduling are identified as promising directions.
Summary Table
| Feature | AsyncFlow Implementation | Benefit |
|---|---|---|
| Distributed data pipeline | TransferQueue with control/data plane separation | Maximal concurrency and fine-grained I/O |
| Producer-consumer asynchrony | Pipeline overlap, bounded staleness in parameter updates | Minimized idle/wait time |
| Engine decoupling | Modular Backend Layer and adapters | Support for diverse/custom backends |
| Service-oriented user/backend API | Unified interfaces (`Trainer`, `Adapter`) | Customizability and ease of integration |
| Empirical throughput improvement | 1.59× average, up to 2.03× in large clusters | Higher scalability and cluster efficiency |
| RL task modularity | Independent, finely scheduled RL subtasks via TransferQueue | Dynamic load balancing |
AsyncFlow provides a principled, modular, and empirically validated foundation for highly scalable RL-based post-training of LLMs, with actionable design patterns and interfaces suitable for deployment across diverse hardware and backend environments.