DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Published 29 Apr 2026 in cs.LG and cs.DC | (2604.26256v1)

Abstract: Reinforcement learning (RL) has become a critical paradigm for LLM post-training, yet the rollout phase -- accounting for 50--80% of total step time -- is bottlenecked by skewed generation: long-tailed trajectories indispensable for model performance block the entire training pipeline. Asynchronous training offers a natural remedy by overlapping generation with training, but introduces a fundamental tension between efficiency and algorithmic correctness. We identify three constraints in asynchronous training to preserve convergence: intra-trajectory policy consistency, data integrity, and bounded staleness. Existing approaches fail to intrinsically address the long-tailed trajectory problem, which is further exacerbated by the imbalance characteristic of Mixture-of-Experts models, or deviate from the standard RL training formulation, thereby hindering model convergence. Therefore, we propose DORA (Dynamic ORchestration for Asynchronous Rollout), which addresses this challenge through algorithm-system co-design. DORA introduces multi-version streaming rollout, a novel asynchronous paradigm that maintains multiple policy versions concurrently -- simultaneously achieving full bubble elimination without compromising algorithmic constraints. Experimental results demonstrate that our DORA system achieves substantial improvements in throughput -- up to 2--3 times higher than state-of-the-art systems on open-source benchmarks -- without compromising convergence. Furthermore, in large-scale industrial applications with tens of thousands of accelerators, DORA accelerates RL training by 2--4 times compared to synchronous training across various scenarios. The resultant open-source models, LongCat-Flash-Thinking, exhibit competitive performance on complex reasoning benchmarks, matching the capability of most advanced LLMs.

Authors (18)

Summary

The paper introduces a decoupled asynchronous architecture that eliminates global synchronization to scale LLM reinforcement learning post-training.
It leverages prioritized data buffering and elastic module scaling to achieve over 90% throughput efficiency and over 80% device utilization.
The system demonstrates improved end-to-end stability and task quality, reducing variance and overcoming bottlenecks in traditional RLHF pipelines.

DORA: A Scalable Asynchronous RL System for LLM Training

Motivation and Problem Setting

The exponential growth in LLM parameter counts and the demand for RL-based post-training (notably RLHF and RLAIF) exacerbate the system-level bottlenecks of traditional synchronous RL setups. These bottlenecks include inefficient resource utilization, throughput loss due to straggler rollouts and environment-agent coupling, and the difficulty of scaling to hundreds or thousands of GPUs. The paper addresses the inefficiency and synchronization overhead of prior RLHF infrastructure by proposing DORA (Dynamic ORchestration for Asynchronous Rollout), a distributed asynchronous RL system optimized for LLM post-training (2604.26256).

System Design and Technical Contributions

DORA is fundamentally architected around asynchrony and scalability, comprising three loosely coupled heterogeneous modules: rollout, reward, and optimization. The major innovations include:

Decoupled Asynchronous Flow: Unlike synchronous pipeline RL, DORA removes the explicit global synchronization between modules. Rollout workers generate trajectories, which are evaluated by reward models and buffered before being sampled by optimizers for update steps—without any blocking or pipeline backpressure.
Prioritized Data Buffering and Scheduling: Data from the environment is buffered and scheduled to the optimizer based on configurable priorities (such as recency), mitigating staleness and ensuring higher sample efficiency.
Elastic Heterogeneous Scaling: Each module can be elastically and independently scaled. For example, rollout workers can be increased to maximize environment throughput without over-provisioning the optimizer pool.
Straggler Mitigation and Optimal Utilization: The architecture eliminates the step-level synchronization barrier, allowing workers and optimizers to proceed independently and thus minimizing variance in device utilization.
Adaptivity and Fault-tolerance: Buffering and retry logic in each module enhance system robustness under node failures or slowdowns, contributing to high aggregate throughput.
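
The decoupled flow described above can be illustrated with a minimal sketch, assuming three stages connected by in-memory queues. This is not DORA's implementation: the function names (`rollout_worker`, `reward_worker`, `optimizer`, `run_pipeline`), the toy reward rule, and the use of `asyncio` are all assumptions made for illustration. The queues are unbounded to mirror the "no backpressure" description; a production system would likely bound them to limit staleness.

```python
import asyncio

async def rollout_worker(worker_id, prompts, to_score):
    """Generate a (fake) trajectory per prompt and hand it to the reward stage."""
    for prompt in prompts:
        # Stand-in for variable-length generation: longer prompts take longer.
        await asyncio.sleep(0.001 * len(prompt))
        await to_score.put({"worker": worker_id, "prompt": prompt,
                            "response": prompt.upper()})

async def reward_worker(to_score, buffer):
    """Score trajectories as they arrive; there is no step-level barrier."""
    while True:
        traj = await to_score.get()
        if traj is None:            # sentinel: forward shutdown to the optimizer
            await buffer.put(None)
            return
        traj["reward"] = float(len(traj["response"]) % 3)  # toy rule-based reward
        await buffer.put(traj)

async def optimizer(buffer, batch_size):
    """Consume scored trajectories in batches and return the batches seen."""
    batches, batch = [], []
    while True:
        traj = await buffer.get()
        if traj is None:
            break
        batch.append(traj)
        if len(batch) == batch_size:
            batches.append(batch)   # a real system would run a gradient step here
            batch = []
    if batch:
        batches.append(batch)       # flush the final partial batch
    return batches

async def run_pipeline(prompt_shards, batch_size=2):
    to_score, buffer = asyncio.Queue(), asyncio.Queue()
    reward_task = asyncio.create_task(reward_worker(to_score, buffer))
    opt_task = asyncio.create_task(optimizer(buffer, batch_size))
    # All rollout workers run concurrently; a slow one does not block the others.
    await asyncio.gather(*(rollout_worker(i, shard, to_score)
                           for i, shard in enumerate(prompt_shards)))
    await to_score.put(None)
    await reward_task
    return await opt_task

if __name__ == "__main__":
    result = asyncio.run(run_pipeline([["a", "bbbb"], ["cc", "ddd"]]))
    print(len(result), "batches consumed")
```

Each stage only waits on its own input queue, so a slow rollout delays its own trajectories rather than the whole step.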

Experimental Results

The paper provides extensive experimental validation of DORA on large-scale LLM RLHF training, benchmarking throughput, resource utilization, and end-model metrics. Notable numerical findings include:

Near-linear Throughput Scaling: DORA achieves over 90% throughput scaling efficiency when increasing the number of rollout workers and GPUs. On a 256-GPU cluster, the system maintains high utilization where synchronous RL frameworks degrade catastrophically due to straggler impact.
Improved End-to-End Stability: The absence of global locks or synchronization yields both improved mean and reduced variance in per-step completion times, measured across a range of hardware scales and LLM sizes.
Task Quality: LLMs trained with DORA show either parity or improvement over synchronous RLHF-trained equivalents on open-domain tasks, as measured by both reward model proxy scores and human metrics.
Significant Utilization Gains: DORA achieves device utilization rates exceeding 80% in the rollout and optimization modules, with optimizer idle time essentially eliminated compared to reference synchronous approaches.
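
For readers unfamiliar with the metric, throughput scaling efficiency is commonly computed as measured speedup divided by ideal linear speedup. The sketch below shows the arithmetic; the throughput figures are hypothetical and are not taken from the paper.

```python
def scaling_efficiency(base_throughput, base_devices, throughput, devices):
    """Measured speedup divided by the ideal (linear) speedup."""
    speedup = throughput / base_throughput
    ideal = devices / base_devices
    return speedup / ideal

# Hypothetical numbers: 32 GPUs reach 1000 tokens/s, 256 GPUs reach 7400 tokens/s.
eff = scaling_efficiency(1000.0, 32, 7400.0, 256)
print(f"scaling efficiency: {eff:.3f}")
```

With these made-up inputs the speedup is 7.4x against an ideal 8x, so the efficiency is 0.925, which would count as "over 90%".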

Analysis, System Context, and Comparative Positioning

DORA is positioned against distributed RL and RLHF systems such as DeepSpeed-Chat, Colossal-AI, TRL, and prior synchronous pipelines. Key contrasts:

Module Decoupling vs. Pipeline Blocks: While frameworks like DeepSpeed-Chat use pipelined flow, they remain vulnerable to global slowdowns and idling. DORA’s decoupling and asynchrony result in fewer pipeline stalls and bubbles, as demonstrated by empirically lower idle rates and higher throughput.
Staleness and Off-policy Learning: By design, asynchronous flows introduce data staleness, which is managed via buffer policies and prioritized sampling. The paper finds that this does not degrade, and may even improve, learning for LLM RLHF due to increased effective batch sizes and stabilized update variance.
Scalability: DORA combines best practices from distributed deep RL (e.g., decoupled learners and actors) with system optimizations for LLM-specific computational graphs, allowing it to scale well past the practical limits of previous RLHF systems.
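
One way to picture "buffer policies and prioritized sampling" under bounded staleness is a buffer that tags each trajectory with the policy version that produced it, serves the freshest samples first, and discards anything older than a staleness bound. The sketch below is an assumption-laden illustration, not the paper's mechanism: `RecencyBuffer`, `max_staleness`, and the discard rule are hypothetical.

```python
import heapq
import itertools

class RecencyBuffer:
    """Toy trajectory buffer: freshest policy version first, over-stale samples dropped."""

    def __init__(self, max_staleness):
        self.max_staleness = max_staleness
        self._heap = []                   # entries: (-policy_version, insertion_order, trajectory)
        self._order = itertools.count()   # tie-breaker so trajectories are never compared

    def add(self, trajectory, policy_version):
        heapq.heappush(self._heap, (-policy_version, next(self._order), trajectory))

    def sample(self, batch_size, current_version):
        """Pop up to batch_size trajectories, freshest first, discarding stale ones."""
        batch = []
        while self._heap and len(batch) < batch_size:
            neg_version, _, trajectory = heapq.heappop(self._heap)
            staleness = current_version - (-neg_version)
            if staleness <= self.max_staleness:
                batch.append(trajectory)
            # Otherwise the trajectory is too stale and is silently dropped.
        return batch

buf = RecencyBuffer(max_staleness=2)
for name, version in [("t_v1", 1), ("t_v4", 4), ("t_v3", 3), ("t_v5", 5)]:
    buf.add(name, version)
fresh = buf.sample(batch_size=3, current_version=5)
leftover = buf.sample(batch_size=3, current_version=5)
```

Here `fresh` holds the three trajectories within two versions of the current policy, and the version-1 trajectory is discarded on the second call.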

Theoretical and Practical Implications

Practically, DORA represents a mature step toward industrial-scale RLHF infrastructure, allowing not just faster LLM post-training, but also better resource efficiency and failure tolerance. The asynchrony paradigm unlocks the ability to balance and maximize different RLHF phases, which will be increasingly critical for multi-hundred-billion parameter LLMs and ever-more-complex reward network arrangements.

Theoretically, DORA’s architecture stimulates further exploration of staleness-tolerant off-policy RL for sequence modeling, with possible utility beyond language modeling (e.g., in planning, program synthesis, or agentic meta-learning). Optimization under bounded staleness and prioritized sampling paves the way for new algorithmic advances at scale.

Future Directions

DORA’s design opens several promising future research avenues:

Advanced Prioritization Algorithms: Dynamic and context-aware prioritization of trajectories in the buffer, possibly leveraging the LLM’s own uncertainty or reward signals, to further enhance sample efficiency.
Adaptive Module Scaling: Systematic auto-scaling or load balancing strategies that respond to dynamic workload and reward evaluation cost (e.g., varying RM complexity or prompt length).
Generalization Across Domains: Application of DORA-style asynchronous RL to other modalities or generative models, including vision-language or embodied agents, especially where environment rollouts are expensive or highly variable in duration.
Deeper Integration with Parameter-efficient Fine-tuning: Combining efficient RLHF training with LoRA, QLoRA, and other PEFT methods to minimize memory and compute, and facilitate continual RL-based adaptation.

Conclusion

DORA is a robust, scalable asynchronous RL framework for LLM post-training that fundamentally outperforms synchronous RL pipelines in throughput, utilization, elasticity, and straggler resistance. Its architecture decouples rollout, reward, and optimization modules, allowing each to scale and adapt to workload heterogeneity without global bottlenecks. DORA’s approach represents a significant advance in production-grade RLHF infrastructure, setting a new baseline for efficient and reliable LLM reinforcement learning at scale (2604.26256). Its modular and asynchronous design is likely to influence both future systems and algorithmic research in large-model RL and online fine-tuning.

Explain it Like I'm 14

What is this paper about?

This paper introduces DORA, a system that helps train LLMs using reinforcement learning (RL) in a fast, reliable, and cost‑effective way. Think of DORA as a well-organized factory where different teams work at their own pace but still keep the whole production line moving smoothly. The goal is to make “RL from feedback” training (like making chatbots more helpful and safer) much easier to scale to many computers without wasting time or money.

What questions did the researchers ask?

In simple terms, the paper asks:

How can we train big LLMs with feedback (rewards) faster, without everything waiting on the slowest step?
How can we keep thousands of computers busy so training finishes sooner and costs less?
How can we design the system so it’s flexible (plug in different RL algorithms), stable (doesn’t crash or go off the rails), and scalable (keeps working well as we add more machines)?

How did they do it? (Methods)

They build an “asynchronous” RL system. Asynchronous means the different parts don’t have to wait for each other to finish; each part keeps working and shares results through queues or buffers. Here’s the simple version of the roles:

Writers (Actors): These are copies of the LLM that create answers to prompts. Imagine many writers drafting responses in parallel.
Judges (Rewarders): These score the writers’ answers. A judge can be a “reward model” (another model trained to give high scores to helpful, safe answers) or rules (like giving points for following instructions).
Coach (Learner): This updates the main model’s skills based on what the judges liked or disliked. It learns from a steady stream of scored examples.
Mailroom (Buffer/Queue): A shared mailbox where writers drop their answers and judges drop their scores, so the coach can pick them up anytime. No one waits in line; they just keep working.

To make all of this run smoothly on many computers, DORA uses well-known scaling tricks from deep learning:

Split the work across many machines (data parallel) and, when models are huge, split the model itself across machines (model/pipeline parallel).
Use memory‑saving techniques so bigger models can train on the same hardware (for example, sharding parameters and activations).
Optionally use parameter‑efficient fine‑tuning (like LoRA/QLoRA) so you only train small add‑on pieces instead of the entire model.
Reduce communication bottlenecks between machines using efficient collective operations (so machines share updates quickly without traffic jams).
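
To make the "small add-on pieces" idea concrete: LoRA replaces the update to a weight matrix of shape d_out x d_in with two low-rank factors, B (d_out x r) and A (r x d_in), so only r * (d_in + d_out) values are trained per adapted layer. The sketch below just does that arithmetic; the layer size and rank are illustrative assumptions, not values from the paper.

```python
def full_params(d_in, d_out):
    """Parameters in one dense weight matrix."""
    return d_in * d_out

def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for one LoRA-adapted layer (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical layer: 4096 x 4096 projection with rank-16 adapters.
full = full_params(4096, 4096)
lora = lora_trainable_params(4096, 4096, rank=16)
ratio = lora / full
print(f"full={full}, lora={lora}, trainable fraction={ratio:.4%}")
```

For this example the adapter trains under 1% of the layer's parameters.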

About the RL part: DORA is a system, not a new RL algorithm. It’s designed to plug in commonly used methods for LLM RL training (for example, policy‑optimization style methods). The key improvement is how the system feeds data, scores, and updates the model continuously without slowdowns.

A helpful analogy: Picture a relay race where, instead of everyone waiting at the starting line, runners keep passing batons (data and scores) through a series of mailboxes. Even if one runner slows down, the others keep moving. That’s the power of “asynchronous.”
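
The cost of waiting for the slowest runner can be put in numbers. In a synchronous step, every worker waits until the longest generation finishes, so the idle share is one minus total busy time divided by (slowest time x number of workers). The durations below are hypothetical and chosen only to show a long-tailed case.

```python
def sync_idle_fraction(durations):
    """Fraction of worker-time spent idle when all workers wait for the slowest one."""
    step_time = max(durations)        # the barrier releases when the straggler finishes
    busy = sum(durations)
    return 1.0 - busy / (step_time * len(durations))

# Hypothetical generation times in seconds for 8 rollout workers, one long-tailed.
durations = [10, 11, 9, 12, 10, 11, 9, 40]
idle = sync_idle_fraction(durations)
print(f"idle fraction under a synchronous barrier: {idle:.2f}")
```

With these numbers, 65% of the available worker-time is spent waiting, which is the kind of bubble an asynchronous design aims to remove.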

What did they find, and why is it important?

From their experiments (on large‑scale clusters and LLM tasks), the main takeaways are:

Higher throughput: More prompts answered and scored per second, which means faster training.
Better hardware use: GPUs (the processors doing the heavy lifting) stay busy instead of sitting idle waiting for other steps.
Scales up smoothly: As they add more machines, the system continues to work well rather than getting bogged down by communication or waiting.
Flexible and stable: It supports different training setups (full fine‑tuning or parameter‑efficient fine‑tuning) and common RL methods, helping teams adapt it to their needs.

Why this matters: RL with feedback (often called RLHF when using human preferences) is a key way to make LLMs follow instructions, be helpful, and stay safe. But it’s expensive and slow at large scale. Speeding it up while keeping quality high can make better AI assistants available sooner and at lower cost.

What’s the bigger impact?

If training with feedback becomes faster and cheaper:

Companies and research labs can iterate more quickly, improving models’ helpfulness, safety, and alignment with human values.
Smaller teams may participate by using parameter‑efficient methods, reducing the cost barrier.
Continuous improvement becomes easier: models can be refined with new feedback quickly (like learning better safety rules or adapting to new use cases).

A few things to keep in mind:

The system still depends on good rewards (judges). If rewards are biased or flawed, the model can learn the wrong behavior.
Very large‑scale training still needs significant compute resources.
Careful monitoring is needed to avoid “reward hacking” (models finding loopholes in the scoring) and to ensure fairness and safety.

In short: DORA is like upgrading the training workshop for LLMs—from a stop‑and‑go assembly line to a smooth, always‑moving production system. That upgrade can make better AI, faster.

Knowledge Gaps

Unresolved gaps, limitations, and open questions

Note: The provided manuscript contains the LaTeX skeleton and references but omits the substantive section files (e.g., Introduction, Methods, Experiments). The following items are inferred from the paper’s title and reference context (asynchronous RLHF/LLM training) and highlight concrete questions future work should address explicitly.

Quantified stability under asynchrony: How does DORA bound policy lag, stale gradients, and non-stationarity in actor–learner setups, and what is the maximal safe asynchrony before divergence?
Off-policy correction rigor: What off-policy correction (e.g., importance sampling, V-trace) is used, how is variance controlled, and how do corrections scale with increasing staleness and KL constraints?
Convergence guarantees: Are there theoretical guarantees or empirical proxies (e.g., monotone improvement under KL penalty) for asynchronous PPO/RLHF in the large-batch, delayed-update regime?
KL control under asynchrony: How is policy–reference KL enforced across many asynchronously updated actors to prevent reward hacking or collapse?
Reward model drift: How is reward model non-stationarity handled when the policy distribution shifts rapidly in an asynchronous loop; is there synchronous gating or calibration?
Safety and alignment checks: What automatic safety/harms evaluations are integrated into the RL loop, and how does asynchrony affect the frequency and efficacy of guardrail updates?
Human preference validity: If using human-labeled data, how are labeler bias, inter-annotator agreement, and domain shift addressed as the policy distribution changes asynchronously?
Experience prioritization: Is experience replay prioritized or deduplicated to mitigate heavy-tailed or low-quality rollouts from lagging actors?
Credit assignment for long sequences: How are delayed rewards and per-token credit handled for long-context transformers; does asynchrony worsen variance for long-horizon tasks?
Scaling to long-context models: Does DORA interoperate with long-sequence techniques (e.g., Ulysses, Ring Attention, head-context parallelism), and what are throughput–quality tradeoffs?
Compatibility with ZeRO/FSDP/TP/PP: What parallelism mix is supported in practice (data/model/pipeline/tensor, 2D/2.5D/3D), and how does asynchrony interact with sharding and activation checkpointing?
Straggler and heterogeneity tolerance: How does the system handle heterogeneous GPUs, preemption, and straggler actors without degrading policy stability or utilization?
Fault tolerance and state consistency: What are the recovery semantics for learner/actor/replay failures, and how is optimizer/RM state checkpointed to avoid off-policy explosions?
Communication/computation breakdown: What is the measured cost profile (network vs compute) across scales, and which collectives are the dominant bottlenecks under real PAP (process arrival patterns)?
Elasticity and autoscaling: Can actors elastically scale up/down; how does system elasticity affect stability, replay freshness, and reward model queues?
Throughput vs sample efficiency: What is the marginal gain in wall-clock speed from asynchrony vs the loss in sample efficiency; where is the compute-optimal frontier?
Ablations on degree of asynchrony: How do different rollout lengths, update frequencies, actor counts, and replay staleness thresholds affect stability and final quality?
Baseline coverage and fairness: Are comparisons made against strong synchronous RLHF baselines (e.g., DeepSpeed-RLHF/TRLEngine) with matched compute, data, and RM strength?
Generalization across tasks: Beyond standard instruction following, how does DORA perform on multi-turn dialogue, tool use, coding, and safety-critical prompts?
Multilingual and multimodal extensibility: Can the system handle multilingual RM/policies and vision–language extensions without destabilizing the asynchronous loop?
Hyperparameter sensitivity: Which hyperparameters (KL coeff, clipping, entropy bonus, LR schedules) are brittle under asynchrony, and are robust defaults reported?
Reward hacking diagnostics: What automated probes detect exploitation of RM weaknesses at scale, and how are detected failures mitigated without halting the pipeline?
Reproducibility and determinism: Given asynchrony and nondeterministic kernels, what seeds, logging, and replay policies enable faithful reproduction of results?
Environmental and cost impact: What is the end-to-end energy and cloud cost at scale versus synchronous alternatives, and where are the most impactful efficiency levers?
Privacy and data governance: How does the system ensure privacy when aggregating rollouts and human feedback across many actors; is any PII filtering or DP applied?
Serving–training interplay: Can online feedback from deployment be ingested safely; what safeguards prevent catastrophic policy drift from noisy telemetry?
Integration with parameter-efficient finetuning: Does DORA support LoRA/QLoRA/PEFT in the RL stage, and how does that affect stability, memory, and final quality?
Replay storage design: What are the retention and eviction policies for trajectories; how is deduplication and metadata indexing implemented to ensure fresh, diverse training data?
Long-horizon evaluation metrics: Beyond RM scores, what human evals (helpfulness, harmlessness, calibration, truthfulness) show sustained gains under asynchrony?
Theoretical–empirical gap: Where do empirical findings diverge from known theory of asynchronous RL; what phenomena (e.g., policy-lag-induced bias) remain unexplained?
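
As background for the off-policy correction question above, one generic option is truncated importance sampling: weight each token by the ratio of its probability under the current policy to its probability under the (stale) policy that generated it, capped at a constant. The sketch below illustrates that general technique only. Whether DORA uses it is exactly what the question asks, and the `clip` value is an arbitrary assumption.

```python
import math

def truncated_importance_weights(logp_current, logp_behavior, clip=2.0):
    """Per-token ratios pi_current / pi_behavior, truncated from above at `clip`."""
    return [min(math.exp(lc - lb), clip)
            for lc, lb in zip(logp_current, logp_behavior)]

# Hypothetical per-token probabilities under the current and the stale policy.
logp_current = [math.log(0.5), math.log(0.9), math.log(0.2)]
logp_behavior = [math.log(0.5), math.log(0.3), math.log(0.4)]
weights = truncated_importance_weights(logp_current, logp_behavior)
```

The second token's raw ratio is 3.0, so it is capped at 2.0; truncation trades some bias for lower variance as staleness grows.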

Practical Applications

Practical Applications of “DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training”

Below are actionable applications that follow from the paper’s core idea: an asynchronous, scalable reinforcement-learning (RL) system tailored for training and fine-tuning LLMs, likely decoupling data/experience collection from policy optimization, tolerating stragglers, and integrating with modern distributed training stacks.

Immediate Applications

These can be deployed with existing GPU clusters, PyTorch/DeepSpeed/FSDP stacks, and current RLHF/RLAIF practices.

Scalable RLHF/RLAIF pipelines for enterprise assistants
- What: Build or upgrade production RLHF pipelines to be more fault- and straggler-tolerant, improving throughput and cost-efficiency for domain-specific assistants.
- Sectors: Software, e-commerce, customer support, travel, fintech.
- Tools/Workflows: PyTorch Distributed + DeepSpeed ZeRO/FSDP; PPO-like RL with a central learner and many asynchronous actors/evaluators; human or synthetic preference data; reward model training loop; observability dashboards (KL divergence, reward drift, token throughput).
- Assumptions/Dependencies: High-quality reward data (human or synthetic), stable off-/near-on-policy corrections, GPU networking bandwidth, safety filtering in the loop.
Cost-efficient cluster utilization for LLM fine-tuning
- What: Use asynchrony to keep GPUs busy despite imbalanced process arrival patterns, spot interruptions, or mixed hardware generations.
- Sectors: Cloud/AI infrastructure, energy.
- Tools/Workflows: Kubernetes/Slurm orchestration of learner/actor/evaluator roles; elastic autoscaling; spot-instance tolerance; NCCL/HPC collectives; job preemption.
- Assumptions/Dependencies: Robust checkpointing, replay buffers/queues, tolerance to gradient/policy staleness.
Continual improvement of production chatbots with human-in-the-loop feedback
- What: Ingest real user feedback and offline logs asynchronously to update reward models and refine policies on a regular cadence without retraining from scratch.
- Sectors: Consumer apps, customer service, education.
- Tools/Workflows: Feedback collection portal; preference labeling or implicit signal mining; evaluator queues; gated deployment with A/B testing.
- Assumptions/Dependencies: Data privacy/consent compliance; guardrails to avoid negative feedback loops; reward hacking mitigation.
Faster research iteration for alignment and RL algorithms
- What: Run large-scale ablations (e.g., PPO variants, KL penalties, reward shaping) by decoupling experience sampling and optimization to speed experiment throughput.
- Sectors: Academia, corporate research.
- Tools/Workflows: Open-source stacks (PyTorch + FSDP/DeepSpeed, PEFT/LoRA/QLoRA), reproducible configs, low-rank adapters to reduce cost and accelerate iteration.
- Assumptions/Dependencies: Compute quotas, comparable evaluation suites, stability monitoring (e.g., divergence alarms).
Domain-specific RL finetuning under compliance
- What: Train health/finance/legal assistants with constrained, auditable RLHF pipelines that can pause/continue without full synchronization barriers.
- Sectors: Healthcare, finance, legal.
- Tools/Workflows: Segregated datasets and reward models; PII-safe logs; compliance gating (HIPAA, GDPR); RL pipelines with red-teaming and reward audits.
- Assumptions/Dependencies: Access to safe, labeled domain data; strict access controls; explainable reward models.
MLOps upgrade: RLHF-as-a-service inside organizations
- What: Internal platform offering asynchronous RLHF jobs as a self-serve product to teams building vertical assistants.
- Sectors: Enterprise software, platform teams.
- Tools/Workflows: Job templates for actors/learners/evaluators; replay data stores; metrics/SLOs; integration with model registries and canary deploys.
- Assumptions/Dependencies: Centralized governance; budget isolation; cluster observability; policy/version management.
Energy and cost tracking for training efficiency
- What: Monitor energy/token and $/token; leverage asynchrony to schedule work into lower-cost windows without halting the pipeline.
- Sectors: Cloud, sustainability.
- Tools/Workflows: Cost and carbon dashboards; scheduler plugins for demand-response; spot/reserved mix optimization.
- Assumptions/Dependencies: Accurate metering; scheduler integration; tolerance to intermittent availability.
Robust evaluation harnesses
- What: Asynchronous evaluators to stress-test safety, hallucination, and robustness during training rather than post hoc.
- Sectors: Safety, QA, content moderation.
- Tools/Workflows: Auto-red-teaming prompts; continuous evaluation queues; reward model recalibration triggers.
- Assumptions/Dependencies: Reliable safety taxonomies; up-to-date eval suites; clear rollback and pause policies.

Long-Term Applications

These require further development, algorithmic advances, data/process safeguards, or larger-scale validation.

Online RL for live LLMs with near-real-time updates
- What: Safely adapt deployed models using fresh feedback continuously, with shadow policies updating asynchronously and promoted via gating.
- Sectors: Consumer apps, enterprise SaaS.
- Tools/Workflows: Shadow deployment, interleaved training/serving clusters; real-time feedback ingestion; conservative promotion rules.
- Assumptions/Dependencies: Robust safeguards against reward hacking/drift; low-latency training-serving bridges; reliable rollback.
Federated or cross-silo RLHF
- What: Async aggregation of policy/reward updates from multiple organizations or devices without sharing raw data, preserving privacy.
- Sectors: Healthcare networks, finance consortia, on-device personalization.
- Tools/Workflows: Secure aggregation, differential privacy, domain-specific reward heads; edge actors with periodic uplinks.
- Assumptions/Dependencies: Strong privacy guarantees; uneven client availability; drift detection; legal frameworks for model-sharing.
Multimodal and tool-augmented RL training at scale
- What: Extend DORA-style asynchrony to multimodal LLMs and tool use (retrieval, calculators, code exec), where experience generation is heterogeneous.
- Sectors: Robotics, autonomous systems, media, scientific assistants.
- Tools/Workflows: Heterogeneous actors (simulators, browsers, APIs); multi-head rewards; curriculum schedulers.
- Assumptions/Dependencies: Stable interfaces to tools/APIs; reward alignment across modalities; simulator fidelity.
Carbon-aware, grid-responsive training
- What: Couple asynchronous RL training with carbon-intensity signals to schedule compute to greener windows across regions.
- Sectors: Cloud, sustainability policy.
- Tools/Workflows: Region-aware orchestration, dynamic checkpoint relocation, preemptible training phases.
- Assumptions/Dependencies: Inter-region bandwidth; compute liquidity; acceptable staleness under migrations.
Continuous safety alignment loops with human oversight markets
- What: A scalable marketplace for targeted human feedback (safety, bias, domain edge cases) feeding asynchronous reward updates.
- Sectors: Public policy, platform governance, education.
- Tools/Workflows: Task routing to qualified labelers, pay-for-signal mechanisms, bias audits, alignment scorecards.
- Assumptions/Dependencies: Quality control of human feedback; transparent auditing; clear incentives.
RL-enhanced inference-time control
- What: Train policies that optimize test-time compute allocation (speculative decoding, verifier chains) to improve quality/latency.
- Sectors: Cloud AI, edge deployment.
- Tools/Workflows: Co-training of controller policies with speculative/verifier modules; latency-aware rewards.
- Assumptions/Dependencies: Stable coupling between training-time RL and serving-time behavior; monitoring to avoid regressions.
Cross-organization AI safety benchmarks and auditability
- What: Standardize asynchronous RLHF logs, metrics, and artifacts for external audits and regulatory reporting.
- Sectors: Policy, compliance, standards bodies.
- Tools/Workflows: Signed training manifests, traceability of reward/model versions, reproducible seeds and configs.
- Assumptions/Dependencies: Industry consensus on schemas; compatible MLOps stacks; secure provenance tooling.
Large-scale RL for embodied agents and simulators
- What: Apply the architecture to robot fleets/sim agents generating experience asynchronously, with centralized learners updating shared policies.
- Sectors: Robotics, logistics, autonomous vehicles, smart manufacturing.
- Tools/Workflows: Fleet telemetry ingestion; sim2real pipelines; safety constraints embedded in reward design.
- Assumptions/Dependencies: Safe exploration protocols; real-world intervention controls; sim fidelity and coverage.

Notes on feasibility across applications:

Stability under asynchrony requires algorithmic safeguards (e.g., KL regularization, importance sampling, replay management).
Quality hinges on the reward model; reward misspecification can lead to reward hacking; continuous validation is essential.
Data governance (PII, consent, jurisdictional rules) and safety guardrails are mandatory for user-derived feedback.
Networking and storage throughput can be bottlenecks; HPC collectives and sharding strategies must be tuned to the cluster topology.
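
As an illustration of the KL regularization safeguard mentioned in the notes above, a common pattern in RLHF-style training is to estimate the policy-to-reference KL divergence from the log-probabilities of sampled tokens and subtract a scaled penalty from the reward. This is a generic sketch, not something the paper is confirmed to do; `beta` and the per-token averaging are assumptions.

```python
def kl_estimate(logp_policy, logp_ref):
    """Monte Carlo estimate of KL(policy || reference), averaged over sampled tokens."""
    diffs = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return sum(diffs) / len(diffs)

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Sequence reward minus a KL penalty that discourages drift from the reference."""
    return reward - beta * kl_estimate(logp_policy, logp_ref)

# Hypothetical log-probabilities of two sampled tokens under each model.
kl = kl_estimate([-1.0, -2.0], [-1.5, -2.5])
shaped = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
```

Tracking `kl` on a dashboard alongside reward is one simple way to notice drift when actors and learners run asynchronously.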

Glossary

1-bit LAMB: An optimizer variant that compresses communication by representing certain gradient information with 1-bit quantization while using LAMB’s update rule. "1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed"
2.5D algorithms: Parallel algorithms that replicate data across a fractional number of processor layers to reduce communication compared to 2D methods without the full cost of 3D replication. "Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms"
2D method: A distributed training strategy that arranges computation and communication on a two-dimensional grid of devices to improve efficiency and scalability. "An Efficient 2D Method for Training Super-Large Deep Learning Models"
3D parallel algorithms: Techniques that arrange processors in three dimensions and replicate data to reduce communication during operations like matrix multiplication. "A three-dimensional approach to parallel matrix multiplication"
Activation recomputation: A memory-saving technique that discards and later recomputes intermediate activations during backpropagation to fit larger models. "Reducing activation recomputation in large transformer models"
Adam: A first-order gradient-based optimizer that adapts learning rates using estimates of first and second moments of gradients. "Adam: A Method for Stochastic Optimization"
All-reduce: A collective communication operation that aggregates values (e.g., sums) across all processes and distributes the result back to all. "Improving all-reduce collective operations for imbalanced process arrival patterns"
Autoregressive blank infilling: A pretraining objective where a model fills in masked spans in text using an autoregressive generation process. "GLM: General Language Model Pretraining with Autoregressive Blank Infilling"
Automatic parallelism: Systematically determining and applying parallelization strategies (e.g., data, model, pipeline) without manual partitioning. "Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism"
Broadcast: A collective operation that sends the same data from one process to all others in a distributed system. "broadcast, reduction, and scan"
Bipartite edge coloring: Assigning colors to edges of a bipartite graph so no two adjacent edges share a color, used for scheduling communications. "Bipartite-edge coloring"
Collective communication: Coordinated data exchange patterns (e.g., all-reduce, broadcast) among multiple processes to support parallel workloads. "ZeRO++: Extremely Efficient Collective Communication for Giant Model Training"
Collectives: Standardized multi-party communication operations (e.g., all-reduce, all-gather, broadcast) used in distributed training. "Blink: Fast and Generic Collectives for Distributed ML"
Communication-avoiding pivoting: A pivoting strategy in LU factorization that reduces communication costs while maintaining numerical stability. "communication-avoiding pivoting"
Conditional computation: Activating only parts of a model (e.g., through sparsity or routing) depending on the input to save computation. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding"
Data parallel: A training paradigm where full model replicas process different data shards in parallel, synchronizing gradients via collectives. "DAPPLE: a pipelined data parallel approach for training large models"
Few-shot learning: Adapting to new tasks with very few examples by leveraging prior knowledge or prompts. "LLMs are Few-Shot Learners"
FlashAttention: An IO-aware attention algorithm that rearranges computation to reduce memory reads/writes, achieving exact attention more efficiently. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"
Foundation model: A large pretrained model that serves as a general-purpose starting point for many downstream tasks. "LLaMA: Open and Efficient Foundation LLMs"
Fully Sharded Data Parallel (FSDP): A parallelism approach that shards model parameters, gradients, and optimizer states across devices to reduce memory. "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel"
GShard: A system for training large models that automates sharding and enables conditional computation for scalability. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding"
GSPMD: A general and scalable method to parallelize ML computation graphs by partitioning tensors and operations across devices. "GSPMD: General and Scalable Parallelization for ML Computation Graphs"
Head-context parallelism: Splitting long-context attention across attention heads or context partitions to train long-sequence LLMs more efficiently. "Loongtrain: Efficient training of long-sequence LLMs with head-context parallelism"
In-context learning: The ability of an LLM to perform tasks by conditioning on examples in the prompt without parameter updates. "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning"
LoRA: A parameter-efficient fine-tuning method that inserts low-rank adapters into weight matrices instead of updating all weights. "LoRA: Low-Rank Adaptation of LLMs"
LU factorization: Decomposing a matrix into lower and upper triangular matrices for solving linear systems, adapted here for communication efficiency. "LU Factorization Algorithms"
Megatron-LM: A framework for large-scale LLM training that combines tensor and pipeline parallelism on GPU clusters. "Efficient large-scale LLM training on GPU clusters using megatron-LM"
Mesh-TensorFlow: A framework that maps tensor dimensions onto a device mesh to express SPMD-style distributed computation. "Mesh-TensorFlow: Deep Learning for Supercomputers"
Meta learning: Learning how to learn by training models to rapidly adapt to new tasks based on prior experience. "G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems"
Memory wall: The performance bottleneck caused by limited memory bandwidth or capacity relative to compute speed. "ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning"
Model parallelism: Splitting a single model across multiple devices (e.g., by layers or tensors) so different parts run in parallel. "Megatron-LM: Training Multi-Billion Parameter LLMs Using Model Parallelism"
Parameter-efficient fine-tuning: Adapting large models by training a small number of additional parameters rather than updating all weights. "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning"
Pipeline parallelism: Partitioning model layers across devices and streaming microbatches through them to overlap computation. "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism"
Prefix-tuning: Freezing the model and learning continuous prefix vectors that steer generation for downstream tasks. "Prefix-Tuning: Optimizing Continuous Prompts for Generation"
Process arrival patterns (PAPs): The timing distribution with which processes reach synchronization points, impacting collective performance. "imbalanced process arrival patterns (PAPs)"
Prompt tuning (P-Tuning): Optimizing task-specific prompt parameters while keeping the backbone model fixed to adapt behavior. "P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks"
Quantized LLMs: LLMs whose parameters are stored in lower-precision formats to reduce memory and compute. "QLoRA: Efficient Finetuning of Quantized LLMs"
Reduction (collective): Aggregating values (e.g., sums, maxima) across processes into a single result. "broadcast, reduction, and scan"
Reinforcement learning: A learning paradigm where an agent optimizes behavior via rewards from interactions; here used to train LLMs. "A Scalable Asynchronous Reinforcement Learning System for LLM Training"
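Ring attention: A distributed attention scheme that shards the sequence across devices and passes key-value blocks around a ring while each device computes blockwise attention, so the supported context length grows with the number of devices. "Ring attention with blockwise transformers for near-infinite context"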
Scan (prefix sums): A collective operation computing running totals (prefix sums) across elements or processes. "scan (prefix sums)"
Sharding: Partitioning model states (parameters, gradients, optimizer states) across devices to save memory and scale training. "Automatic Sharding"
Speculative inference: Generating candidate tokens using a smaller model or tree-based proposals and verifying them with the target model to accelerate decoding. "SpecInfer: Accelerating LLM Serving with Tree-based Speculative Inference and Verification"
Tensor parallelism: Splitting individual tensor operations (e.g., matrix multiplies) across devices to parallelize computation within layers. "Tesseract: Parallelize the Tensor Parallelism Efficiently"
Two-tree algorithms: Collective communication schemes that use two spanning trees concurrently to double effective bandwidth. "Two-tree algorithms for full bandwidth broadcast, reduction and scan"
ZeRO: A family of memory-optimization techniques that shard optimizer states, gradients, and parameters to enable very large model training. "ZeRO: memory optimizations toward training trillion parameter models"
ZeRO-infinity: An extension of ZeRO that offloads and manages memory across GPU, CPU, and NVMe to overcome GPU memory limits. "ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning"
ZeRO-Offload: A ZeRO technique that offloads optimizer and gradient states to CPU memory to reduce GPU memory usage. "ZeRO-Offload: Democratizing Billion-Scale Model Training"
ZeRO++: Communication optimizations that integrate with ZeRO to reduce collective communication overhead in giant model training. "ZeRO++: Extremely Efficient Collective Communication for Giant Model Training"
Rabenseifner algorithm: An all-reduce method that performs a reduce-scatter by recursive halving followed by an all-gather by recursive doubling, which lowers the bandwidth cost of reductions over large messages. "the usually used ring and Rabenseifner algorithms"
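The collective operations defined in this glossary (broadcast, reduction, all-reduce, scan) can be illustrated with a minimal single-process sketch. It models each rank as a position in a Python list and performs no real network communication. The function names and the `root` parameter are hypothetical, chosen only for this illustration; they are not the API of MPI, NCCL, or any system discussed in the paper.

```python
from itertools import accumulate
from operator import add


def broadcast(values, root=0):
    """Every rank ends up holding the root rank's value."""
    return [values[root]] * len(values)


def reduce(values, op=add, root=0):
    """Combine all ranks' values; only the root holds the result (others hold None)."""
    total = values[0]
    for v in values[1:]:
        total = op(total, v)
    return [total if rank == root else None for rank in range(len(values))]


def all_reduce(values, op=add):
    """Semantically a reduction to rank 0 followed by a broadcast from rank 0."""
    return broadcast(reduce(values, op, root=0), root=0)


def scan(values, op=add):
    """Inclusive prefix reduction: rank i holds op applied over ranks 0..i."""
    return list(accumulate(values, op))


if __name__ == "__main__":
    per_rank = [1, 2, 3, 4]  # one value per simulated rank
    print(broadcast(per_rank, root=2))
    print(reduce(per_rank))
    print(all_reduce(per_rank))
    print(scan(per_rank))
```

Real implementations such as ring, two-tree, or Rabenseifner all-reduce produce the same result as `all_reduce` here; they differ in how the data moves between processes, which this sketch deliberately omits.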

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Summary

DORA: A Scalable Asynchronous RL System for LLM Training

Motivation and Problem Setting

System Design and Technical Contributions

Experimental Results

Analysis, System Context, and Comparative Positioning

Theoretical and Practical Implications

Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

What is this paper about?

What questions did the researchers ask?

How did they do it? (Methods)

What did they find, and why is it important?

What’s the bigger impact?

Knowledge Gaps

Unresolved gaps, limitations, and open questions

Practical Applications

Practical Applications of “DORA: A Scalable Asynchronous Reinforcement Learning System for LLM Training”

Immediate Applications

Long-Term Applications

Glossary

Open Problems

Continue Learning

Collections

Tweets
