Efficient RL Training for LLMs with Experience Replay
Abstract: While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.
Explain it Like I'm 14
Overview
This paper looks at a simple idea from reinforcement learning (RL) and applies it to training LLMs: instead of generating new training examples and throwing them away after one use, keep them in a “replay buffer” and reuse them. The goal is to save a lot of computation (and money) while keeping the model just as good—or even better—at solving problems like math and coding.
Key Questions
- Can LLMs learn well even if we reuse older training examples, instead of always generating fresh ones?
- How should we design a replay buffer (how big it is, how often we reuse items) to get the best trade-off between speed and accuracy?
- Does reusing examples hurt the model’s diversity (the variety of answers it can produce) or stability (how smoothly training goes)?
- When generation (making new examples) is expensive, is “strictly fresh data” really the best way to train?
Methods and Approach
Think of training as two teams:
- Inference workers: like “writers” who generate new model attempts (called “rollouts”) on problems.
- Trainers: like “editors” who use those rollouts to update the model.
Normally, writers create a batch of rollouts, editors use them once, and then they’re discarded. That’s called on-policy training (always using the latest model’s freshly generated data).
This paper adds a replay buffer—a shared “notebook” that stores past rollouts. Editors can sample from the notebook multiple times. This reduces how often writers need to generate new rollouts, which saves compute.
Key ideas explained in everyday terms:
- Off-policy (staleness): Training on rollouts made by older versions of the model. It’s like studying from last week’s notes instead of today’s—maybe slightly out of date, but still useful.
- Diversity: If you reuse the same examples too often, you see less variety. This can make learning worse. The paper looks at “global” diversity (how many times you reuse an example overall) and “local” diversity (how often you reuse it back-to-back).
- Replay ratio: How many times, on average, a saved rollout gets reused. Higher replay ratio means more reuse and less new generation.
- Compute efficiency: Training cost is mostly two parts—editing steps (trainer compute) and writing new rollouts (inference compute). Reusing reduces the writing part.
- Asynchronous training: Writers and editors run in parallel, each with their own copy of the model that gets updated periodically. This setup is common in real LLM training.
- Theory (bias-variance trade-off): The authors model how “using older data” introduces some bias and extra noise, but also how it reduces expensive generation. They show there’s a sweet spot—an optimal buffer size and reuse level—especially when generation is costly.
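To make the "notebook" idea concrete, here is a minimal sketch of a first-in, first-out replay buffer with uniform sampling. It is an illustration under stated assumptions, not the paper's implementation: the names `Rollout`, `FifoReplayBuffer`, `policy_step`, and `uses` are hypothetical, and sampling with replacement is an assumed choice.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str          # the generated attempt
    reward: float      # e.g. 1.0 if the answer was verified correct
    policy_step: int   # trainer step of the model copy that wrote it
    uses: int = 0      # how many times it has been sampled so far


class FifoReplayBuffer:
    """Fixed-capacity buffer: the oldest rollout is evicted first."""

    def __init__(self, capacity: int, seed: int = 0):
        # A deque with maxlen silently drops the oldest item when full.
        self.items = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, rollout: Rollout) -> None:
        self.items.append(rollout)

    def sample(self, batch_size: int, current_step: int):
        """Uniformly sample a batch (with replacement, an assumption here)
        and return it together with its mean staleness in trainer steps."""
        batch = self.rng.choices(list(self.items), k=batch_size)
        for rollout in batch:
            rollout.uses += 1
        staleness = [current_step - r.policy_step for r in batch]
        return batch, sum(staleness) / len(staleness)


buf = FifoReplayBuffer(capacity=4)
for step in range(6):
    buf.add(Rollout(text=f"attempt {step}", reward=float(step % 2), policy_step=step))
batch, mean_staleness = buf.sample(batch_size=8, current_step=6)
# Average reuse of the rollouts currently held in the buffer.
replay_ratio = sum(r.uses for r in buf.items) / len(buf.items)
```

In this toy run the buffer holds 4 rollouts and serves 8 samples, so each stored rollout is reused twice on average. The staleness of each sample is simply the number of trainer steps since it was generated.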
Experiments:
- Models: Qwen3-0.6B and Qwen2.5-7B.
- Tasks: Math reasoning datasets (like MATH and OpenR1-Math-220k).
- They vary buffer size, the ratio of writers to editors, and sampling rules.
- They measure accuracy, pass@k (how often the correct answer appears among k attempts), training stability, and compute used.
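For readers who want to see how pass@k can be computed, the sketch below uses the standard unbiased estimator popular in code-generation benchmarks: generate n attempts per problem, count the c correct ones, and compute the chance that a random subset of k attempts contains at least one correct answer. Whether the paper uses exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n attempts of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the subsets of size k with no correct attempt;
    # it is 0 whenever k > n - c, which gives pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 attempts, 3 of them correct: more attempts allowed means a higher score.
scores = {k: pass_at_k(n=10, c=3, k=k) for k in (1, 5, 10)}
```

With k = 1 the estimate reduces to plain accuracy (c / n), which is why gains that show up mainly at larger k are read as a sign of more varied correct answers.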
Main Findings
- Big compute savings with little to no accuracy loss:
- With a well-chosen buffer, they save up to about 40% of compute while matching or even beating the no-buffer baseline.
- Savings come from reusing rollouts instead of constantly generating fresh ones.
- Training becomes more stable:
- Replay buffers smooth training and reduce crashes or sudden drops in performance.
- Some setups reach higher peak accuracy than strictly on-policy training.
- Output diversity is preserved (and sometimes improved):
- Pass@k scores go up, especially for larger k, meaning the model produces a broader set of good answers.
- Reusing a mix of older samples can act like a regularizer—helping the model avoid overfitting to the latest data.
- There’s a trade-off:
- Too much reuse (very high replay ratio) or too stale data can eventually hurt performance.
- Moderate buffer sizes and reuse rates tend to work best.
- If generating rollouts is expensive, strict on-policy is not optimal; replay becomes the better choice.
- Wall-time (actual clock time) can also improve:
- Buffers help avoid “stalling” when writers or editors wait on each other, keeping the pipeline flowing.
- Simple tweaks can help more:
- “Positive-bias” sampling (keeping more correct rollouts) and alternative losses (like AsymRE) can further stabilize training and boost results in some cases.
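One possible reading of positive-bias sampling can be sketched as follows: reserve a fraction δ of each batch for the freshest correct rollouts and fill the rest uniformly from the buffer. This is an interpretation for illustration only; the function name `positive_biased_batch`, the tuple layout, and the fill rule are assumptions rather than the paper's specification.

```python
import random


def positive_biased_batch(buffer, batch_size, delta, rng):
    """buffer: list of (policy_step, is_correct) tuples.

    Returns a batch whose first slots hold the freshest correct rollouts
    (up to delta * batch_size of them); the remaining slots are drawn
    uniformly, with replacement, from the whole buffer.
    """
    n_positive = int(round(delta * batch_size))
    correct = [r for r in buffer if r[1]]
    # Freshest first: a larger policy_step means more recently generated.
    freshest = sorted(correct, key=lambda r: r[0], reverse=True)[:n_positive]
    rest = rng.choices(buffer, k=batch_size - len(freshest))
    return freshest + rest


# Toy buffer: rollouts from steps 0..9, correct at steps 0, 3, 6, and 9.
toy_buffer = [(step, step % 3 == 0) for step in range(10)]
batch = positive_biased_batch(toy_buffer, batch_size=8, delta=0.25, rng=random.Random(0))
```

If the buffer holds fewer correct rollouts than the reserved slots, the sketch simply fills the gap with uniform samples, so the batch size stays fixed.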
Why It Matters
LLM RL training is very expensive because you constantly need new model-generated examples. This paper shows a practical way to cut that cost: reuse what you already made. The authors provide both theory and experiments showing that:
- Reuse can be safe and effective.
- There’s an optimal balance between fresh and reused data.
- You can get more “accuracy per unit of compute,” which is ideal for real-world systems with limited budgets.
This shifts the focus from “best per step” to “best per compute.” It means labs and companies can train reasoning-capable LLMs more efficiently, making advanced models more accessible. In the future, smarter sampling rules and off-policy techniques could push these gains even further, especially for larger, frontier-scale models.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
- Scaling and generality: Do the reported 20–40% compute savings and stability gains hold for frontier-scale LLMs, longer contexts, and other domains (e.g., code generation, multimodal RL, dialogue safety) beyond math reasoning?
- Algorithm coverage: How do findings transfer to other post-training algorithms (e.g., PPO, DPO/RPO variants, V-trace, BCO/BC+RL hybrids) and different advantage estimators or KL penalties typical in RLHF pipelines?
- Theory–practice gap: Theoretical analysis assumes a synchronous FIFO buffer and idealized bias/variance conditions, while experiments are asynchronous; how do results change under realistic asynchronous staleness, parameter lag, and queue dynamics?
- Unidentified parameters: Theoretical optima depend on unobserved quantities (κ for bias, ρ for correlation, and σ(·) for off-policy variance growth); how can these be estimated online and used to set the buffer parameters adaptively?
- Importance sampling and corrections: What is the comparative impact of IS, truncated IS, V-trace, Retrace, or no-IS (e.g., AsymRE) under varying staleness and replay ratios on variance, stability, and final quality?
- Adaptive control: Can we design controllers that tune buffer size, replay ratio, and related settings on the fly from observed KL(stale,current), effective sample size, loss variance, or performance signals to stay on the Pareto frontier?
- Diversity mechanics: Local vs global diversity were hypothesized to have different effects; can we causally disentangle and quantify their contributions and develop sampling rules that explicitly enforce local diversity (beyond “without replacement,” which was inconclusive)?
- Exploration impact: Does replay dampen exploration and novelty in tasks requiring it (sparse/long-horizon rewards)? How to inject novelty-aware or uncertainty-aware sampling without sacrificing efficiency?
- Buffer design space: Systematic evaluation of prioritized replay (e.g., by reward, margin, uncertainty), reservoir sampling, deduplication, diversity-constrained sampling, age- or KL-weighted sampling, and hybrid FIFO+prioritization is missing.
- Positive-bias sampling risks: How does favoring “correct” rollouts affect exploration, reward hacking, or overfitting to easy solutions over time? What safeguards are needed to avoid bias amplification?
- Safety and content filtering: How should buffers handle toxic, reward-hacked, or unsafe trajectories that may be repeatedly replayed and reinforced?
- Collapse and failure modes: Training often attains a peak then collapses; what mechanisms cause this and how exactly does replay mitigate it? Can we develop diagnostics and interventions (e.g., entropy/KL schedules, replay annealing) to prevent collapse?
- Entropy preservation: The abstract claims entropy preservation; comprehensive measurement of policy entropy over training and its relation to pass@k and replay configuration is not provided.
- Compute accounting realism: The compute model uses a fixed μ; in practice μ evolves (e.g., with sequence lengths, verification costs). How to model and exploit dynamic μ for scheduling and buffer sizing?
- Systems overhead: Memory/communication costs of large, sharded buffers and their effect on throughput and latency are not quantified; what are the scaling limits and engineering trade-offs?
- Reward pipeline variability: How do replay strategies perform with different reward types (pass/fail, dense, learned RMs, execution-based, multi-objective) and variable-latency evaluators?
- Off-policiness metric: “Steps since generation” is a crude staleness proxy; would KL divergence to the current policy, effective sample size, or IS weight dispersion be more informative and controllable targets?
- Hyperparameter coverage: Only learning rate was swept extensively; sensitivity to batch size, KL strength, advantage normalization, reward scales, and sequence truncation remains unexplored for replayed training.
- Multi-domain and continual learning: Can replay mitigate catastrophic forgetting across mixed-task post-training, and how should buffers be partitioned or scheduled across domains?
- Correlation-induced bias: Theoretical bias from sample–iterate coupling is acknowledged but not measured; can we empirically quantify this bias and test debiasing strategies (e.g., thinning, decorrelated sampling)?
- Asynchronous staleness control: What is the optimal cadence for synchronizing worker/trainer weights under replay, and how do push/pull frequencies interact with buffer parameters?
- Partial/segment replay: Would token- or segment-level replay (e.g., reusing initial reasoning prefixes only) improve credit assignment and diversity compared to full-trajectory replay?
- Long-horizon effects: How does replay affect length and structure of reasoning traces, and credit assignment over multi-step solutions?
- Robustness metrics: Beyond accuracy and pass@k, how does replay influence calibration, variance across seeds, robustness to prompt perturbations, and hallucination/formatting error rates?
- Generalization of preliminary refinements: Positive-bias sampling and AsymRE showed promise on small models; do these benefits persist at larger scales and across tasks, and what are the best practices to deploy them safely?
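Two of the diagnostics named in the list above, truncated importance weights and effective sample size, can be computed from sequence log-probabilities alone. The sketch below assumes each rollout comes with its total log-probability under the current policy and under the policy that generated it; the function names and the sequence-level treatment are illustrative assumptions, not the paper's method.

```python
import math


def importance_weights(logp_current, logp_behavior, clip=None):
    """Likelihood ratios pi_current / pi_behavior per rollout.

    If clip is given, each ratio is truncated at that value
    (truncated importance sampling).
    """
    weights = [math.exp(c - b) for c, b in zip(logp_current, logp_behavior)]
    if clip is not None:
        weights = [min(w, clip) for w in weights]
    return weights


def effective_sample_size(weights):
    """(sum w)^2 / sum w^2: equals len(weights) when all weights are equal
    and approaches 1 when a single weight dominates."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Fresh data: identical log-probs, so every weight is 1 and nothing is lost.
fresh = importance_weights([-12.0, -9.5, -20.0], [-12.0, -9.5, -20.0])
# Stale data: one rollout has become much more likely under the current policy.
stale = importance_weights([-8.0, -9.5, -20.0], [-12.0, -9.5, -20.0], clip=5.0)
ess_fresh = effective_sample_size(fresh)
ess_stale = effective_sample_size(stale)
```

A falling effective sample size signals that a batch behaves like fewer independent samples than its nominal size, which makes it a candidate control target alongside steps since generation.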
Practical Applications
Overview
This paper shows that adding a well‑designed experience replay buffer to RL post‑training of LLMs can cut inference (generation) compute by up to ~40% without hurting—and sometimes improving—final accuracy and output diversity. It provides (1) an implementation blueprint for asynchronous pipelines (with inference workers W, trainers T, buffer size N); (2) a theory quantifying the optimal trade‑off among staleness, sample diversity, and rollout/training cost ratio μ; and (3) empirical guidance on stable hyperparameter regimes and sampling strategies (e.g., FIFO, positive‑biased sampling, and the AsymRE loss).
Below are practical applications grouped by deployment horizon. Each item highlights sector(s), what to do, likely tools/workflows, and key assumptions/dependencies.
Immediate Applications
These can be piloted and deployed now with modest changes to current RL fine-tuning (RLFT) pipelines.
- Industry (AI labs, foundation model teams): Compute‑efficient RL post‑training for reasoning/coding LLMs
- What to do: Integrate a FIFO replay buffer into existing asynchronous PPO/GRPO‑style pipelines; tune W/T and buffer size N to target a desired compute ratio γ and replay ratio; monitor off‑policiness and diversity metrics.
- Tools/workflows:
- Add a sharded replay buffer service (in trainer processes) with uniform sampling; track per‑sample “staleness,” replay ratio, and time‑since‑last‑use.
- Use the paper’s compute ratio formula γ = (1 + W/T) / (1 + μ) and target moderate replay ratios (e.g., 2–6) with buffer sizes that increase local diversity.
- Add dashboards for staleness/diversity metrics and pass@k.
- Assumptions/dependencies: Asynchronous architecture with sufficient memory/IO for buffering; tasks with nontrivial rollout costs (large μ); careful LR tuning; validated on 0.6B and 7B models (porting to frontier scales remains to be verified).
- Software sector (code assistants, reasoning copilots)
- What to do: Use experience replay during RLFT on code/math datasets to reduce GPU hours and stabilize training (it helps avoid collapse and preserves or improves pass@k).
- Tools/workflows: TRL‑like trainers with buffer plugins; per‑project knobs for W/T and N; A/B compute‑per‑accuracy reporting.
- Assumptions/dependencies: Rollouts are long enough to dominate cost; evaluation covers pass@1 and pass@k; reward model variance under off‑policy is managed.
- Cloud providers and MLOps platforms: Cost‑aware RLFT offerings
- What to do: Offer a “replay‑optimized” RLFT mode with auto‑tuning of W/T, N, and batch/replay ratios; expose compute savings estimates and wall‑time smoothing.
- Tools/workflows:
- Auto‑scheduler that estimates μ online and sets W/T; smooth queues using buffers to reduce trainer/inference stalls.
- Built‑in monitoring for staleness/diversity and early‑warning for off‑policy variance spikes.
- Assumptions/dependencies: Accurate μ estimation; multi‑tenant scheduling constraints; storage bandwidth for replay.
- Academia and education: Lower‑cost RL for LLM research and teaching
- What to do: Adopt replay buffers to reduce inference GPU needs in academic RLFT experiments; use the paper’s diagnostics to teach off‑policy trade‑offs.
- Tools/workflows: Open‑source buffer modules; teaching labs demonstrating impact of W/T, N, and sampling strategies on compute and stability.
- Assumptions/dependencies: Small/mid‑scale models, open datasets; reproducible pipelines.
- Energy and sustainability: Reduced carbon footprint for RLFT
- What to do: Account for and report compute/energy reductions achieved by replay; integrate into sustainability metrics and disclosures.
- Tools/workflows: Emission dashboards that separate inference vs. training energy; “performance per kWh” KPIs.
- Assumptions/dependencies: Comparable task outcomes; accurate power metering; governance alignment.
- Training stability and quality assurance across sectors (healthcare, finance, compliance RAG/reasoning)
- What to do: Use moderate off‑policiness via replay to stabilize RLFT and preserve entropy/pass@k; add guardrails to prevent late‑stage collapse.
- Tools/workflows: Early‑stopping tuned on pass@k; staleness ceilings; LR schedules co‑tuned with replay; reward variance checks.
- Assumptions/dependencies: Extra QA for safety‑critical domains; data governance for storing trajectories; stronger eval suites for distribution shift from off‑policy data.
- Pipeline engineering: Wall‑time smoothing in asynchronous systems
- What to do: Use replay buffers to decouple production/consumption, reducing stalls from variable reward latencies and queue contention.
- Tools/workflows: Queue telemetry and back‑pressure mitigation; buffer size tuned for throughput targets; autoscaling trainers vs. workers.
- Assumptions/dependencies: Existing async design; profiling to size buffers correctly.
- Sampling refinements (pilot scale): Positive‑biased sampling and AsymRE loss
- What to do: Pilot keeping a fraction δ of freshest correct rollouts alongside FIFO entries; test AsymRE to reduce importance‑ratio variance under higher staleness.
- Tools/workflows: Label correct/incorrect rollouts in buffer metadata; enable switchable loss heads (GRPO vs. AsymRE); offline ablations.
- Assumptions/dependencies: Early evidence promising but limited; requires reward labeling quality and careful hyperparameter control.
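The compute ratio quoted in the first item above, γ = (1 + W/T) / (1 + μ), is easy to evaluate and to invert for planning. The sketch below is a back-of-the-envelope helper; the function names are hypothetical, and the example values of W, T, and μ are made up for illustration rather than taken from the paper.

```python
def compute_ratio(workers: int, trainers: int, mu: float) -> float:
    """gamma = (1 + W/T) / (1 + mu), as quoted above.

    mu is the rollout/training cost ratio; gamma < 1 means the buffered
    setup spends less compute per update than the no-buffer baseline.
    """
    return (1.0 + workers / trainers) / (1.0 + mu)


def worker_trainer_ratio(target_gamma: float, mu: float) -> float:
    """Invert the formula: the W/T needed to hit a target gamma."""
    return target_gamma * (1.0 + mu) - 1.0


# Illustrative numbers only: generation costs as much as training (mu = 1).
gamma_baseline = compute_ratio(workers=4, trainers=4, mu=1.0)  # W/T == mu
gamma_replay = compute_ratio(workers=2, trainers=4, mu=1.0)    # half the workers
needed_ratio = worker_trainer_ratio(target_gamma=0.75, mu=1.0)
```

By the formula, γ equals 1 whenever W/T equals μ, and a negative result from the inverse means the target is unreachable because trainer compute alone already exceeds it.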
Long‑Term Applications
These require further research, scaling, or engineering before broad adoption.
- Frontier‑scale RLFT (very large LLMs across domains)
- What to do: Validate replay‑based efficiency/stability at 30–400B+ parameters; extend theory/metrics to long‑context rollouts, complex reward models, and multi‑objective RLHF.
- Tools/workflows: High‑throughput, distributed replay services; memory‑efficient storage (compressed rollouts); adaptive staleness controls; per‑layer off‑policy diagnostics.
- Assumptions/dependencies: New failure modes at scale; reward drift; stronger off‑policy bias corrections.
- Robust off‑policy optimization for sequence models
- What to do: Develop losses and corrections tailored to buffered off‑policy data (e.g., improved variants of AsymRE, tempered importance weights, coupling‑aware estimators).
- Tools/workflows: Library support for off‑policy‑robust RL losses; theoretical guarantees integrated as constraints in auto‑tuning systems.
- Assumptions/dependencies: Better modeling of bias due to sample‑iterate coupling; empirical validation on diverse tasks (dialogue, tool use, program synthesis).
- Auto‑tuning and scheduling systems for compute‑optimal RLFT
- What to do: Build controllers that learn μ online, optimize W/T, N, and replay ratio B/R to maximize accuracy per compute under dynamic cluster conditions.
- Tools/workflows: Bayesian optimization or bandit controllers; reward‑latency predictors; staleness budgets; SLA‑aware schedulers.
- Assumptions/dependencies: Reliable telemetry; safe exploration of hyperparameters without destabilizing training.
- Sector‑specific RLFT with constrained data (healthcare, finance, legal)
- What to do: Apply replay buffers under strict privacy/governance constraints; explore on‑prem or secure enclaves storing encrypted trajectories; differential privacy over stored rollouts.
- Tools/workflows: Encrypted or hashed replay storage; retention policies; DP‑aware sampling; auditing of off‑policy influence.
- Assumptions/dependencies: Regulatory acceptance of temporary trajectory storage; privacy budgets and degradation management.
- Robotics and embodied agents using language‑conditioned policies
- What to do: Extend replay to LLM‑based planners where rollouts (simulation episodes or real‑world trials) are expensive; reuse stored episodes to reduce sim time.
- Tools/workflows: Cross‑modal buffers (text, actions, sensor logs); prioritization for rare successes (HER‑style); sim‑to‑real aware staleness controls.
- Assumptions/dependencies: Alignment between text‑policy evolution and environment dynamics; handling non‑stationarity and long‑horizon credit assignment.
- Agent systems with online learning from user interactions
- What to do: For enterprise assistants/agents, buffer consented user trajectories and replay for periodic RL updates, reducing fresh data collection.
- Tools/workflows: Privacy‑first data pipelines; on‑device or federated buffering; staleness decay rules; opt‑in policies and retention limits.
- Assumptions/dependencies: Consent and governance; bias from over‑reusing past user patterns; guardrails to preserve diversity.
- Standards and policy: Compute/energy reporting and efficiency benchmarks
- What to do: Develop benchmarks and reporting standards that account for inference vs. training compute, staleness/diversity metrics, and accuracy‑per‑compute Pareto frontiers.
- Tools/workflows: Shared evaluation suites; emissions accounting frameworks; procurement guidelines favoring compute‑efficient RLFT.
- Assumptions/dependencies: Community buy‑in; comparable metrics across heterogeneous pipelines.
- Advanced buffer designs and sampling policies
- What to do: Investigate prioritized replay for LLMs (e.g., by reward, gradient norm, uncertainty), uniform‑without‑replacement to boost local diversity, and curriculum‑aware buffering.
- Tools/workflows: Pluggable sampling backends; uncertainty estimators (e.g., ensembles or variance proxies); curriculum schedulers that modulate staleness horizons.
- Assumptions/dependencies: Avoiding bias amplification; scalable indexing and sampling; thorough fairness and safety evaluation.
- Safety and reliability improvements via replay‑aware monitoring
- What to do: Use staleness and diversity signals to detect instability, entropy collapse, or mode dominance; enforce policies that cap staleness or increase diversity dynamically.
- Tools/workflows: Policy entropy monitors; adaptive buffer decay; intervention playbooks that alter W/T or N upon drift.
- Assumptions/dependencies: Reliable early‑warning indicators; low‑latency control loops to prevent collapse without harming progress.
Glossary
- Asynchronous training: Training regime where rollout generation and parameter updates proceed concurrently with potentially stale model copies. "is sometimes referred to as asynchronous training."
- AsymRE: An off-policy variant of REINFORCE that asymmetrically treats positive and negative rewards to improve stability with stale data. "we replace GRPO with the AsymRE loss"
- back-pressured: Temporarily slowed due to downstream bottlenecks in an asynchronous pipeline. "trainers are temporarily back-pressured."
- bias-variance decomposition: Analytical split of error into bias and variance components to study estimator behavior. "bias-variance decomposition in stochastic gradient descent"
- compute budget: The total amount of compute resource available or used for training. "Given an asymptotically large compute budget "
- compute ratio: The relative per-update compute cost with a buffer versus without a buffer. "We define the compute ratio of a buffer configuration to be"
- DDPG: Deep Deterministic Policy Gradient, an off-policy actor-critic algorithm for continuous control. "and DDPG (Lillicrap et al., 2015)"
- DQN: Deep Q-Network, a value-based deep reinforcement learning algorithm. "algorithms like DQN (Mnih et al., 2015)"
- first-in, first-out buffer: A buffer where the oldest items are evicted first as new items arrive. "a first-in, first-out buffer"
- GRPO: Group Relative Policy Optimization, a PPO-style reinforcement learning objective used in LLM fine-tuning pipelines. "methods like PPO or GRPO"
- Hindsight Experience Replay: A replay technique that relabels goals to learn from unsuccessful trajectories. "Hindsight Experience Replay (Andrychowicz et al., 2017)"
- importance ratio correction: Weighting off-policy samples by likelihood ratios to correct distribution mismatch. "AsymRE does not feature importance ratio correction"
- importance sampling: A technique to estimate expectations under one distribution using samples from another via reweighting. "While importance sampling corrects the marginal distribution mismatch"
- inference workers: Processes/GPUs dedicated to generating trajectories for training. "the GPUs are often split between inference workers and trainers"
- IQR: Interquartile range, the difference between the 75th and 25th percentiles as a dispersion measure. "We report the median and IQR over 10 seeds."
- L-smooth: Having Lipschitz-continuous gradients with constant L (smoothness of the objective). "is non-negative, differentiable, and L-smooth"
- LLM post-training: The stage after pretraining where LLMs are further trained (e.g., via RL) to improve capabilities. "remains largely unexplored in LLM post-training"
- minibatch: A small subset of samples used to compute a stochastic gradient update. "sample a minibatch of size "
- non-convex stochastic optimization: Optimization of non-convex objectives using noisy gradient estimates. "classical non-convex stochastic optimization framework"
- off-policy: Using data generated by a different policy than the one currently being optimized. "off-policy data"
- off-policy corrections: Adjustments (e.g., importance weighting) to account for distribution mismatch in off-policy learning. "and off-policy corrections"
- off-policy degradation: Performance drop attributed to training on off-policy data. "to avoid off-policy degradation"
- off-policiness: Degree of mismatch or staleness between the data-generating policy and the current policy. "degree of off-policiness"
- off-policiness horizon: The maximum staleness of samples retained in the buffer, often expressed as N/R. "off-policiness horizon"
- Pareto frontier: The set of non-dominated trade-off points between competing objectives. "the Pareto frontier"
- pass@k: Evaluation metric indicating success if the correct answer appears among k generated attempts. "improving pass@k metrics"
- policy entropy: Entropy of a policy’s action distribution, reflecting output diversity. "while preserving policy entropy."
- policy gradient algorithm: Methods that estimate and ascend the gradient of expected return with respect to policy parameters. "policy gradient algorithm"
- positive-bias sampling: A replay sampling strategy that increases the fraction of correct/positive-reward trajectories. "positive-bias sampling"
- PPO: Proximal Policy Optimization, a policy gradient method with clipped updates for stability. "methods like PPO or GRPO"
- Prioritized Experience Replay: Replay strategy that samples experiences with probability proportional to a priority measure. "Prioritized Experience Replay"
- replay buffer: A storage of past trajectories from which training batches are sampled, enabling reuse. "A replay buffer can be implemented as follows:"
- replay ratio: The average number of times a sample is reused for gradient updates. "The average replay ratio is 1.78, 3.42 and 7.0"
- rollouts: Sequences of states, actions, and rewards generated by running the policy. "rollouts are generated, used for a single gradient update, and immediately discarded."
- sigma-field: A collection of sets defining the measurable events in a probability space (σ-algebra). "the σ-field associated to the sequence ."
- sharded: Partitioned across multiple machines/threads/processes to distribute storage or computation. "the buffer is sharded across trainers"
- Soft Actor-Critic: An off-policy maximum-entropy actor-critic algorithm. "Soft Actor-Critic (Haarnoja et al., 2018)"
- staleness horizon: The time span over which stale samples are retained and reused. "staleness horizon "
- stationary points: Parameter values where the gradient is zero, indicating potential optima or saddle points. "toward stationary points"
- steps-since-last-use: A local diversity metric counting gradient steps since a sample’s last inclusion in a batch. "steps-since-last-use"
- stochastic gradient descent: Optimization method using gradients computed from random subsets of data. "stochastic gradient descent"
- synchronous setups: Training regime where rollout generation and optimization proceed in lockstep. "Synchronous setups also exist"
- transfer queue: The queue through which generated trajectories are passed from inference to training processes. "usually via a transfer queue."
- wall-time: Real elapsed clock time for training, as opposed to abstract compute units. "than wall-time"