DUPO RL: Duplicating Sampling Policy Optimization

Updated 11 July 2025
  • Duplicating Sampling Policy Optimization (DUPO) is an RL algorithm that dynamically filters and duplicates diverse rollout samples to enhance training efficiency in complex environments.
  • It integrates dual-stage sampling—including pre-training filtering and in-training duplication—to achieve 2–3× speedup and improved convergence over traditional methods.
  • DUPO’s token-level optimization and variance-aware filtering yield stable policy updates, making it ideal for multi-turn web agents facing slow rollouts and sparse rewards.

Duplicating Sampling Policy Optimization (DUPO) is a reinforcement learning (RL) training algorithm developed for efficient and stable policy optimization in complex, long-horizon environments, particularly in the context of web agent training for information-seeking tasks. DUPO integrates two dynamic sampling strategies, one applied before training and one during training, to maximize sample efficiency and accelerate convergence, primarily by strategically duplicating rollout samples that exhibit meaningful variance in outcomes. It is designed to overcome the sample inefficiency and bottlenecks inherent in conventional agent RL, particularly in settings with slow environment rollouts and sparse rewards (Li et al., 3 Jul 2025).

1. Objectives and Motivation

DUPO was introduced in the WebSailor system to address the unique challenges encountered in multi-turn, tool-using LLM agents that interact with web environments. Traditional RL rollouts in such domains are slow due to the asynchronous and often expensive interaction with external information sources. Moreover, the sparse reward structure typical in long-horizon, information-seeking tasks further hampers the stability and efficiency of policy optimization.

By integrating a dual-stage dynamic sampling mechanism, DUPO ensures that RL updates focus on high-value, diverse rollout scenarios. This is achieved by:

  • Filtering out trivial cases prior to training (those with little or no informational challenge, e.g., all-correct or all-incorrect rollouts), and
  • Dynamically duplicating, during training, samples from groups whose rewards have non-zero standard deviation, maintaining batch diversity and learning-signal strength while avoiding additional sequential environment rollouts to refill each batch.

This approach enables DUPO to achieve a 2–3× speedup relative to preceding dynamic-sampling schemes (such as DAPO), while also improving stability and convergence in policy optimization (Li et al., 3 Jul 2025).

2. Mathematical Formulation

The training objective in DUPO is defined at the token level and adopts a clipped policy-gradient structure inspired by modern PPO variants. Specifically, the objective is:

$$
J(\theta) = \mathbb{E}_{(q, y) \sim \mathcal{D},\ \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid \text{context})} \left\{ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \mathrm{clip}\!\left(r_{i,t}(\theta),\, 1 - \epsilon_\text{low},\, 1 + \epsilon_\text{high}\right) \hat{A}_{i,t} \right) \right\}
$$

Where:

  • $r_{i,t}(\theta) = \dfrac{\pi_\theta(o_{i,t} \mid \text{context})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid \text{context})}$ is the token-level importance sampling ratio,
  • $\hat{A}_{i,t} = \dfrac{R_i - \operatorname{mean}(\{R_i\}_{i=1}^G)}{\operatorname{std}(\{R_i\}_{i=1}^G)}$ is the group-relative advantage estimator, with $R_i$ denoting the reward for rollout $i$,
  • $\epsilon_\text{low}, \epsilon_\text{high}$ are the clipping hyperparameters.

This design ensures policy updates are constrained, analogous to PPO, and are computed only for non-trivial (diverse) sample groups.
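
As a concrete reference, the following is a minimal NumPy sketch of how this objective could be evaluated for one group of rollouts. The function name, argument layout, and the clipping values are illustrative assumptions for this article, not the WebSailor implementation.

```python
import numpy as np

def dupo_clipped_objective(logps_new, logps_old, advantages,
                           eps_low=0.2, eps_high=0.3):
    """Token-level clipped surrogate for one group of G rollouts (a sketch).

    logps_new, logps_old : lists of 1-D arrays, one per rollout o_i, holding
                           per-token log-probabilities of the sampled tokens
                           under pi_theta and pi_theta_old respectively.
    advantages           : length-G array of group-relative advantages A_i
                           (each A_i applies to every token of rollout i).
    eps_low, eps_high    : asymmetric clipping bounds (illustrative values).
    """
    total_tokens = sum(len(lp) for lp in logps_new)              # sum_i |o_i|
    objective = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))  # r_{i,t}(theta)
        clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
        objective += np.minimum(ratio * adv, clipped * adv).sum()
    return objective / total_tokens
```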

3. Dynamic Sampling and Duplication Mechanisms

DUPO’s learning batch is constructed via a two-tier sampling process:

  • Pre-Training Filtering: Overly simple or trivial cases—identified as having zero standard deviation in their group rewards (i.e., all rollouts correct/incorrect)—are removed prior to policy update. This pruning eliminates cases that would fail to provide a meaningful learning signal.
  • In-Training Duplication: To prevent batch size shrinkage and maximize computational efficiency, samples from other groups with non-zero reward standard deviation are duplicated to fill the batch as needed. This duplication replaces the slower alternative of requesting fresh rollouts from the environment, thus accelerating the training loop by leveraging challenging cases.

Only samples containing diversity (i.e., potential for policy improvement) are used to drive updates, and their availability is maximized through duplication rather than costly rollout generation.
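
A compact sketch of this filter-then-duplicate batch construction is given below. The data layout (a list of per-query groups, each carrying a "rewards" array) and all names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def build_dupo_batch(groups, batch_size, rng=None):
    """Filter-then-duplicate batch construction (illustrative sketch).

    groups : list of dicts, each holding the G rollout rewards of one query
             under the key "rewards" (this layout is an assumption).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Filtering: drop groups whose rollouts are all correct or all incorrect
    # (zero reward standard deviation), since they carry no learning signal.
    diverse = [g for g in groups if np.std(g["rewards"]) > 0.0]
    if not diverse:
        return []  # nothing informative to update on this step

    batch = list(diverse)
    # Duplication: refill the batch to its target size by re-sampling existing
    # diverse groups instead of requesting fresh (slow) environment rollouts.
    while len(batch) < batch_size:
        batch.append(diverse[rng.integers(len(diverse))])
    return batch
```

The key design choice is that the refill step touches no environment: duplicated groups reuse rollouts that already exist, which is what yields the speedup over schemes that top up batches with new rollouts.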

4. Comparison with Related Methods

DUPO distinguishes itself from established RL policy optimization methods through its batch construction and gradient estimation procedures:

  • PPO and TRPO rely on trust-region updating and clipping for monotonic improvement, but operate on fixed or dynamically generated episode batches, with no explicit duplication or filtering mechanism (1502.05477).
  • DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) also uses dynamic sampling but fills out batches via additional environment rollouts, potentially incurring added latency.
  • POIS (Metelli et al., 2018) and P3O (Fakoor et al., 2019) manage variance and control off-policy bias via importance sampling and surrogate objectives, but do not include explicit batch sample duplication as a means to speed up RL learning.
  • OMPO (Luo et al., 29 May 2024) and variance regularization approaches (Islam et al., 2022) focus on matching distributional statistics or explicitly penalizing high-variance updates, whereas DUPO’s unique contribution is the combination of group-relative advantage estimation with batch-filling sample duplication.

A comparative summary:

| Method | Batch Construction | Variance Control | Off-Policy Reuse | Sample Efficiency (per data) |
|---|---|---|---|---|
| DUPO | Dynamic + duplication | Group std filter | Rollouts only, re-duplication | 2–3× over DAPO |
| PPO/TRPO | Fixed/dynamic | Clipping/KL | No | Baseline |
| DAPO | Dynamic + new rollouts | N/A | New rollouts | Slower than DUPO |
| POIS/P3O | Importance sampling | IS penalty/ESS | Yes | Task dependent |
| OMPO | Buffer/discriminator | Divergence term | Yes | High (see Luo et al., 29 May 2024) |

5. Advantage Estimation and Loss Calculation

The group-relative advantage estimator is critical to DUPO’s effectiveness in long-horizon, multi-turn domains. By centering and scaling rollout rewards within a group:

  • Only samples with meaningful learning signal (i.e., non-trivial variance in outcome) are retained, with the rest filtered.
  • Token-level optimization is emphasized, and loss computation explicitly masks out environment observation tokens, focusing gradient updates on the generative (agent-driven) parts of the output.

This design is intended to enhance stability and provide strict normalization for cases in which reward signals are distributed very unevenly across the input space.
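
The sketch below illustrates these two ingredients together: group-relative advantage normalization and masking of environment-observation tokens. The helper names are assumptions for this article, and the small eps term is a numerical guard added here even though zero-variance groups are filtered out before this point.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (R_i - mean(R)) / std(R) within one group of G rollouts.
    eps guards against division by zero; in DUPO, zero-variance groups
    are already filtered out, so it rarely matters."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def masked_token_loss(per_token_loss, is_agent_token):
    """Average a per-token loss over agent-generated tokens only;
    environment-observation tokens are masked out of the update."""
    mask = np.asarray(is_agent_token, dtype=float)
    return float((np.asarray(per_token_loss) * mask).sum() / max(mask.sum(), 1.0))
```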

6. Impact on RL-Based Web Agents and Generalization

DUPO was demonstrated in the context of WebSailor (Li et al., 3 Jul 2025), where it contributed to:

  • Significant acceleration of RL training for web agents handling multi-step, tool-intensive reasoning tasks.
  • Substantial improvements in sample efficiency and stability when training on complex, ambiguous environments, enabling agents to match or exceed proprietary system capabilities on challenging benchmarks (such as BrowseComp).
  • Smoother convergence and more robust policy improvement in the face of sparse or highly variable reward signals.

This suggests that DUPO’s combination of dynamic filtering, sample duplication, and fine-grained (token-level) optimization is particularly advantageous in scenarios where agent–environment interaction is slow, rewards are sparse, and learning signals with high variance must be propagated stably.

7. Limitations and Practical Considerations

While DUPO achieves speedup and stability in specific settings, its effectiveness may hinge on the availability of sufficiently diverse batch groups; if all samples are trivial or reward variance is persistently low, the benefits of duplication may be diminished. A plausible implication is that for domains where all or most rollouts are low-variance or trivial, batch size or policy improvement per step may be limited. Moreover, hyperparameter selection (clipping bounds, minimum variance thresholds) and careful masking of non-generative tokens are necessary to fully realize the intended benefits.

Summary

Duplicating Sampling Policy Optimization (DUPO) constitutes an advancement in RL training methodology for slow, sparse-reward, and high-variance environments such as multi-turn web agents. By strategically filtering and duplicating diverse, high-value rollout samples within each batch, and optimizing a clipped token-level surrogate objective, DUPO improves both sample efficiency and convergence speed while maintaining learning stability. Its innovations complement, but are distinct from, earlier trust region, importance sampling, and distribution-matching approaches, allowing it to meet the practical demands of large-scale agent RL in modern LLM and information-seeking settings (Li et al., 3 Jul 2025).