DUPO RL: Duplicating Sampling Policy Optimization

Updated 11 July 2025
  • Duplicating Sampling Policy Optimization (DUPO) is an RL algorithm that dynamically filters and duplicates diverse rollout samples to enhance training efficiency in complex environments.
  • It integrates dual-stage sampling—including pre-training filtering and in-training duplication—to achieve 2–3× speedup and improved convergence over traditional methods.
  • DUPO’s token-level optimization and variance-aware filtering yield stable policy updates, making it ideal for multi-turn web agents facing slow rollouts and sparse rewards.

Duplicating Sampling Policy Optimization (DUPO) is a reinforcement learning (RL) training algorithm developed for efficient and stable policy optimization in complex, long-horizon environments, particularly in the context of web agent training for information-seeking tasks. DUPO integrates two dynamic sampling strategies, one applied before training and one during training, to maximize sample efficiency and accelerate convergence, primarily by strategically duplicating rollout samples that exhibit meaningful variance in outcomes. It is designed to overcome the sample inefficiency and bottlenecks inherent in conventional agent RL, particularly in settings with slow environment rollouts and sparse rewards (Li et al., 3 Jul 2025).

1. Objectives and Motivation

DUPO was introduced in the WebSailor system to address the unique challenges encountered in multi-turn, tool-using LLM agents that interact with web environments. Traditional RL rollouts in such domains are slow due to the asynchronous and often expensive interaction with external information sources. Moreover, the sparse reward structure typical in long-horizon, information-seeking tasks further hampers the stability and efficiency of policy optimization.

By integrating a dual-stage dynamic sampling mechanism, DUPO ensures that RL updates focus on high-value, diverse rollout scenarios. This is achieved by:

  • Filtering out trivial cases prior to training (those with little or no informational challenge, e.g., all-correct or all-incorrect rollouts), and
  • Dynamically duplicating, during training, samples from groups whose rewards have non-zero standard deviation, maintaining batch diversity and learning-signal strength while avoiding additional sequential environment rollouts to refill each batch.

This approach enables DUPO to achieve a 2–3× speedup relative to preceding dynamic-sampling schemes (such as DAPO), while also improving stability and convergence in policy optimization (Li et al., 3 Jul 2025).

2. Mathematical Formulation

The training objective in DUPO is defined at the token level and adopts a clipped policy-gradient structure inspired by modern PPO variants. Specifically, the objective is:

$$
J(\theta) = \mathbb{E}_{(q, y) \sim \mathcal{D},\ \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid \text{context})} \left\{ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \mathrm{clip}\!\left(r_{i,t}(\theta),\, 1 - \epsilon_\text{low},\, 1 + \epsilon_\text{high}\right) \hat{A}_{i,t} \right) \right\}
$$

Where:

  • $r_{i,t}(\theta) = \dfrac{\pi_\theta(o_{i,t} \mid \text{context})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid \text{context})}$ is the token-level importance sampling ratio,
  • $\hat{A}_{i,t} = \dfrac{R_i - \operatorname{mean}(\{R_i\}_{i=1}^G)}{\operatorname{std}(\{R_i\}_{i=1}^G)}$ is the group-relative advantage estimator, with $R_i$ denoting the reward for rollout $i$,
  • $\epsilon_\text{low}, \epsilon_\text{high}$ are the clipping hyperparameters.

This design ensures policy updates are constrained, analogous to PPO, and are computed only for non-trivial (diverse) sample groups.
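
As a concrete reference, the following is a minimal NumPy sketch of how this objective could be evaluated for one group of rollouts. The function name, argument layout, and the clipping values are illustrative assumptions for this article, not the WebSailor implementation.

```python
import numpy as np

def dupo_clipped_objective(logps_new, logps_old, advantages,
                           eps_low=0.2, eps_high=0.3):
    """Token-level clipped surrogate for one group of G rollouts (a sketch).

    logps_new, logps_old : lists of 1-D arrays, one per rollout o_i, holding
                           per-token log-probabilities of the sampled tokens
                           under pi_theta and pi_theta_old respectively.
    advantages           : length-G array of group-relative advantages A_i
                           (each A_i applies to every token of rollout i).
    eps_low, eps_high    : asymmetric clipping bounds (illustrative values).
    """
    total_tokens = sum(len(lp) for lp in logps_new)              # sum_i |o_i|
    objective = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))  # r_{i,t}(theta)
        clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
        objective += np.minimum(ratio * adv, clipped * adv).sum()
    return objective / total_tokens
```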

3. Dynamic Sampling and Duplication Mechanisms

DUPO’s learning batch is constructed via a two-tier sampling process:

  • Pre-Training Filtering: Overly simple or trivial cases—identified as having zero standard deviation in their group rewards (i.e., all rollouts correct/incorrect)—are removed prior to policy update. This pruning eliminates cases that would fail to provide a meaningful learning signal.
  • In-Training Duplication: To prevent batch size shrinkage and maximize computational efficiency, samples from other groups with non-zero reward standard deviation are duplicated to fill the batch as needed. This duplication replaces the slower alternative of requesting fresh rollouts from the environment, thus accelerating the training loop by leveraging challenging cases.

Only samples containing diversity (i.e., potential for policy improvement) are used to drive updates, and their availability is maximized through duplication rather than costly rollout generation.
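
A compact sketch of this filter-then-duplicate batch construction is given below. The data layout (a list of per-query groups, each carrying a "rewards" array) and all names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def build_dupo_batch(groups, batch_size, rng=None):
    """Filter-then-duplicate batch construction (illustrative sketch).

    groups : list of dicts, each holding the G rollout rewards of one query
             under the key "rewards" (this layout is an assumption).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Filtering: drop groups whose rollouts are all correct or all incorrect
    # (zero reward standard deviation), since they carry no learning signal.
    diverse = [g for g in groups if np.std(g["rewards"]) > 0.0]
    if not diverse:
        return []  # nothing informative to update on this step

    batch = list(diverse)
    # Duplication: refill the batch to its target size by re-sampling existing
    # diverse groups instead of requesting fresh (slow) environment rollouts.
    while len(batch) < batch_size:
        batch.append(diverse[rng.integers(len(diverse))])
    return batch
```

The key design choice is that the refill step touches no environment: duplicated groups reuse rollouts that already exist, which is what yields the speedup over schemes that top up batches with new rollouts.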

4. Comparison with Related Methods

DUPO distinguishes itself from established RL policy optimization methods through its batch construction and gradient estimation procedures:

  • PPO and TRPO rely on trust-region updating and clipping for monotonic improvement, but operate on fixed or dynamically generated episode batches, with no explicit duplication or filtering mechanism (1502.05477).
  • DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) also uses dynamic sampling but fills out batches via additional environment rollouts, potentially incurring added latency.
  • POIS (Metelli et al., 2018) and P3O (Fakoor et al., 2019) manage variance and control off-policy bias via importance sampling and surrogate objectives, but do not include explicit batch sample duplication as a means to speed up RL learning.
  • OMPO (Luo et al., 29 May 2024) and variance regularization approaches (Islam et al., 2022) focus on matching distributional statistics or explicitly penalizing high-variance updates, whereas DUPO’s unique contribution is the combination of group-relative advantage estimation with batch-filling sample duplication.

A comparative summary:

| Method | Batch Construction | Variance Control | Off-Policy Reuse | Sample Efficiency (per data) |
|---|---|---|---|---|
| DUPO | Dynamic + duplication | Group std filter | Rollouts only, re-duplication | 2–3× over DAPO |
| PPO/TRPO | Fixed/dynamic | Clipping/KL | No | Baseline |
| DAPO | Dynamic + new rollouts | N/A | New rollouts | Slower than DUPO |
| POIS/P3O | Importance sampling | IS penalty/ESS | Yes | Task dependent |
| OMPO | Buffer/discriminator | Divergence term | Yes | High (see Luo et al., 29 May 2024) |

5. Advantage Estimation and Loss Calculation

The group-relative advantage estimator is critical to DUPO’s effectiveness in long-horizon, multi-turn domains. By centering and scaling rollout rewards within a group:

  • Only samples with meaningful learning signal (i.e., non-trivial variance in outcome) are retained, with the rest filtered.
  • Token-level optimization is emphasized, and loss computation explicitly masks out environment observation tokens, focusing gradient updates on the generative (agent-driven) parts of the output.

This design is intended to enhance stability and provide strict normalization for cases in which reward signals are distributed very unevenly across the input space.
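
The sketch below illustrates these two ingredients together: group-relative advantage normalization and masking of environment-observation tokens. The helper names are assumptions for this article, and the small eps term is a numerical guard added here even though zero-variance groups are filtered out before this point.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (R_i - mean(R)) / std(R) within one group of G rollouts.
    eps guards against division by zero; in DUPO, zero-variance groups
    are already filtered out, so it rarely matters."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def masked_token_loss(per_token_loss, is_agent_token):
    """Average a per-token loss over agent-generated tokens only;
    environment-observation tokens are masked out of the update."""
    mask = np.asarray(is_agent_token, dtype=float)
    return float((np.asarray(per_token_loss) * mask).sum() / max(mask.sum(), 1.0))
```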

6. Impact on RL-Based Web Agents and Generalization

DUPO was demonstrated in the context of WebSailor (Li et al., 3 Jul 2025), where it contributed to:

  • Significant acceleration of RL training for web agents handling multi-step, tool-intensive reasoning tasks.
  • Substantial improvements in sample efficiency and stability when training on complex, ambiguous environments, enabling agents to match or exceed proprietary system capabilities on challenging benchmarks (such as BrowseComp).
  • Smoother convergence and more robust policy improvement in the face of sparse or highly variable reward signals.

This suggests that DUPO’s combination of dynamic filtering, sample duplication, and fine-grained (token-level) optimization is particularly advantageous in scenarios where agent–environment interaction is slow, rewards are sparse, and learning signals with high variance must be propagated stably.

7. Limitations and Practical Considerations

While DUPO achieves speedup and stability in specific settings, its effectiveness may hinge on the availability of sufficiently diverse batch groups; if all samples are trivial or reward variance is persistently low, the benefits of duplication may be diminished. A plausible implication is that for domains where all or most rollouts are low-variance or trivial, batch size or policy improvement per step may be limited. Moreover, hyperparameter selection (clipping bounds, minimum variance thresholds) and careful masking of non-generative tokens are necessary to fully realize the intended benefits.

Summary

Duplicating Sampling Policy Optimization (DUPO) constitutes an advancement in RL training methodology for slow, sparse-reward, and high-variance environments such as multi-turn web agents. By strategically filtering and duplicating diverse, high-value rollout samples within each batch, and optimizing a clipped token-level surrogate objective, DUPO improves both sample efficiency and convergence speed while maintaining learning stability. Its innovations complement, but are distinct from, earlier trust region, importance sampling, and distribution-matching approaches, allowing it to meet the practical demands of large-scale agent RL in modern LLM and information-seeking settings (Li et al., 3 Jul 2025).