
Duplicating Sampling Policy Optimization (DUPO)

Updated 4 July 2025
  • DUPO is a reinforcement learning optimization strategy that duplicates and filters batched rollouts to boost high-variance learning signals in complex, long-horizon tasks.
  • It employs pre-RL data filtering and in-batch duplication to focus training on samples with significant uncertainty and sparse rewards.
  • DUPO achieves 2–3x convergence speedups over prior dynamic sampling methods, improving performance in web-based information-seeking and similar domains.

Duplicating Sampling Policy Optimization (DUPO) is a reinforcement learning (RL) optimization strategy designed to address the challenges of sample efficiency, stable learning, and effective uncertainty reduction in complex, high-dimensional, and long-horizon decision-making tasks. It is particularly influential in domains such as web-based information-seeking agents, where multi-step reasoning and sparse, delayed rewards significantly amplify the difficulty of RL training. DUPO introduces a set of algorithmic mechanisms that adaptively duplicate and filter sampled trajectories or rollouts within a batch, maximizing the use of high-variance, high-signal data and accelerating the overall learning process.

1. Motivation and Problem Setting

In advanced RL applications—especially those involving LLM agents or continuous control in uncertain environments—standard on-policy update schemes are often highly data-inefficient. This inefficiency arises from several factors:

  • Sparse or delayed rewards: Rewards may be distributed sparsely across long trajectories, making useful learning signals rare.
  • Expensive or slow environment interaction: Each new batch of trajectories (rollouts) may involve interacting with external systems (e.g., web browsers) or expensive simulators.
  • Redundant or uninformative samples: Many generated samples or entire mini-batches could be uninformative (e.g., all correct, all failure cases), contributing little to policy improvement.

DUPO was created to address these bottlenecks by both prioritizing “hard” learning cases and ensuring that GPU or computational resources remain maximally utilized even when environment interaction is slow or limited (2507.02592).

2. Principle of Duplicating and Filtering Batches

DUPO builds on prior RL fine-tuning schemes, including vanilla PPO and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), through the following dual mechanisms (2507.02592):

a) Pre-RL Data Filtering:

Before RL optimization begins, DUPO filters out training instances where all generated rollouts are correct (too easy) or all are incorrect (ambiguous/failed). This focuses learning on high-uncertainty or “boundary” tasks where the policy stands to benefit most from updates.
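
A minimal sketch of this filtering step, assuming a list of (question, gold answer) instances, a rollout sampler, and an answer-equivalence judge; the function and parameter names are illustrative, not APIs from the paper:

```python
def filter_training_instances(instances, sample_rollout, is_equivalent, n_rollouts=8):
    """Keep only instances whose sampled rollouts disagree on correctness.

    instances      : list of (question, gold_answer) pairs
    sample_rollout : question -> one sampled answer from the current policy
    is_equivalent  : (gold_answer, answer) -> bool correctness judgment
    """
    kept = []
    for question, gold in instances:
        answers = [sample_rollout(question) for _ in range(n_rollouts)]
        n_correct = sum(is_equivalent(gold, a) for a in answers)
        # Drop "all correct" (too easy) and "all incorrect" (ambiguous/failed) instances.
        if 0 < n_correct < n_rollouts:
            kept.append((question, gold))
    return kept
```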

b) In-Batch Duplication:

During RL optimization, if a batch comprises some instances with high decision uncertainty (i.e., the standard deviation of returns, $\text{std}(\{R_i\}_{i=1}^G)$, is nonzero) and some that are trivial (all correct or all wrong), DUPO fills the computational batch by duplicating the nontrivial cases. This multiplexes learning signals, accelerates reward propagation, and prevents computational stalls, yielding 2–3x convergence speedups over prior dynamic sampling schemes in agentic RL (2507.02592).
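
A minimal sketch of this batch-filling step, assuming each rollout group records its per-rollout rewards; the dictionary layout and function name are illustrative assumptions:

```python
import random
import statistics

def fill_batch_with_duplicates(groups, batch_size, rng=random):
    """Assemble one optimization batch of rollout groups.

    groups     : list of dicts, each with a "rewards" list holding the
                 per-rollout rewards R_i for one question
    batch_size : number of group slots available in the computational batch

    Groups whose rewards all agree (zero std) carry no group-relative signal
    and are dropped; the freed slots are filled by duplicating the remaining
    high-variance groups.
    """
    nontrivial = [g for g in groups if statistics.pstdev(g["rewards"]) > 0]
    if not nontrivial:
        return []  # no informative groups this step
    batch = list(nontrivial)
    while len(batch) < batch_size:
        batch.append(rng.choice(nontrivial))  # duplicate a hard case
    return batch
```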

| Functionality | DAPO / Vanilla PPO | DUPO |
|---|---|---|
| Easy case filtering | Sequential | Pre-batch, high-throughput |
| Hard case duplication | Not used / inefficient | In-batch, hardware-optimal |
| Policy update signal | Diluted by trivial cases | Focused on high-variance cases |
| RL rollouts | Slow, environment bound | Maximally utilized GPUs |

3. Mathematical Objective and Loss Formulation

DUPO operationalizes its approach using a token-level, clipped PPO objective, modulated by a group-relative advantage and strict masking of environment observation tokens. Consider a batch instance $(q, y)$ paired with $G$ rollouts $\{o_i\}_{i=1}^G$ from the current policy:

$$\mathcal{J}(\theta) = \mathbb{E}_{(q, y),\, \{o_i\}_{i=1}^G} \left[ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min \left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon_\text{low},\, 1+\varepsilon_\text{high}\big)\, \hat{A}_{i,t} \right) \right]$$

subject to $0 < \big|\{o_i : \text{is\_equivalent}(y, o_i)\}\big| < G$, with $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid \text{context})}{\pi_{\theta_\text{old}}(o_{i,t} \mid \text{context})}$

$$\hat{A}_{i,t} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}$$

$$R_i = 0.1 \times R_i^{\text{format}} + 0.9 \times R_i^{\text{answer}}$$

Crucially, the loss is computed and backpropagated only over agent output tokens (thoughts and actions), not over copied environment observations.
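
A minimal PyTorch-style sketch of this objective for a single rollout group, assuming per-token log-probabilities and a precomputed agent-token mask; the function name, tensor layout, and clip values are illustrative assumptions rather than settings reported in the paper:

```python
import torch

def dupo_token_loss(logp_new, logp_old, rewards, loss_mask,
                    eps_low=0.2, eps_high=0.28):
    """Token-level clipped objective with group-relative advantages.

    logp_new, logp_old : [G, T] per-token log-probs under the current / rollout policy
    rewards            : [G]    float tensor of scalar rewards R_i per rollout
    loss_mask          : [G, T] 1.0 for agent-generated tokens, 0.0 for environment
                                observation tokens and padding
    eps_low / eps_high are illustrative clip ranges, not values from the paper.
    """
    # Group-relative advantage, broadcast to every token of rollout i.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)    # [G]
    adv = adv.unsqueeze(1)                                       # [G, 1]

    ratio = torch.exp(logp_new - logp_old)                       # [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    per_token = torch.min(unclipped, clipped) * loss_mask

    # Maximize the clipped objective => minimize its negative; normalize by
    # the number of trained (agent) tokens in the group.
    return -per_token.sum() / loss_mask.sum().clamp(min=1.0)
```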

4. Computational and Sample Efficiency

By duplicating informative instances within a batch and skipping those with zero learning signal, DUPO maximizes GPU utilization per rollout and sharply increases sample efficiency. It enables hardware-limited RL pipelines (e.g., for large language agents operating on the real web or in simulation environments) to complete more effective learning steps per unit of wall-clock time. The contrast with DAPO is notable: there, batch composition is typically bottlenecked by slow environment queries and sequential filtering (2507.02592).

Experiments reported in WebSailor (2507.02592) demonstrate:

  • 2–3x acceleration in convergence on BrowseComp benchmarks versus DAPO/vanilla PPO.
  • Improved Pass@1/Pass@3 on the most difficult information-seeking environments.
  • Stable training in the presence of long-horizon sparse rewards, avoiding RL collapse and mode collapse.

5. Uncertainty Reduction and Learning Signal

DUPO’s design focuses policy improvement on examples exhibiting high epistemic uncertainty—those where rollouts disagree in outcome. This prioritization is empirically and theoretically justified:

  • The group-relative advantage estimator emphasizes learning from “boundary” cases where the outcome is not deterministic across the policy’s samples, pushing the agent to distinguish the fine-grained decision points essential for reducing overall task uncertainty (a short numeric illustration follows this list).
  • By iteratively targeting high-uncertainty tasks, the agent learns to navigate ambiguous web environments, reduces exploration in highly uncertain areas efficiently, and develops superhuman information-synthesis strategies (2507.02592).
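
As a concrete numeric illustration of the first point above, the group-relative advantage vanishes when all rollouts in a group agree and is largest in magnitude for evenly split boundary groups (the reward values below are illustrative):

```python
import numpy as np

def group_advantages(rewards):
    """Group-relative advantages (R_i - mean) / std; zero signal if std == 0."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    return np.zeros_like(r) if std == 0 else (r - r.mean()) / std

print(group_advantages([1, 1, 1, 1]))  # all correct -> [0. 0. 0. 0.]  (no learning signal)
print(group_advantages([1, 1, 0, 0]))  # boundary    -> [ 1.  1. -1. -1.]
```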

6. Implementation Considerations and Trade-offs

Implementing DUPO requires:

  • Sufficient parallel rollout capability to support a meaningful number of rollouts $G$ per question.
  • Grouping of instances according to correct/incorrect rollout statistics, calculation of per-group standard deviation, and dynamic batch filling with duplicates as needed.
  • Masking constraints on observation tokens so that only agent outputs are trained on, matching correct credit assignment in agentic RL (see the sketch after this list).
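
A minimal sketch of the observation-token masking mentioned above, assuming each rollout token carries a role tag; the role names are illustrative assumptions:

```python
def build_loss_mask(token_roles):
    """Per-token loss mask: 1.0 for agent-generated tokens, 0.0 for observations.

    token_roles is a list of per-token role tags such as "thought", "action",
    or "observation"; only agent tokens contribute to the policy loss.
    """
    return [1.0 if role in ("thought", "action") else 0.0 for role in token_roles]

# Example: build_loss_mask(["thought", "action", "observation", "thought"])
#          -> [1.0, 1.0, 0.0, 1.0]
```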

A potential trade-off inherent in aggressive duplication is increased temporal correlation in policy updates; however, because only batches with genuine variance are duplicated, the effect on convergence or generalization is empirically negligible.

In highly resource-constrained or online settings, practitioners should calibrate batch sizes and duplication levels according to environment interaction cost and available parallel rollout heads.

7. Impact and Applications

DUPO is identified as the backbone RL optimization routine used in the WebSailor post-training pipeline for LLM-based web agents (2507.02592). Its techniques are crucial for matching and surpassing proprietary agentic systems like DeepResearch in BrowseComp and related benchmarks. The approach is applicable beyond web navigation—for any RL domain where environment interactions are expensive, batch-level learning signals are variable, and fast, stable optimization is essential.

Summary of core DUPO innovations (as described in WebSailor and related works):

| Innovation | Purpose | Effect |
|---|---|---|
| Pre-RL data filtering | Remove trivial cases | Focus on hard/uncertain tasks |
| In-batch duplication of hard cases | Maximize signal and resource use | Faster, more efficient RL |
| Group-relative advantage | Sharpen credit assignment | Robust, reliable updates |
| Masking environment observations | Correct policy credit assignment | Stable agentic RL |

8. Theoretical and Practical Extensions

DUPO’s methodology aligns with recent research on active importance sampling, batch-level sample reuse, and principled off-policy evaluation (2405.05630, 2512.15458), suggesting its mechanisms are extensible to:

  • Conservative or safe RL frameworks where high-confidence policy improvement is needed (2312.15458).
  • Preference-based, DPO-style fine-tuning where response diversity and duplication amplify the gradient signal (2506.04272).
  • Hybrid online-offline RL pipelines and long-horizon tasks where traditional sequential batch sampling is infeasible.

Its computational principles also inform the design of future RL frameworks combining dynamic data selection, advanced sample weighting, and batch-level policy optimization.

References

  • "WebSailor: Navigating Super-human Reasoning for Web Agent" (2507.02592)
  • PPO: Schulman et al., 2017
  • DAPO: Yu et al., 2025
  • "Policy Gradient with Active Importance Sampling" (2405.05630)
  • "Understanding the Impact of Sampling Quality in Direct Preference Optimization" (2506.04272)
  • FireAct: Chen et al., 2023