DAPO++: Enhanced DAPO Variants in RLVR

Updated 4 July 2026

DAPO++ is a descriptor for enhanced DAPO-family methods that integrate localized improvements such as refined token credit assignment, causal regularization, and accelerated rollout generation.
It addresses DAPO’s limitations by targeting coarse learning signals, overconfident error propagation, and insufficient diversity in credit assignment.
Variants like ACE-DAPO, CES, and GCPO demonstrate actionable improvements with measurable performance gains in reasoning-focused RLVR tasks.

“DAPO++” is not a standardized algorithm name in the cited literature. No cited paper defines a method explicitly called DAPO++. Instead, the label appears as an informal way to describe a stronger DAPO-family variant in the RLVR literature: a method that preserves DAPO’s rollout-grouped, PPO/GRPO-style optimization backbone and then augments a localized component such as negative-advantage computation, token-level entropy shaping, token credit assignment, or causal regularization. In that sense, “DAPO++” functions less as a canonical method name than as a compact descriptor for DAPO-compatible upgrades such as ACE-DAPO, CES, CFPO$_D$, F-DAPO, DAPO+KTAE, or GCPO-on-DAPO (Xu et al., 24 Feb 2026, Wei et al., 19 May 2026, Yu et al., 22 Jun 2026).

1. DAPO as the reference backbone in RLVR

Within reasoning-oriented RLVR, DAPO is described as a GRPO-family refinement for long chain-of-thought training. The papers cited here characterize it by a clustered set of design choices rather than by one single innovation: clip-higher or asymmetric clipping, dynamic sampling, and token-level or token-mean loss aggregation. One account describes DAPO as improving GRPO by “introducing a clip higher strategy to mitigate entropy collapse and employing token-mean normalization to compute rewards over token-level averages,” while another describes DAPO as removing the reference KL penalty, introducing decoupled clipping, using dynamic sampling, and using token-level policy gradient loss (Zhai et al., 5 Feb 2026, Yu et al., 22 Jun 2026).
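The interaction between clip-higher and token-mean aggregation can be illustrated with a minimal sketch. The function name, the list-of-lists layout, and the default bounds `eps_low=0.2` and `eps_high=0.28` are illustrative assumptions here, not a reference implementation from any of the cited papers.

```python
def clip_higher_token_mean(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with asymmetric bounds, averaged over all tokens.

    ratios[i][t]  : importance ratio pi_new / pi_old for token t of response i
    advantages[i] : one group-relative advantage per response, broadcast to its tokens
    """
    total, n_tokens = 0.0, 0
    for resp_ratios, adv in zip(ratios, advantages):
        for r in resp_ratios:
            # Asymmetric clipping: the upper bound is looser than the lower bound.
            clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)
            total += min(r * adv, clipped * adv)
            n_tokens += 1
    # Token-mean: divide by the token count of the whole group, so longer
    # responses contribute in proportion to their length.
    return total / n_tokens
```

With a positive advantage, a ratio above `1 + eps_high` is capped at that bound; with a negative advantage, a ratio below `1 - eps_low` is capped at the lower bound.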

A second recurring feature is that DAPO remains fundamentally a group-relative method. It samples multiple responses for a prompt, uses verifier rewards, and builds a group-normalized advantage. In CES, DAPO is the unchanged base objective into which reshaped token advantages are inserted; in ACE, DAPO is the unchanged pipeline except for the negative-advantage computation; in CFPO$_D$, DAPO is the host optimizer to which counterfactual regularization is added (Wei et al., 19 May 2026, Xu et al., 24 Feb 2026, Yu et al., 22 Jun 2026).

Dynamic sampling is especially important in this literature. CES states that standard DAPO discards all-correct or all-incorrect groups because sequence-level group-relative advantage would be zero, while CFPO gives the explicit DAPO-style condition

$0 < |\{o_i \mid is\_equivalent(\alpha, o_i)\}| < G.$

This makes DAPO a natural base for “DAPO++”-style work: its core rollout-grouped structure is stable enough that later methods can alter one module without rebuilding the training stack (Wei et al., 19 May 2026, Yu et al., 22 Jun 2026).
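The group-relative advantage and the dynamic-sampling condition above can be sketched in a few lines. The function names, the use of the population standard deviation, and the stabilizing constant `eps` are assumptions of this sketch rather than details confirmed by the cited papers.

```python
from statistics import mean, pstdev

def keep_group(correct):
    """Dynamic-sampling filter: keep a prompt only if 0 < #correct < G."""
    n_correct = sum(correct)
    return 0 < n_correct < len(correct)

def group_advantages(rewards, eps=1e-6):
    """Standardize verifier rewards within one prompt's group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For an all-correct or all-incorrect group, every centered reward is zero, so `group_advantages` returns zeros; `keep_group` is the filter that drops such groups before they reach the optimizer.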

2. Recurring limitations that motivate stronger DAPO variants

The papers do not agree on a single failure mode, but they converge on the view that plain DAPO remains too coarse in where and how it assigns learning signal. ACE argues that GRPO-style RLVR, including DAPO at the rollout level, uniformly penalizes all wrong rollouts, allowing overconfident errors to persist as “probability sinks” or “value traps.” CES argues that DAPO propagates one normalized advantage to all tokens and therefore cannot distinguish uncertain bottleneck tokens on correct trajectories from equally uncertain tokens on incorrect ones. GCPO and KTAE sharpen the same criticism further: DAPO and related methods use sample-level rewards yet broadcast a uniform credit signal to all tokens, even though only a subset of tokens are semantically decisive (Xu et al., 24 Feb 2026, Wei et al., 19 May 2026, Li et al., 28 May 2026, Sun et al., 22 May 2025).

A second line of criticism concerns coverage and diversity. F-GRPO shows that practical group sizes can produce active updates that still miss rare-correct modes, so the policy may learn the obvious and forget the rare. ACE links a related phenomenon to overconfident incorrect trajectories that absorb probability mass. Both arguments point toward the same large-$k$ symptom: better sharpness at small $k$, weaker preservation of broad reasoning coverage at large $k$ (Plyusov et al., 6 Feb 2026, Xu et al., 24 Feb 2026).
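The small-$k$ versus large-$k$ tension can be made concrete with a toy calculation. The sketch assumes independent samples with a fixed per-problem success probability, and the two hypothetical policies below are invented for illustration; none of the numbers come from the cited papers.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

def mean_pass_at_k(success_probs, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in success_probs) / len(success_probs)

# Hypothetical per-problem success probabilities for two policies.
sharp = [0.9, 0.9, 0.0, 0.0]   # confident on some problems, no coverage on others
broad = [0.5, 0.5, 0.1, 0.1]   # less confident, but nonzero mass everywhere
```

Under these assumptions the sharp policy scores higher at $k=1$, while the broad policy scores higher at $k=32$, because any problem with zero success probability contributes nothing regardless of how many samples are drawn.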

A third line of criticism concerns gradient weighting and normalization. REAL diagnoses “Gradient Misassignment in Positives” and “Gradient Domination in Negatives,” arguing that reward-weighted importance-ratio updates allocate gradient in the wrong places. “$\Delta L$ Normalization” argues that DAPO’s length normalization is biased and high-variance under dynamic generation lengths. These papers treat the problem not as a minor engineering flaw but as a structural issue in how DAPO-family objectives weight trajectories and tokens (Zhai et al., 5 Feb 2026, He et al., 9 Sep 2025).

3. DAPO-compatible enhancement layers

The strongest “DAPO++” interpretations are methods that leave the DAPO backbone intact and modify a local component.

Variant	Local change	Reported effect
ACE-DAPO (Xu et al., 24 Feb 2026)	Replaces only the negative advantage with confidence-aware penalization of overconfident errors	Pass@32 improves over DAPO by +1.2 to +1.7 pp
CES (Wei et al., 19 May 2026)	Reshapes token advantages using correctness-conditioned entropy on selected high-entropy tokens; DAPO objective stays	69.6 / 2376 to 72.1 / 1965 on 12 math benchmarks
CFPO$_D$ (Yu et al., 22 Jun 2026)	Adds counterfactual regularization $+\gamma KL_{cf}$ and entropy stabilization to DAPO	Overall average 55.60 to 58.49
F-DAPO (Plyusov et al., 6 Feb 2026)	Scales group-relative advantage by $g(x)=(1-\widehat{\mu}_{\mathrm{pos}}(x))^\gamma$	Qwen2.5-7B in-domain avg 39.4 / 69.3 to 40.5 / 72.5
DAPO+KTAE (Sun et al., 22 May 2025)	Replaces uniform per-token advantage with model-free key-token advantage estimation	Outperforms baseline methods across five mathematical reasoning benchmarks
GCPO on DAPO (Li et al., 28 May 2026)	Uses positive/negative prompt contrast to form token-level advantages	Qwen3-VL-Instruct-8B: 76.5 / 44.4 / 75.1 / 60.0 / 50.5 to 84.1 / 56.9 / 81.3 / 63.0 / 55.3
SRT with DAPO (Chang et al., 14 Jan 2026)	Changes rollout generation only via speculative decoding with tree-structured cache	DAPO generation time 44.1 s to 31.5 s; step time 81.7 s to 68.7 s

These variants differ in the locus of intervention. ACE is a rollout-level change to negative-sample treatment. CES is a token-level uncertainty controller that leaves the DAPO clipped objective unchanged but replaces the token advantage. CFPO$_D$ is a regularization layer that adds counterfactual visual grounding pressure. F-DAPO is prompt-level reweighting intended to protect rare-correct modes. KTAE and GCPO both address fine-grained token credit assignment, but KTAE does so with within-group token statistics and no extra model, whereas GCPO uses a prompt-contrastive forward computation. SRT is orthogonal: it preserves DAPO’s math and accelerates rollout generation rather than changing credit assignment or reward shaping (Xu et al., 24 Feb 2026, Wei et al., 19 May 2026, Yu et al., 22 Jun 2026, Plyusov et al., 6 Feb 2026, Sun et al., 22 May 2025, Li et al., 28 May 2026, Chang et al., 14 Jan 2026).

This diversity is central to the meaning of “DAPO++”. The term does not denote one fixed algorithmic delta. It usually denotes a localized strengthening of DAPO at one of three sites: advantage computation, regularization, or rollout systems.
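As one concrete case of a localized change at the advantage-computation site, the F-DAPO factor listed in the table, $g(x)=(1-\widehat{\mu}_{\mathrm{pos}}(x))^\gamma$, can be sketched as follows. Estimating $\widehat{\mu}_{\mathrm{pos}}(x)$ as the fraction of correct rollouts in the current group is an assumption of this sketch, as are the function names and the default `gamma`.

```python
def f_dapo_scale(correct, gamma=1.0):
    """Prompt-level factor g(x) = (1 - mu_pos)^gamma.

    mu_pos is estimated here as the fraction of correct rollouts in the
    group, which is an assumption of this sketch.
    """
    mu_pos = sum(correct) / len(correct)
    return (1.0 - mu_pos) ** gamma

def scaled_advantages(advantages, correct, gamma=1.0):
    """Multiply every group-relative advantage for this prompt by g(x)."""
    g = f_dapo_scale(correct, gamma)
    return [g * a for a in advantages]
```

The factor shrinks toward zero as a prompt becomes mostly solved and stays near one for prompts where correct rollouts are rare, so the rest of the DAPO update is untouched apart from this per-prompt rescaling.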

4. Objective replacements and generalizations beyond additive DAPO upgrades

Some papers go further and propose methods that are not merely DAPO-compatible layers, but stronger replacements or unifying generalizations.

REAL reformulates RLVR from a classification perspective. Rewards become categorical labels rather than scalar weights, rollout scores become length-normalized log-ratio logits, and training minimizes an anchored classification loss instead of a DAPO/GRPO-style reward-weighted surrogate. On DeepScaleR-Preview-Dataset, REAL reports average Pass@1 0.526 versus DAPO 0.459 for the 1.5B model, and 0.632 versus 0.570 for 7B (Zhai et al., 5 Feb 2026).

“$\lambda$-GRPO” argues that GRPO, DAPO, and Dr. GRPO are all instances of one aggregation family parameterized by a sample-level weight, with DAPO as one fixed case of that weight. Its contribution is to learn a scalar $\lambda$ controlling token preference rather than hard-coding DAPO’s token aggregation rule. It reports Qwen2.5 averages of 37.8 vs 36.5 at 1.5B, 43.8 vs 42.6 at 3B, and 53.5 vs 51.9 at 7B relative to DAPO (Wang et al., 8 Oct 2025).

“$\Delta L$ Normalization” treats the problem as one of minimum-variance unbiased estimation under dynamic generation lengths. It argues that DAPO’s aggregation yields a length-dependent expectation and a high coefficient of variation, and replaces it with inverse-length weighting, identifying one setting of that weighting as the minimum-variance case under the paper’s assumptions. On Qwen2.5-7B Math weighted average, it reports 0.592 versus DAPO Norm 0.578 (He et al., 9 Sep 2025).
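The appeal of inverse-length weighting can be checked numerically with the standard inverse-variance argument. The sketch assumes independent per-response estimates whose variance is proportional to response length, and it normalizes the weights to sum to one; the exponent `alpha`, the function names, and both assumptions are illustrative and are not the paper's exact formula.

```python
def inverse_length_weights(lengths, alpha=1.0):
    """Normalized weights proportional to L_i ** (-alpha); alpha=0 is uniform."""
    raw = [float(L) ** (-alpha) for L in lengths]
    z = sum(raw)
    return [w / z for w in raw]

def combined_variance(weights, lengths):
    """Variance of sum_i w_i * e_i for independent e_i with Var(e_i) = L_i
    (the proportionality constant is set to one for illustration)."""
    return sum(w * w * L for w, L in zip(weights, lengths))
```

For two responses of lengths 100 and 400, uniform weights (`alpha=0`) give a combined variance of 125 under these assumptions, while inverse-length weights (`alpha=1`) give 80, which is the lowest value any weights summing to one can reach in this toy setting.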

These methods are not merely “DAPO with one extra term.” They challenge the weighting, normalization, or reward semantics of the DAPO family itself. In that stricter sense, they are often better described as post-DAPO generalizations than as DAPO++ layers.

5. Theoretical lenses on DAPO-family behavior

A recent unification result states that GRPO, Dr. GRPO, and DAPO are “three operations on one number,” the group reward standard deviation $\sigma$. For binary rewards, it characterizes the per-prompt GRPO update in terms of $\sigma$, which is zero exactly when the group is unanimous. In this view, GRPO divides by $\sigma$, Dr. GRPO removes that division, and DAPO discards the $\sigma = 0$ groups. The immediate implication is that split groups teach most, while unanimous groups are silent (Bay et al., 30 Jun 2026).
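For binary rewards, the group reward standard deviation has a simple closed form: if a fraction $p$ of the rollouts is correct, the population standard deviation is $\sqrt{p(1-p)}$. The sketch below uses that fact; the function names are hypothetical, and the choice of the population (rather than sample) standard deviation is an assumption.

```python
from math import sqrt

def binary_group_std(n_correct, group_size):
    """Population standard deviation of 0/1 rewards with n_correct ones."""
    p = n_correct / group_size
    return sqrt(p * (1.0 - p))

def is_silent(n_correct, group_size):
    """A unanimous group has zero reward spread and yields no group-relative signal."""
    return binary_group_std(n_correct, group_size) == 0.0
```

The spread is largest for an evenly split group and vanishes for all-correct or all-incorrect groups, which matches the reading that split groups carry the most signal.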

Other theories isolate different pathologies. ACE defines a confidence shift and shows that its added gradient can be interpreted as a selective reverse-KL-style regularizer restricted to overconfident incorrect trajectories, with a tempered stop-gradient residual (Xu et al., 24 Feb 2026).

REAL proves that its classification-style update gives bounded, monotone gradient weighting, thereby countering the positive-misassignment and negative-domination behavior attributed to GRPO/DAPO-style reward weighting (Zhai et al., 5 Feb 2026). “$\Delta L$ Normalization” shows that, under the paper’s assumptions, inverse-length weighting is the unique minimum-variance solution among linear unbiased estimators (He et al., 9 Sep 2025).

F-GRPO contributes a complementary finite-group analysis. It derives an exact probability for an active update that still misses a rare-correct subset, which formalizes why practical group sizes can remain mixed yet still reinforce only common correct solutions (Plyusov et al., 6 Feb 2026).
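One way to write such a probability is sketched below. It assumes $N$ i.i.d. rollouts, a total correct mass `mu_pos`, and a rare-correct mass `tau` with `tau <= mu_pos`; these symbols and the function name are assumptions of this sketch, and the expression is derived from those assumptions rather than copied from the paper. The event "no rollout hits the rare subset" has probability $(1-\tau)^N$; removing the two inactive cases inside it (all rollouts wrong, or all rollouts in the common-correct region) leaves the active-but-missing case.

```python
def p_active_but_misses_rare(mu_pos, tau, n):
    """P(group is mixed AND no rollout lands in the rare-correct subset).

    mu_pos : total probability mass on correct responses
    tau    : mass on the rare-correct subset (tau <= mu_pos)
    n      : number of i.i.d. rollouts in the group
    """
    if not (0.0 <= tau <= mu_pos <= 1.0) or n < 1:
        raise ValueError("require 0 <= tau <= mu_pos <= 1 and n >= 1")
    no_rare = (1.0 - tau) ** n                 # every rollout avoids the rare subset
    all_wrong = (1.0 - mu_pos) ** n            # inactive: unanimous failure
    all_common_correct = (mu_pos - tau) ** n   # inactive: unanimous common success
    return no_rare - all_wrong - all_common_correct
```

With a single rollout the value is zero, since a group of one can never be mixed.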

Taken together, these theories imply that “DAPO++” work is usually about where learning signal should concentrate: on disagreement-rich groups, overconfident wrong trajectories, rare-correct prompts, hard positives, or key tokens.

6. Acronym ambiguity and scope outside RLVR

The term is further complicated by the fact that DAPO is not unique to RLVR. In dialogue modeling, DAPO denotes Dialogue-adaptive Pre-training Objectives, an ELECTRA-large-based quality-regression method over dialogue coherence corruption and 3-NIDF rescoring (Li et al., 2020). In high-level synthesis, DAPO denotes Design Structure-Aware Pass Ordering, which combines heterogeneous IR graphs, contrastive learning, Light-HLS estimation, and PPO, and reports an average speedup over Vitis HLS on pragma-annotated designs (Ge et al., 12 Dec 2025). In code-editing systems, DAPO denotes Dynamic sAmpling Policy Optimization as a post-SFT RL refinement stage; the deployed Qwen3-4B+SFT+DAPO NES model reports 75.6% and 81.6% accuracy for two next-edit location tasks and 91.36% ES with 27.7% EMR for edit generation (Chen et al., 4 Aug 2025).

There is also a distinct diffusion-inverse-problems method named DAPS++, and that paper explicitly states that “DAPO++” is very likely a typo or confusion with DAPS++ rather than the name of a separate method in that literature (Chen et al., 21 Nov 2025).

This suggests that “DAPO++” should not be read as a canonical standalone algorithm unless the surrounding paper defines it explicitly. In current arXiv usage, the phrase most often denotes either an informal RLVR shorthand for a strengthened DAPO-family method or a nomenclature ambiguity that must be resolved from context.