Deliberate Practice Policy Optimization (DPPO)
- DPPO is a metacognitive framework that alternates reinforcement learning and supervised fine-tuning to target model weaknesses and enhance embodied intelligence.
- The method employs a metaloop where RL exposes deficits and SFT repairs them, yielding over a 20% performance improvement on key embodied benchmarks.
- It integrates difficulty-aware sampling and replay data to address scarce supervision and prevent catastrophic forgetting in high-dimensional learning environments.
Deliberate Practice Policy Optimization (DPPO) is a metacognitive training framework for embodied vision-LLMs that alternates between reinforcement learning (RL) and supervised fine-tuning (SFT) in an iterative “Metaloop,” with RL used for weakness revelation and SFT used for weakness refinement and competence expansion (Zhang et al., 20 Nov 2025). In the formulation introduced for Pelican-VL 1.0, DPPO is designed for embodied intelligence under sparse, finite, and expensive supervision, where static post-training is inadequate because “insufficient expert data prevents SFT from generalizing, while the large search space makes RL insufficiently constrained” (Zhang et al., 20 Nov 2025). The framework is presented as a deliberate-practice analogue for multimodal agents: the model interacts, fails in informative ways, identifies failure modes, receives targeted supervision, updates itself, and repeats (Zhang et al., 20 Nov 2025). Empirically, the same work reports that applying DPPO to a Qwen2.5-VL base model to produce Pelican-VL 1.0 yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6% on the paper’s embodied benchmark suite (Zhang et al., 20 Nov 2025).
1. Conceptual definition and problem setting
DPPO was introduced in the context of embodied intelligence, where a model processes multimodal inputs $x = (v, t)$, with $v$ denoting visual input and $t$ text, and predicts outputs $y$ such as action sequences, reasoning traces, or function calls (Zhang et al., 20 Nov 2025). The base supervised formulation is written as
$$\theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{(v, t, y) \sim \mathcal{D}} \left[ \log \pi_\theta(y \mid v, t) \right],$$
but the paper emphasizes that the apparent simplicity of this objective masks difficult embodied requirements, including affordance reasoning, causal-temporal inference, long-horizon planning, grounding, and human instruction following (Zhang et al., 20 Nov 2025).
The motivating claim is that embodied AI is constrained by two bottlenecks. The first is embodied data scarcity and cost: real-world embodied data is finite, expensive, and difficult to curate even when supplemented by web, simulation, and robot data. The second is algorithmic inefficiency: existing methods remain resource-hungry and do not adaptively focus training on missing capabilities (Zhang et al., 20 Nov 2025). Within that framing, DPPO is proposed as a way to allocate limited training resources to the model’s weak points rather than scaling training uniformly.
The same work argues that neither SFT alone nor RL alone is sufficient in this regime. SFT is described as stable and useful for “static knowledge infusion,” but limited by the scope of expert data; RL can explore, but is unstable in high-dimensional multimodal embodied settings with sparse signals and can collapse when prerequisite knowledge is missing (Zhang et al., 20 Nov 2025). DPPO is therefore defined not as a one-off RL-then-SFT recipe, but as an iterative alternating optimization process that uses RL to expose deficits and SFT to repair them.
A closely related model report, Pelican-VL 1.0, describes the same framework as an “RL-Refine-Diagnose-SFT loop” and as a metacognitive mechanism inspired by human deliberate practice (Zhang et al., 30 Oct 2025). This description is consistent with the more detailed method paper, but the formal procedural specification is given in the latter (Zhang et al., 20 Nov 2025).
2. Metaloop structure and operational cycle
The defining feature of DPPO is the “Metaloop,” an iterative cycle that alternates between competence expansion and skill refinement (Zhang et al., 20 Nov 2025). RL serves as weakness detection or weakness revelation; SFT serves as weakness refinement and competence expansion. This division of labor is central to the framework’s interpretation. RL is not treated solely as reward maximization, and SFT is not treated solely as passive imitation. Instead, each phase has a distinct epistemic role within a closed-loop self-improvement process (Zhang et al., 20 Nov 2025).
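Algorithm 1 in the paper specifies each metaloop iteration in three stages. In the RL phase, for each sample $x_i$, the current policy $\pi_\theta$ generates $K$ trajectories $\{y_{i,k}\}_{k=1}^{K}$, a success-rate score $s_i$ is computed, the pair $(x_i, s_i)$ is added to a difficulty-aware buffer $\mathcal{B}$, an RL dataset $\mathcal{D}_{\mathrm{RL}}$ is built by difficulty-aware rebalancing, GRPO is applied on minibatches from $\mathcal{D}_{\mathrm{RL}}$, a task stagnation score $S_{\mathrm{task}}$ is computed, and a weak set $\mathcal{D}_{\mathrm{weak}}$ is built using stopping and filtering rules (Zhang et al., 20 Nov 2025). In the SFT phase, the method builds $\mathcal{D}_{\mathrm{fail}}$ from failure cases, $\mathcal{D}_{\mathrm{rel}}$ from related embodied samples, and $\mathcal{D}_{\mathrm{replay}}$ from replay data, and then forms
$$\mathcal{D}_{\mathrm{SFT}} = \mathcal{D}_{\mathrm{fail}} \cup \mathcal{D}_{\mathrm{rel}} \cup \mathcal{D}_{\mathrm{replay}}.$$
The model is then trained with an SFT loss over $\mathcal{D}_{\mathrm{SFT}}$, after which the difficulty buffer $\mathcal{B}$ is emptied and the next discovery cycle begins (Zhang et al., 20 Nov 2025).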
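The control flow of one metaloop iteration can be sketched as follows. This is a minimal illustration, not the paper's implementation: the callables `rollout`, `rl_update`, `sft_update`, and `related` are hypothetical stand-ins for the policy, the GRPO step, the SFT step, and the retrieval of related embodied data, and the weak-set rule used here (success rate still below 1 after RL) is an assumption standing in for the paper's stopping and filtering rules.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = str  # hypothetical sample identifier


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful rollouts for one sample."""
    return sum(outcomes) / len(outcomes)


def metaloop_iteration(
    samples: List[Sample],
    rollout: Callable[[Sample], Sequence[bool]],
    rl_update: Callable[[List[Sample]], None],
    sft_update: Callable[[List[Sample]], None],
    related: Callable[[List[Sample]], List[Sample]],
    replay: List[Sample],
) -> Dict[str, List[Sample]]:
    """One RL-then-SFT cycle; all callables are hypothetical stand-ins."""
    # RL phase: score every sample and log it in a difficulty-aware buffer.
    buffer: List[Tuple[Sample, float]] = [
        (x, success_rate(rollout(x))) for x in samples
    ]

    # Simplified rebalancing: drop samples that are already mastered.
    rl_data = [x for x, s in buffer if s < 1.0]
    rl_update(rl_data)

    # Assumed weak-set rule: anything still not fully solved after RL.
    weak = [x for x in rl_data if success_rate(rollout(x)) < 1.0]

    # SFT phase: failures, related embodied samples, and replay data.
    sft_data = weak + related(weak) + replay
    sft_update(sft_data)

    # Empty the buffer so the next discovery cycle starts fresh.
    buffer.clear()
    return {"rl": rl_data, "weak": weak, "sft": sft_data}
```

The point of the sketch is the data flow: the SFT dataset is not fixed in advance but is rebuilt in every cycle from whatever the RL phase failed to solve.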
The workflow figures in the paper summarize this procedure as an RL-SFT loop built around rollout logging and difficulty-aware sampling for dynamic data curation (Zhang et al., 20 Nov 2025). Pelican-VL 1.0 describes the same pattern as a metaloop that teaches the model to practice deliberately (Zhang et al., 30 Oct 2025). This suggests that the authors regard the alternation itself, rather than any single optimizer, as the framework’s main contribution.
A practical implication is that training data are no longer static. Rollout-derived failures become supervision targets, weak capability dimensions trigger retrieval of associated embodied data, and general replay data is retained to control forgetting (Zhang et al., 20 Nov 2025). The method paper also mentions a teacher-model-driven path in which weaknesses found by RL can be sent to a stronger teacher such as InternVL 3.5 to generate high-quality reference solutions, which are then distilled via SFT (Zhang et al., 20 Nov 2025).
3. Formal objective and unified preference-learning view
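DPPO is formalized with a phase selector $\alpha_t \in \{0, 1\}$, so that only one objective is active at iteration $t$:
$$\mathcal{L}_{\mathrm{DPPO}}^{(t)}(\theta) = \alpha_t\, \mathcal{L}_{\mathrm{RL}}\big(\theta; \mathcal{D}_{\mathrm{RL}}^{(t)}\big) + (1 - \alpha_t)\, \mathcal{L}_{\mathrm{SFT}}\big(\theta; \mathcal{D}_{\mathrm{SFT}}^{(t)}\big).$$
This expresses the metaloop as alternating optimization over two dynamically constructed datasets: rollout-derived hard-but-learnable RL data and weakness-targeted SFT data plus replay (Zhang et al., 20 Nov 2025).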
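The RL component is instantiated with GRPO. The paper writes the policy gradient as
$$\nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta) = -\,\mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\left[ w(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right],$$
where $w(x, y)$ is a normalized reward weight derived from a rule-based scoring function comparing the current policy $\pi_\theta$ to a reference policy $\pi_{\mathrm{ref}}$ (Zhang et al., 20 Nov 2025). The paper’s emphasis is that RL is used as exploratory diagnosis: the reward signal is a mechanism for exposing deficits, not merely for maximizing task score.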
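To make the idea of a normalized reward weight concrete, the sketch below computes the group-relative normalization commonly used in GRPO: rewards for several rollouts of the same prompt are centered and scaled by the group's mean and standard deviation. Whether the paper's weight uses exactly this form is an assumption; the function name and the `eps` parameter are illustrative.

```python
import math
from typing import List


def group_normalized_weights(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center and scale rewards within one group of rollouts (assumed GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


mixed = group_normalized_weights([1.0, 0.0, 0.0, 1.0])  # successes up, failures down
uniform = group_normalized_weights([1.0, 1.0, 1.0])     # every weight is 0.0
```

Under this normalization, a group in which every rollout succeeds or every rollout fails produces zero weights and therefore no gradient, which is one way to see why uniformly solved or uniformly failed samples contribute little to RL.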
Each rollout $y$ is assigned a composite reward that combines a format term $r_{\mathrm{fmt}}(y)$ and a task term $r_{\mathrm{task}}(y)$, where $r_{\mathrm{fmt}}$ checks structural validity, including the required reasoning-trace format and final answer, and $r_{\mathrm{task}}$ is a task-specific rule-based reward (Zhang et al., 20 Nov 2025). The paper states that rule-based multi-task rewards cover six objective families: affordance reasoning, counting and distance estimation, causal and temporal reasoning, task success evaluation, task planning, and task prediction (Zhang et al., 20 Nov 2025).
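A rule-based composite reward of this kind might look like the sketch below for a counting task. Everything specific here is an assumption made for illustration: the `<think>…</think><answer>…</answer>` template, the exact-match counting rule, and the choice to add the two terms are not taken from the paper.

```python
import re

# Hypothetical output template: a reasoning trace followed by a final answer.
_TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def format_reward(output: str) -> float:
    """1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.match(output.strip()) else 0.0


def counting_reward(output: str, target: int) -> float:
    """1.0 if the final answer is an integer equal to the target count."""
    match = _TEMPLATE.match(output.strip())
    if match is None:
        return 0.0
    try:
        return 1.0 if int(match.group(2).strip()) == target else 0.0
    except ValueError:
        return 0.0


def composite_reward(output: str, target: int) -> float:
    """Assumed additive combination of format and task rewards."""
    return format_reward(output) + counting_reward(output, target)
```

With these definitions, a well-formed correct answer scores 2.0, a well-formed wrong answer scores 1.0, and an answer without the reasoning trace scores 0.0, so the format term keeps the trace intact even when the task term is not yet satisfied.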
The SFT side is standard negative log-likelihood over expert labels, but DPPO assigns it a new function: not general post-training over a fixed corpus, but targeted remediation using failures, related embodied data, and replay (Zhang et al., 20 Nov 2025).
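A notable theoretical claim is that DPPO can be understood as unified preference learning. The paper gives the universal objective
$$\max_{\theta}\; \mathbb{E}_{(x, \mathcal{P}) \sim \mathcal{D}_{\mathrm{pref}}}\left[ \log P_\theta(\mathcal{P} \mid x) \right],$$
where $\mathcal{D}_{\mathrm{pref}}$ is a dataset of preference samples, $\mathcal{P}$ is a preference expression, and $P_\theta(\mathcal{P} \mid x)$ models that preference under policy $\pi_\theta$ (Zhang et al., 20 Nov 2025). In this view, SFT and GRPO differ only in the form of $\mathcal{P}$ and the associated preference model. For SFT, $\mathcal{P} = y^{*}$ is a single optimal expert trajectory and
$$P_\theta(\mathcal{P} \mid x) = \pi_\theta(y^{*} \mid x),$$
which recovers maximum-likelihood training (Zhang et al., 20 Nov 2025). For RL, $\mathcal{P} = (y_{(1)} \succ y_{(2)} \succ \cdots \succ y_{(K)})$ is a ranking over trajectories, modeled with a Plackett-Luce distribution,
$$P_\theta(\mathcal{P} \mid x) = \prod_{k=1}^{K} \frac{\exp\big(r_\theta(x, y_{(k)})\big)}{\sum_{j=k}^{K} \exp\big(r_\theta(x, y_{(j)})\big)},$$
with an implicit reward $r_\theta(x, y)$ defined through the policy $\pi_\theta$, yielding a GRPO-style objective under the stated modeling assumptions (Zhang et al., 20 Nov 2025).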
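For readers unfamiliar with the Plackett-Luce model, the following sketch evaluates the log-probability of a ranking given per-trajectory scores listed best-first. It illustrates the standard Plackett-Luce likelihood only; how the scores are obtained from the policy is left open, and the function name is illustrative.

```python
import math
from typing import Sequence


def plackett_luce_log_prob(scores_best_first: Sequence[float]) -> float:
    """Log-probability of a ranking under Plackett-Luce.

    scores_best_first[k] is the score of the item ranked at position k.
    At each position, the chosen item competes against all items not yet placed.
    """
    total = 0.0
    for k in range(len(scores_best_first)):
        remaining = scores_best_first[k:]
        # Stable log-sum-exp over the items still available at position k.
        top = max(remaining)
        log_norm = top + math.log(sum(math.exp(s - top) for s in remaining))
        total += scores_best_first[k] - log_norm
    return total


agree = plackett_luce_log_prob([2.0, 1.0, 0.0])     # ranking matches the scores
disagree = plackett_luce_log_prob([0.0, 1.0, 2.0])  # ranking reverses the scores
```

A ranking that agrees with the scores receives a higher log-probability than one that reverses them, and a ranking over a single item has log-probability zero because there is nothing to compare against.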
This formalization does not establish a numbered theorem with proof, but it does give a coherent explanation for the alternation: SFT learns from positive exemplars and is stable; GRPO learns from comparative preferences over better and worse trajectories and is diagnostic. The synergy claimed by DPPO is that one phase injects competence while the other exposes where competence is missing (Zhang et al., 20 Nov 2025).
4. Difficulty-aware sampling, weakness diagnosis, and stopping criteria
A central algorithmic component in DPPO is difficulty-aware sampling. For each sample $x_i$, the success rate is computed as
$$s_i = \frac{1}{K} \sum_{k=1}^{K} o_{i,k},$$
where $o_{i,k} \in \{0, 1\}$ is the outcome of the $k$-th rollout (Zhang et al., 20 Nov 2025). The paper interprets $s_i = 1$ as mastered, $s_i = 0$ as complete failure that is often too hard for productive RL, and $0 < s_i < 1$ as partially learned and especially informative (Zhang et al., 20 Nov 2025). The RL dataset is then rebalanced by discarding all samples with 100% success rate and capping the number of complete failures so that they do not exceed the number of partial-success samples (Zhang et al., 20 Nov 2025).
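The rebalancing rule described above can be sketched in a few lines. The function and variable names are illustrative, and the choice of which complete failures to keep (a seeded random subset) is an assumption, since the text only specifies the cap.

```python
import random
from typing import Dict, List, Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Mean rollout outcome for one sample."""
    return sum(outcomes) / len(outcomes)


def rebalance(rates: Dict[str, float], seed: int = 0) -> List[str]:
    """Build an RL dataset from per-sample success rates.

    Mastered samples (rate 1.0) are discarded, partial successes are kept,
    and complete failures (rate 0.0) are capped at the number of partial
    successes. Which failures survive the cap is chosen at random here.
    """
    partial = [x for x, s in rates.items() if 0.0 < s < 1.0]
    failed = [x for x, s in rates.items() if s == 0.0]
    rng = random.Random(seed)
    rng.shuffle(failed)
    return partial + failed[: len(partial)]


rates = {"a": 1.0, "b": 0.5, "c": 0.0, "d": 0.0, "e": 0.25, "f": 0.0}
rl_dataset = rebalance(rates)  # keeps "b", "e", and two of the three failures
```

One consequence of the cap as written is that a batch with no partial successes yields an empty RL dataset, which matches the view that such a batch carries little usable RL signal.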
This mechanism places DPPO near hard-example mining and curriculum learning, but in a closed-loop form rather than as a fixed curriculum. The paper explicitly draws connections to hard-example mining, curriculum learning, and active data selection, while distinguishing DPPO from static coreset selection by making the sampling process capability-dependent and rollout-conditioned (Zhang et al., 20 Nov 2025).
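The RL stopping criterion is also formalized. First, a success-rate change $\Delta s_i$ is defined for each sample from its success rates at successive evaluations. A per-sample stagnation score is then derived from $\Delta s_i$, and the task-level stagnation score $S_{\mathrm{task}}$ aggregates the per-sample scores over the task. The RL stage terminates automatically when $S_{\mathrm{task}}$ reaches a preset threshold.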
The intuition given in the paper is that always-solved or always-failed cases provide little learning signal, and cases showing no progress should not continue consuming compute (Zhang et al., 20 Nov 2025).
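One way to turn this intuition into code is sketched below. The concrete rules are assumptions chosen for illustration, not the paper's definitions: a sample counts as stagnant if its success rate is saturated at 0 or 1 or changed by less than `tol` since the previous evaluation, the task-level score is the fraction of stagnant samples, and the `threshold` and `tol` defaults are hypothetical values.

```python
from typing import Dict


def is_stagnant(prev: float, curr: float, tol: float = 0.05) -> bool:
    """Assumed rule: saturated at 0 or 1, or no meaningful change."""
    saturated = curr in (0.0, 1.0)
    no_progress = abs(curr - prev) < tol
    return saturated or no_progress


def task_stagnation(
    prev_rates: Dict[str, float], curr_rates: Dict[str, float], tol: float = 0.05
) -> float:
    """Fraction of samples in the task that are stagnant."""
    flags = [is_stagnant(prev_rates[x], curr_rates[x], tol) for x in curr_rates]
    return sum(flags) / len(flags)


def should_stop_rl(
    prev_rates: Dict[str, float],
    curr_rates: Dict[str, float],
    threshold: float = 0.7,  # hypothetical value
    tol: float = 0.05,       # hypothetical value
) -> bool:
    """Stop RL once enough of the task has stopped improving."""
    return task_stagnation(prev_rates, curr_rates, tol) >= threshold


prev = {"a": 0.5, "b": 0.25, "c": 0.0, "d": 0.5}
curr = {"a": 1.0, "b": 0.75, "c": 0.0, "d": 0.5}
score = task_stagnation(prev, curr)  # 0.75: only "b" is still improving
```

In this example three of four samples are stagnant, so RL would stop at the hypothetical threshold of 0.7 and the remaining compute would move to the SFT phase.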
At the end of RL, all examples that meet the weakness condition are sent into the weak set $\mathcal{D}_{\mathrm{weak}}$ for SFT, which the paper presents as the formal bridge from weakness revelation to weakness refinement (Zhang et al., 20 Nov 2025). Pelican-VL 1.0 gives a compatible description in terms of task saturation and RL stopping at a fixed threshold, although the more detailed derivation is in the method paper (Zhang et al., 30 Oct 2025).
The appendix-level data curation description adds operational detail. After RL, the authors run multiple rounds of rollout inference on the SFT pool, use rule-based filtering to identify weakness samples, unify formats, and use Qwen3VL-Plus and InternVL3.5-38B for scoring and voting, with random human review for quality control (Zhang et al., 20 Nov 2025). This indicates that “weakness” is not defined as epistemic uncertainty alone, but as a combination of rollout failure, low success rate, and lack of progress under continued optimization.
5. Training process, model context, and data organization
DPPO is demonstrated on Pelican-VL 1.0, a family of embodied vision-LLMs built on Qwen2.5-VL and released in sizes from 7B to 72B (Zhang et al., 20 Nov 2025, Zhang et al., 30 Oct 2025). The method paper describes Pelican-VL as a vision-language embodied model whose outputs are best understood as reasoning traces, action or planning sequences, function-call-style outputs, or structured answers, rather than low-level continuous motor commands (Zhang et al., 20 Nov 2025). This is significant because DPPO is an embodied-brain training framework rather than an end-to-end torque-control method.
The broader training corpus is organized into four capability areas: physical/spatial/numerical reasoning; perception/grounding/multi-object consistency; temporal/functional/scene understanding; and decision making/task planning (Zhang et al., 20 Nov 2025). The data curation section reports a total pool of 231M images, 29k hours of video, 231M QA pairs, 9M grounding annotations, and 2M MCQs, from which 1.3M instances are selected for SFT and 0.5M for RL, totaling about 4B training tokens (Zhang et al., 20 Nov 2025). For Pelican-VL 7B specifically, the paper states 200K SFT instances and 194K RL instances (Zhang et al., 20 Nov 2025).
The temporal setup itself follows a curriculum across metaloops. The paper reports three metaloops total; the first uses videos shorter than 32s, the second relaxes to 64s, with up to 32 frames sampled per episode and RL rollout sequences of up to 16 time steps (Zhang et al., 20 Nov 2025). This progression functions as a horizon curriculum from shorter to longer temporal reasoning, though the paper does not formalize it under a separate curriculum-learning objective.
To preserve general capabilities, DPPO incorporates replay data into the SFT dataset $\mathcal{D}_{\mathrm{SFT}}$. The paper describes a replay pool based on natural-world video QA: SpatialVID with expert labels removed, 24 QA pairs per video for 75k videos generated by Qwen3VL-Plus, filtered by InternVL3.5 consistency checks to yield 14k QAs, then augmented with 19k QA videos from InternSpatial (Zhang et al., 20 Nov 2025). This replay component is explicitly described as anti-forgetting data.
Pelican-VL 1.0 reports substantial compute and infrastructure scale: more than 1,000 A800 GPUs, more than 50k A800 GPU-hours per checkpoint, and adaptation of the VERL framework to support 72B-scale mixed-modal RL training with Context Parallelism and heterogeneous multimodal batching (Zhang et al., 30 Oct 2025). The method paper itself focuses less on systems engineering, but the model report clarifies that DPPO is deployed as a large-scale training recipe rather than as a small proof-of-concept (Zhang et al., 30 Oct 2025).
6. Empirical results, ablations, and limitations
The most prominent quantitative claim is that Pelican-VL 1.0 72B, trained with DPPO, achieves a 20.3% performance improvement over its base model and surpasses open-source models at the 100B-parameter scale by 10.6% (Zhang et al., 20 Nov 2025). In the benchmark table cited by the paper, Pelican-VL 72B reaches an average of 63.8 across the embodied suite, compared with 57.7 for the best open-source non-Pelican competitor under 100B parameters, a 6.1-point absolute gap corresponding to the stated 10.6% relative improvement (Zhang et al., 20 Nov 2025). The reported benchmark suite includes MVBench, RefSpatialBench, VSI-Bench, EgoSchema, Where2Place, COSMOS, RoboSpatial, BLINK, PhyX, OmniSpatial, EmbSpatialBench, and ERQA (Zhang et al., 20 Nov 2025).
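The relationship between the absolute gap and the relative improvement quoted above can be checked directly. The short sketch below uses only the two averages reported in the text; the helper name is illustrative.

```python
def relative_gain_percent(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


pelican_72b = 63.8       # reported suite average for Pelican-VL 72B
best_open_source = 57.7  # reported best open-source competitor under 100B

absolute_gap = pelican_72b - best_open_source
relative_gap = relative_gain_percent(pelican_72b, best_open_source)

print(f"absolute gap: {absolute_gap:.1f} points")  # absolute gap: 6.1 points
print(f"relative gap: {relative_gap:.1f}%")        # relative gap: 10.6%
```

The 10.6% figure is therefore a relative improvement measured against the competitor's average, not a difference in percentage points.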
The most direct ablation is the 7B comparison under identical data budget. Table 2 reports:
- Base: 33.5 average
- RL only: 40.7
- SFT only: 39.9
- DPPO: 51.0
The paper emphasizes that alternating RL and SFT is much stronger than either alone, especially on hard tasks such as EgoSchema, Where2Place, OmniSpatial, RefSpatial, and VSI-Bench (Zhang et al., 20 Nov 2025). This is the clearest evidence that the alternation itself matters, rather than merely the presence of both losses somewhere in training.
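A quick calculation on the reported 7B averages makes the size of the effect explicit. The numbers are the ones listed above; the variable names are illustrative.

```python
ablation = {"Base": 33.5, "RL only": 40.7, "SFT only": 39.9, "DPPO": 51.0}

# Absolute gain of each variant over the base model, in average points.
gains = {
    name: round(score - ablation["Base"], 1)
    for name, score in ablation.items()
    if name != "Base"
}

# Sum of the two single-method gains, for comparison with DPPO's gain.
single_method_sum = round(gains["RL only"] + gains["SFT only"], 1)

print(gains)              # {'RL only': 7.2, 'SFT only': 6.4, 'DPPO': 17.5}
print(single_method_sum)  # 13.6
```

On these averages, the DPPO gain of 17.5 points exceeds the sum of the RL-only and SFT-only gains (13.6 points), which is consistent with the claim that the benefit comes from the alternation rather than from simply having both training signals.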
A second important result concerns catastrophic forgetting. The authors report that DPPO gives +15.8% over RL in performance gain while showing less drop on unseen datasets, and on MMStar the degradation is 1.9 for DPPO, 5.0 for SFT, and 24.8 for RL (Zhang et al., 20 Nov 2025). This supports the function of replay data $\mathcal{D}_{\mathrm{replay}}$ and the stabilizing effect of interleaving SFT with RL.
The paper also includes a difficulty-aware sampling ablation. Replacing difficulty-aware sampling with random sampling destabilizes the loop, and by the second RL stage the model collapses to outputting only final answers and loses chain-of-thought capability (Zhang et al., 20 Nov 2025). This result is particularly relevant because it shows that DPPO’s closed-loop data selection is not incidental; it is central to the training dynamics the paper is trying to induce.
Stage-wise effects are reported over three metaloops. Performance generally improves from stage to stage, and the third stage especially benefits chain-of-thought reasoning and generalization even when direct-answer metrics plateau (Zhang et al., 20 Nov 2025). The authors interpret this as deeper refinement rather than simple benchmark overfitting.
The paper presents several limitations, some explicit and some implicit. Rule-based reward engineering remains important because RL depends on carefully designed format and task rewards (Zhang et al., 20 Nov 2025). Teacher and filtering quality also matter, since weakness refinement depends on stronger models, voting, and some human review (Zhang et al., 20 Nov 2025). The empirical demonstration is primarily on benchmarked embodied VLM reasoning tasks rather than a fully closed-loop low-level robotic control stack (Zhang et al., 20 Nov 2025). Evaluation itself remains a challenge because existing benchmarks are imbalanced and often black-box, which complicates claims about broad embodied generality (Zhang et al., 20 Nov 2025). Pelican-VL 1.0 explicitly describes the current system as a “first step” toward a broader self-evolving embodied ecosystem with autonomous DPPO and closed hardware loop integration (Zhang et al., 30 Oct 2025).
A recurrent misconception is that DPPO is merely “RL plus SFT.” The method paper argues against that interpretation by assigning distinct functions to the phases and by formalizing the loop as repeated weakness detection, targeted data construction, and competence consolidation (Zhang et al., 20 Nov 2025). Another possible confusion arises from acronym overlap: several later papers use “DPPO” for “Divergence Proximal Policy Optimization,” “Dynamic Pruning Policy Optimization,” or related PPO-family variants in other domains, but those are separate methods and not part of Deliberate Practice Policy Optimization (Qi et al., 4 Feb 2026, Zhu et al., 4 Mar 2026).
DPPO is therefore best understood as a deliberate-practice post-training framework for embodied VLMs. Its novelty lies not in a single optimizer, but in a closed-loop training design that operationalizes weakness discovery through RL and weakness repair through SFT under a unified preference-learning interpretation (Zhang et al., 20 Nov 2025). In the Pelican-VL instantiation, this design is presented as a systematic response to embodied data scarcity and algorithmic inefficiency, and the reported results suggest that the alternation can substantially improve both capability and retention at open-source scale (Zhang et al., 20 Nov 2025, Zhang et al., 30 Oct 2025).