
Demo-SCORE: Autonomous Demo Curation

Updated 13 September 2025
  • Demo-SCORE is a methodology for robot learning that leverages trial outcomes to autonomously score and filter heterogeneous demonstration datasets.
  • The framework trains an initial policy to generate rollouts and a classifier to discern successful versus suboptimal trajectories, leading to improved policy performance.
  • By automating the curation of demonstration data, Demo-SCORE significantly enhances imitation learning, with noted success improvements in both simulated and real-world tasks.

Demo-SCORE refers to "Demonstration Self-Curated via Online Rollout Experience," a methodology for robot learning that leverages the robot’s own trial outcomes to autonomously score and curate heterogeneous demonstration datasets. This approach specifically addresses the problem of mixed-quality demonstrations—which can be detrimental to robot policy optimization—by using a learned classifier to filter out suboptimal examples before final imitation learning. Developed for both simulation and real-world robotic settings, Demo-SCORE has been shown to dramatically improve final task success rates compared to using all available demonstrations indiscriminately (Chen et al., 5 Mar 2025).

1. Motivation and Heterogeneous Demonstration Datasets

Robotic imitation learning often begins with a collection of human or autonomous demonstrations. However, datasets typically contain varied strategies and levels of execution quality, some of which may be unreliable or underrepresented. Subpar or infrequent strategies—if included in the training data—can cause the resulting policy to malfunction when such behaviors surface at test time. Manual curation by humans is costly and challenging, especially when unreliable strategies are subtle or require expert judgement to detect. Demo-SCORE was proposed to automate this curation process based on the robot’s empirical policy performance when trained with the heterogeneous dataset.

2. Methodological Overview

Demo-SCORE proceeds in four stages (a code sketch of the scoring and filtering stages follows the list):

  1. Policy Training and Rollout Collection:
    • An initial policy $\pi_0$ is trained via imitation learning using the full set of demonstrations $\mathcal{D}_{\text{demo}}$.
    • Rollouts are executed by $\pi_0$ across several training checkpoints, each outcome (trajectory) labeled as success or failure.
    • These rollout datasets $\mathcal{D}_{\pi_0,1}, \ldots, \mathcal{D}_{\pi_0,C}$ are used to reflect "how reliably the policy imitates the demos."
  2. Quality Classifier Training and Cross-Validation:
    • A classifier $q_\phi$ (commonly a small MLP) is trained to distinguish trajectory (or state) success from failure.
    • To avoid overfitting to any specific rollout distribution, cross-validation is conducted: train $q_\phi$ on rollouts from certain checkpoints; validate on others.
    • The best classifier $q_{\phi^*}$ is selected as the one minimizing binary cross-entropy loss on the held-out set.
  3. Demonstration Scoring and Filtering:
    • The classifier is applied to each demonstration $\tau$; an average score $\bar{q}_{\phi^*}(\tau)$ is computed:

    $$\bar{q}_{\phi^*}(\tau) = \frac{1}{T} \sum_{t=1}^{T} q_{\phi^*}(y=1 \mid s_t)$$

    • A threshold $\gamma$ is established (e.g., the average classifier output on the best checkpoint's rollouts):

    $$\gamma = \frac{1}{|\mathcal{D}_{\pi_0,i^*}|} \sum_{s_t \in \mathcal{D}_{\pi_0,i^*}} q_{\phi^*}(y=1 \mid s_t)$$

    • Demos with $\bar{q}_{\phi^*}(\tau) > \gamma$ form the filtered set $\mathcal{D}_{\text{demo,filt}}$.

  4. Re-Training or Fine-Tuning:
    • A final policy is trained exclusively on the filtered, high-quality demonstration set, optionally supplemented with additional high-performing rollouts.
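
A minimal sketch of the scoring and filtering stages, assuming the selected classifier is available as a callable mapping a state to $q_{\phi^*}(y=1 \mid s)$; the function names and array-based state representation are illustrative, not from the paper:

```python
import numpy as np

def score_demo(classifier, demo_states):
    """Mean per-state success probability over one demonstration (q-bar in Sec. 2)."""
    return float(np.mean([classifier(s) for s in demo_states]))

def filter_demos(classifier, demos, best_rollout_states):
    """Keep demos whose mean score exceeds gamma, the mean classifier
    output on the best checkpoint's rollout states."""
    gamma = float(np.mean([classifier(s) for s in best_rollout_states]))
    return [demo for demo in demos if score_demo(classifier, demo) > gamma]
```

The final policy would then be retrained on the output of `filter_demos`, optionally augmented with successful rollouts.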

3. Classifier and Cross-Validation Details

The classifier $q_\phi$ is evaluated at the state or trajectory level. The loss function per trajectory is:

$$L_\phi(\tau, y) = \frac{1}{T} \sum_{t=1}^{T} \Big[ -y \log q_\phi(y=1 \mid s_t) - (1-y) \log\big(1 - q_\phi(y=1 \mid s_t)\big) \Big]$$

Across a collection of $M$ trajectories (with binary labels), the total loss is averaged accordingly.

To generalize across policy improvements (i.e., as $\pi_0$ improves, the rollout distribution shifts), rollouts are collected at $C$ checkpoints. Training on $C-1$ of them and validating on the held-out checkpoint ensures $q_{\phi^*}$ does not merely overfit to the quirks of early, error-prone rollouts.
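
A sketch of this checkpoint-wise cross-validation, using scikit-learn's MLPClassifier as a stand-in for the small MLP; the data layout (fixed-length state feature vectors with binary success labels per checkpoint) and the hyperparameters are assumptions for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

def select_classifier(rollout_sets):
    """rollout_sets: list of C (states, labels) pairs, one per policy checkpoint,
    where states is an (N, d) array and labels an (N,) binary array.
    Train on C-1 checkpoints, validate on the held-out one, and keep the
    model with the lowest held-out binary cross-entropy."""
    best_model, best_loss = None, float("inf")
    for held_out in range(len(rollout_sets)):
        train = [rollout_sets[i] for i in range(len(rollout_sets)) if i != held_out]
        X_train = np.concatenate([states for states, _ in train])
        y_train = np.concatenate([labels for _, labels in train])
        X_val, y_val = rollout_sets[held_out]
        model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
        model.fit(X_train, y_train)
        # Binary cross-entropy of predicted success probability on held-out rollouts.
        val_loss = log_loss(y_val, model.predict_proba(X_val)[:, 1], labels=[0, 1])
        if val_loss < best_loss:
            best_model, best_loss = model, val_loss
    return best_model
```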

4. Demonstration Filtering and Scoring Criteria

The key mechanism is to use the classifier's estimate of empirical success probability as the decision criterion for what constitutes a "reliable" demonstration. Instead of relying on human assessments or task-specific heuristics, Demo-SCORE lets this prediction, conditioned on the robot's own execution history, drive the filtering.

By applying $q_{\phi^*}$ to all demos, high-scoring trajectories are kept and low-scoring ones are discarded. The approach is flexible in its granularity: the classifier can score entire episodes or finer sub-trajectories ("chunks"), retaining partial demonstrations when useful, as in the sketch below.
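
A sketch of chunk-level scoring under the same assumed classifier interface as above; the fixed chunk length is a hypothetical choice, not a value from the paper:

```python
import numpy as np

def score_chunks(classifier, demo_states, chunk_len=50):
    """Score fixed-length sub-trajectories so that reliable portions of an
    otherwise mixed-quality demonstration can be retained."""
    scores = []
    for start in range(0, len(demo_states), chunk_len):
        chunk = demo_states[start:start + chunk_len]
        scores.append(float(np.mean([classifier(s) for s in chunk])))
    return scores  # compare each chunk's score to gamma to keep or drop it
```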

5. Experimental Evidence and Performance Gains

In both simulation and real-world robot manipulation tasks, Demo-SCORE significantly outperformed training with unfiltered demonstration sets:

  • Success Rate Improvement: Policies trained after Demo-SCORE filtering achieved an absolute success rate 15–35% higher than the baseline.
  • Simulated Results: In tasks like square peg insertion, success reached 94%—much higher than with all-original-demos policies or simple methods such as Auto-IL.
  • Real-World Results: On bimanual robot tasks (ALOHA platform), substantial improvements were observed (e.g., on the Jellybean task, aggregated subtask score rose from 27% to 49%; full task success rate increased from 0% to 20%).
  • Ablation: Classifier cross-validation was shown to prevent overfitting; omitting this step resulted in poor generalization to the demonstration domain.

6. Implications and Extensions

Demo-SCORE has several significant implications for robotic learning:

  • Automated, Experience-Driven Curation: The method removes the need for manual data curation or complex scripted filters, delegating this judgement to the agent’s own policy outcome distribution.
  • Policy Robustness: Filtering based on observed success rate ensures the final policy over-represents reliable, reproducible strategies and de-emphasizes brittle or rare ones.
  • Adaptive and Scalable: The procedure scales to very large datasets and can be tuned for different levels of selectivity or granularity, including mixing demonstration retention with chunk-level retention.
  • Integration with Broader Learning Pipelines: Quality scores from Demo-SCORE can be used as weights in the policy loss, feeding into curriculum learning or adaptive imitation learning schemes (a sketch of such weighting follows below).
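
A minimal sketch of one such extension, weighting a behavior-cloning loss by per-demo quality scores; this illustrates the idea above rather than a method from the paper, and the `policy.log_prob` interface is an assumed one:

```python
def weighted_bc_loss(policy, batch, demo_scores):
    """Behavior-cloning loss where each transition's negative log-likelihood
    is scaled by the Demo-SCORE quality score of its source demonstration."""
    total = 0.0
    for state, action, demo_id in batch:
        nll = -policy.log_prob(action, state)   # standard BC objective term
        total += demo_scores[demo_id] * nll     # up-weight reliable demos
    return total / len(batch)
```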

A plausible implication is that this paradigm will become critical as demonstration datasets grow larger, more heterogeneous, and less manually curated, especially in industrial or dynamic learning settings.

7. Limitations and Future Directions

Demo-SCORE's effectiveness is currently contingent upon the quality of rollout execution and the specificity of task-success signals. Extending the success classifier to richer reward functions, using multimodal policy feedback, or integrating with more sophisticated meta-learning/curriculum mechanisms may further refine its filtering ability. Future research can also explore joint optimization of the imitation policy and demonstration filtration, closing the feedback loop between data curation and policy improvement. Additionally, qualitative assessment of strategy diversity after filtering may help balance robustness with exploration of novel behaviors.


Demo-SCORE represents a principled, empirically validated approach for filtering demonstration datasets by learning—directly from the robot’s performance—what constitutes a reproducibly successful strategy, dramatically enhancing final policy reliability in imitation learning (Chen et al., 5 Mar 2025).

References

  1. Chen et al., 5 Mar 2025.