Demo-SCORE: Autonomous Demo Curation
- Demo-SCORE is a methodology for robot learning that leverages trial outcomes to autonomously score and filter heterogeneous demonstration datasets.
- The two-stage framework trains an initial policy to generate rollouts and a classifier to discern successful versus suboptimal trajectories, leading to improved policy performance.
- By automating the curation of demonstration data, Demo-SCORE significantly enhances imitation learning, with noted success improvements in both simulated and real-world tasks.
Demo-SCORE refers to "Demonstration Self-Curated via Online Rollout Experience," a methodology for robot learning that leverages the robot’s own trial outcomes to autonomously score and curate heterogeneous demonstration datasets. This approach specifically addresses the problem of mixed-quality demonstrations—which can be detrimental to robot policy optimization—by using a learned classifier to filter out suboptimal examples before final imitation learning. Developed for both simulation and real-world robotic settings, Demo-SCORE has been shown to dramatically improve final task success rates compared to using all available demonstrations indiscriminately (Chen et al., 5 Mar 2025).
1. Motivation and Heterogeneous Demonstration Datasets
Robotic imitation learning often begins with a collection of human or autonomous demonstrations. However, datasets typically contain varied strategies and levels of execution quality, some of which may be unreliable or underrepresented. Subpar or infrequent strategies—if included in the training data—can cause the resulting policy to malfunction when such behaviors surface at test time. Manual curation by humans is costly and challenging, especially when unreliable strategies are subtle or require expert judgement to detect. Demo-SCORE was proposed to automate this curation process based on the robot’s empirical policy performance when trained with the heterogeneous dataset.
2. Methodological Overview
Demo-SCORE consists of a two-stage framework, organized into the following steps (a minimal code sketch follows the list):
- Policy Training and Rollout Collection:
- An initial policy $\pi_0$ is trained via imitation learning on the full demonstration set $\mathcal{D}$.
- Rollouts are executed by $\pi_0$ across several training checkpoints, with each resulting trajectory labeled as a success or failure.
- These rollout datasets capture how reliably the policy imitates the demonstrations.
- Quality Classifier Training and Cross-Validation:
- A classifier $\hat{c}$ (commonly a small MLP) is trained to distinguish successful from failed trajectories (or states).
- To avoid overfitting to any specific rollout distribution, cross-validation is conducted: train on rollouts from certain checkpoints; validate on others.
- The best classifier is selected as the one minimizing binary cross-entropy loss on the held-out set.
- Demonstration Scoring and Filtering:
- The classifier $\hat{c}$ is applied to each demonstration $\xi$, and an average score over its states is computed: $s(\xi) = \frac{1}{|\xi|} \sum_{s_t \in \xi} \hat{c}(s_t)$.
- A threshold $\tau$ is established, e.g., the average classifier score on the best checkpoint's rollouts $\mathcal{R}_{\text{best}}$: $\tau = \frac{1}{|\mathcal{R}_{\text{best}}|} \sum_{\rho \in \mathcal{R}_{\text{best}}} s(\rho)$.
- Demos with $s(\xi) \geq \tau$ form the filtered set $\mathcal{D}_{\text{filtered}}$.
- Re-Training or Fine-Tuning:
- A final policy is trained exclusively on the filtered, high-quality demonstration set, optionally supplemented with additional high-performing rollouts.
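A minimal end-to-end sketch of this pipeline is given below. The callables `train_policy`, `collect_rollouts`, and `fit_classifier`, as well as the trajectory dictionary layout, are hypothetical interfaces introduced for illustration, not the authors' released code; the use of the final checkpoint as the "best" one is likewise an assumption.

```python
import numpy as np

def demo_score_pipeline(demos, train_policy, collect_rollouts, fit_classifier):
    """Illustrative Demo-SCORE loop: train, roll out, classify, filter, retrain.

    demos: list of trajectories, each a dict with a 'states' array.
    train_policy, collect_rollouts, fit_classifier: user-supplied callables
    (hypothetical interfaces for this sketch).
    """
    # Stage 1: imitation learning on the full, heterogeneous demo set,
    # keeping several intermediate checkpoints.
    checkpoints = train_policy(demos)

    # Roll out each checkpoint; each rollout is a dict with 'states' and a
    # binary 'success' label from the task's success signal.
    rollouts_per_ckpt = [collect_rollouts(ckpt) for ckpt in checkpoints]

    # Stage 2: train the success classifier with checkpoint-wise cross-validation.
    classifier = fit_classifier(rollouts_per_ckpt)

    # Score a trajectory by its mean per-state classifier output.
    def score(traj):
        return float(np.mean([classifier(s) for s in traj["states"]]))

    # Threshold: average score of the final (assumed best) checkpoint's rollouts.
    tau = float(np.mean([score(r) for r in rollouts_per_ckpt[-1]]))

    # Keep only demos whose score clears the threshold, then retrain.
    filtered = [d for d in demos if score(d) >= tau]
    final_checkpoints = train_policy(filtered)
    return final_checkpoints[-1], filtered
```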
3. Classifier and Cross-Validation Details
The classifier $\hat{c}$ is evaluated at the state or trajectory level. For a trajectory $\xi$ with binary success label $y_\xi \in \{0, 1\}$, the per-trajectory loss is the binary cross-entropy averaged over its states:

$$\mathcal{L}(\xi) = -\frac{1}{|\xi|} \sum_{s_t \in \xi} \Big[ y_\xi \log \hat{c}(s_t) + (1 - y_\xi) \log\big(1 - \hat{c}(s_t)\big) \Big]$$

Across a collection of labeled trajectories, the total loss is the average of these per-trajectory terms.
To generalize across policy improvements (i.e., as $\pi$ improves, the rollout distribution shifts), rollouts are collected at multiple checkpoints. Training on rollouts from earlier checkpoints and validating on those from the last ensures the classifier does not merely overfit to the quirks of early, error-prone rollouts.
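As a concrete illustration of this cross-validation step, the sketch below trains one small MLP per candidate checkpoint split and keeps the one with the lowest held-out binary cross-entropy. The architecture, optimizer settings, data layout, and the leave-one-checkpoint-out scheme are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

def fit_classifier_with_cv(rollouts_per_ckpt, state_dim, epochs=200, lr=1e-3):
    """Train state-level success classifiers on rollouts from all-but-one
    checkpoint and keep the one with the lowest held-out BCE (sketch)."""
    def to_tensors(rollouts):
        # Each rollout: dict with 'states' [T, state_dim] and a binary 'success' flag.
        xs = torch.cat([torch.as_tensor(r["states"], dtype=torch.float32) for r in rollouts])
        ys = torch.cat([torch.full((len(r["states"]),), float(r["success"])) for r in rollouts])
        return xs, ys

    best_clf, best_val = None, float("inf")
    for held_out in range(len(rollouts_per_ckpt)):
        train = [r for i, rs in enumerate(rollouts_per_ckpt) if i != held_out for r in rs]
        x_tr, y_tr = to_tensors(train)
        x_va, y_va = to_tensors(rollouts_per_ckpt[held_out])

        clf = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        opt = torch.optim.Adam(clf.parameters(), lr=lr)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(clf(x_tr).squeeze(-1), y_tr).backward()
            opt.step()

        # Keep the classifier that generalizes best to the held-out checkpoint.
        with torch.no_grad():
            val_loss = loss_fn(clf(x_va).squeeze(-1), y_va).item()
        if val_loss < best_val:
            best_clf, best_val = clf, val_loss
    return best_clf
```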
4. Demonstration Filtering and Scoring Criteria
The key mechanism is to use the expected rollout success probabilities as the decision boundary for what constitutes a "reliable" demonstration. Instead of relying on human assessments or task-specific heuristics, Demo-SCORE lets the classifier’s prediction of empirical success rate—conditioned on the robot's own execution history—drive the filtering.
By applying $\hat{c}$ to all demonstrations, high-scoring trajectories are kept and suboptimal ones are discarded. The approach is flexible in its granularity: the classifier can score entire episodes or finer sub-trajectory "chunks," retaining partial demonstrations when useful.
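The snippet below is a hypothetical illustration of chunk-level retention; the fixed window size and the assumption that the classifier maps a single state to a success probability are choices made for this sketch.

```python
import numpy as np

def filter_chunks(demo_states, classifier, tau, chunk_len=50):
    """Score fixed-length chunks of one demonstration and keep those whose
    mean classifier score clears the threshold tau (illustrative sketch)."""
    kept = []
    for start in range(0, len(demo_states), chunk_len):
        chunk = demo_states[start:start + chunk_len]
        score = float(np.mean([classifier(s) for s in chunk]))
        if score >= tau:
            kept.append(chunk)  # retain only reliable sub-trajectories
    return kept
```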
5. Experimental Evidence and Performance Gains
In both simulation and real-world robot manipulation tasks, Demo-SCORE significantly outperformed training with unfiltered demonstration sets:
- Success Rate Improvement: Policies trained after Demo-SCORE filtering achieved an absolute success rate 15–35% higher than the baseline.
- Simulated Results: In tasks such as square peg insertion, success reached 94%, well above policies trained on all original demos and simple baselines such as Auto-IL.
- Real-World Results: On bimanual robot tasks (ALOHA platform), substantial improvements were observed (e.g., on the Jellybean task, aggregated subtask score rose from 27% to 49%; full task success rate increased from 0% to 20%).
- Ablation: Classifier cross-validation was shown to prevent overfitting; omitting this step resulted in poor generalization to the demonstration domain.
6. Implications and Extensions
Demo-SCORE has several significant implications for robotic learning:
- Automated, Experience-Driven Curation: The method removes the need for manual data curation or complex scripted filters, delegating this judgement to the agent’s own policy outcome distribution.
- Policy Robustness: Filtering based on observed success-rate ensures the final policy over-represents reliable, reproducible strategies and de-emphasizes brittle or rare ones.
- Adaptive and Scalable: The procedure scales to very large datasets and can be tuned for different levels of selectivity or granularity, including mixing demonstration retention with chunk-level retention.
- Integration with Broader Learning Pipelines: Quality scores from Demo-SCORE can be used as weights in the policy loss, feeding into curriculum learning or adaptive imitation learning schemes.
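For example, per-demonstration scores could weight a standard behavior-cloning loss. The snippet below is a speculative sketch of such an integration under assumed tensor layouts, not part of Demo-SCORE as published.

```python
import torch

def weighted_bc_loss(policy, batch, demo_scores):
    """Behavior-cloning loss weighted by per-demo quality scores (sketch).

    batch: dict with 'obs', 'actions', and 'demo_idx' tensors.
    demo_scores: tensor of per-demonstration classifier scores in [0, 1].
    """
    pred = policy(batch["obs"])                       # predicted actions
    per_sample = ((pred - batch["actions"]) ** 2).mean(dim=-1)
    weights = demo_scores[batch["demo_idx"]]          # look up each sample's demo score
    return (weights * per_sample).mean()
```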
A plausible implication is that this paradigm will become critical as demonstration datasets grow larger, more heterogeneous, and less manually curated, especially in industrial or dynamic learning settings.
7. Limitations and Future Directions
Demo-SCORE's effectiveness is currently contingent upon the quality of rollout execution and the specificity of task-success signals. Extending the success classifier to richer reward functions, using multimodal policy feedback, or integrating with more sophisticated meta-learning/curriculum mechanisms may further refine its filtering ability. Future research can also explore joint optimization of the imitation policy and demonstration filtration, closing the feedback loop between data curation and policy improvement. Additionally, qualitative assessment of strategy diversity after filtering may help balance robustness with exploration of novel behaviors.
Demo-SCORE represents a principled, empirically validated approach for filtering demonstration datasets by learning—directly from the robot’s performance—what constitutes a reproducibly successful strategy, dramatically enhancing final policy reliability in imitation learning (Chen et al., 5 Mar 2025).