Demo-SCORE: Autonomous Demo Curation
- Demo-SCORE is a methodology for robot learning that leverages trial outcomes to autonomously score and filter heterogeneous demonstration datasets.
- The two-stage framework trains an initial policy to generate rollouts and a classifier to discern successful versus suboptimal trajectories, leading to improved policy performance.
- By automating the curation of demonstration data, Demo-SCORE significantly enhances imitation learning, with noted success improvements in both simulated and real-world tasks.
Demo-SCORE refers to "Demonstration Self-Curated via Online Rollout Experience," a methodology for robot learning that leverages the robot’s own trial outcomes to autonomously score and curate heterogeneous demonstration datasets. This approach specifically addresses the problem of mixed-quality demonstrations—which can be detrimental to robot policy optimization—by using a learned classifier to filter out suboptimal examples before final imitation learning. Developed for both simulation and real-world robotic settings, Demo-SCORE has been shown to dramatically improve final task success rates compared to using all available demonstrations indiscriminately (Chen et al., 5 Mar 2025).
1. Motivation and Heterogeneous Demonstration Datasets
Robotic imitation learning often begins with a collection of human or autonomous demonstrations. However, datasets typically contain varied strategies and levels of execution quality, some of which may be unreliable or underrepresented. Subpar or infrequent strategies—if included in the training data—can cause the resulting policy to malfunction when such behaviors surface at test time. Manual curation by humans is costly and challenging, especially when unreliable strategies are subtle or require expert judgement to detect. Demo-SCORE was proposed to automate this curation process based on the robot’s empirical policy performance when trained with the heterogeneous dataset.
2. Methodological Overview
Demo-SCORE consists of a two-stage framework, organized into the following steps (a minimal code sketch follows the list):
- Policy Training and Rollout Collection:
- An initial policy $\pi_0$ is trained via imitation learning on the full demonstration set $\mathcal{D}$.
- Rollouts are executed by $\pi_0$ across several training checkpoints, with each resulting trajectory labeled as a success or failure.
- These rollout datasets capture how reliably the policy imitates the demonstrations.
- Quality Classifier Training and Cross-Validation:
- A classifier $\hat{c}$ (commonly a small MLP) is trained to distinguish successful from failed trajectories (or states).
- To avoid overfitting to any specific rollout distribution, cross-validation is conducted: train on rollouts from certain checkpoints; validate on others.
- The best classifier is selected as the one minimizing binary cross-entropy loss on the held-out set.
- Demonstration Scoring and Filtering:
- The classifier $\hat{c}$ is applied to each demonstration $\xi$, and an average score over its states is computed: $s(\xi) = \frac{1}{|\xi|} \sum_{s_t \in \xi} \hat{c}(s_t)$.
- A threshold $\tau$ is established, e.g., the average classifier score on the best checkpoint's rollouts $\mathcal{R}_{\text{best}}$: $\tau = \frac{1}{|\mathcal{R}_{\text{best}}|} \sum_{\rho \in \mathcal{R}_{\text{best}}} s(\rho)$.
- Demos with $s(\xi) \geq \tau$ form the filtered set $\mathcal{D}_{\text{filtered}}$.
- Re-Training or Fine-Tuning:
- A final policy is trained exclusively on the filtered, high-quality demonstration set, optionally supplemented with additional high-performing rollouts.
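A minimal end-to-end sketch of this pipeline is given below. The callables `train_policy`, `collect_rollouts`, and `fit_classifier`, as well as the trajectory dictionary layout, are hypothetical interfaces introduced for illustration, not the authors' released code; the use of the final checkpoint as the "best" one is likewise an assumption.

```python
import numpy as np

def demo_score_pipeline(demos, train_policy, collect_rollouts, fit_classifier):
    """Illustrative Demo-SCORE loop: train, roll out, classify, filter, retrain.

    demos: list of trajectories, each a dict with a 'states' array.
    train_policy, collect_rollouts, fit_classifier: user-supplied callables
    (hypothetical interfaces for this sketch).
    """
    # Stage 1: imitation learning on the full, heterogeneous demo set,
    # keeping several intermediate checkpoints.
    checkpoints = train_policy(demos)

    # Roll out each checkpoint; each rollout is a dict with 'states' and a
    # binary 'success' label from the task's success signal.
    rollouts_per_ckpt = [collect_rollouts(ckpt) for ckpt in checkpoints]

    # Stage 2: train the success classifier with checkpoint-wise cross-validation.
    classifier = fit_classifier(rollouts_per_ckpt)

    # Score a trajectory by its mean per-state classifier output.
    def score(traj):
        return float(np.mean([classifier(s) for s in traj["states"]]))

    # Threshold: average score of the final (assumed best) checkpoint's rollouts.
    tau = float(np.mean([score(r) for r in rollouts_per_ckpt[-1]]))

    # Keep only demos whose score clears the threshold, then retrain.
    filtered = [d for d in demos if score(d) >= tau]
    final_checkpoints = train_policy(filtered)
    return final_checkpoints[-1], filtered
```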
3. Classifier and Cross-Validation Details
The classifier $\hat{c}$ is evaluated at the state or trajectory level. For a trajectory $\xi$ with binary success label $y_\xi \in \{0, 1\}$, the per-trajectory loss is the binary cross-entropy averaged over its states:

$$\mathcal{L}(\xi) = -\frac{1}{|\xi|} \sum_{s_t \in \xi} \Big[ y_\xi \log \hat{c}(s_t) + (1 - y_\xi) \log\big(1 - \hat{c}(s_t)\big) \Big]$$

Across a collection of labeled trajectories, the total loss is the average of these per-trajectory terms.
To generalize across policy improvements (i.e., as $\pi$ improves, the rollout distribution shifts), rollouts are collected at multiple checkpoints. Training on rollouts from earlier checkpoints and validating on those from the last ensures the classifier does not merely overfit to the quirks of early, error-prone rollouts.
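As a concrete illustration of this cross-validation step, the sketch below trains one small MLP per candidate checkpoint split and keeps the one with the lowest held-out binary cross-entropy. The architecture, optimizer settings, data layout, and the leave-one-checkpoint-out scheme are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

def fit_classifier_with_cv(rollouts_per_ckpt, state_dim, epochs=200, lr=1e-3):
    """Train state-level success classifiers on rollouts from all-but-one
    checkpoint and keep the one with the lowest held-out BCE (sketch)."""
    def to_tensors(rollouts):
        # Each rollout: dict with 'states' [T, state_dim] and a binary 'success' flag.
        xs = torch.cat([torch.as_tensor(r["states"], dtype=torch.float32) for r in rollouts])
        ys = torch.cat([torch.full((len(r["states"]),), float(r["success"])) for r in rollouts])
        return xs, ys

    best_clf, best_val = None, float("inf")
    for held_out in range(len(rollouts_per_ckpt)):
        train = [r for i, rs in enumerate(rollouts_per_ckpt) if i != held_out for r in rs]
        x_tr, y_tr = to_tensors(train)
        x_va, y_va = to_tensors(rollouts_per_ckpt[held_out])

        clf = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        opt = torch.optim.Adam(clf.parameters(), lr=lr)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(clf(x_tr).squeeze(-1), y_tr).backward()
            opt.step()

        # Keep the classifier that generalizes best to the held-out checkpoint.
        with torch.no_grad():
            val_loss = loss_fn(clf(x_va).squeeze(-1), y_va).item()
        if val_loss < best_val:
            best_clf, best_val = clf, val_loss
    return best_clf
```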
4. Demonstration Filtering and Scoring Criteria
The key mechanism is to use the expected rollout success probabilities as the decision boundary for what constitutes a "reliable" demonstration. Instead of relying on human assessments or task-specific heuristics, Demo-SCORE lets the classifier’s prediction of empirical success rate—conditioned on the robot's own execution history—drive the filtering.
By applying $\hat{c}$ to all demonstrations, high-scoring trajectories are kept and suboptimal ones are discarded. The approach is flexible in its granularity: the classifier can score entire episodes or finer sub-trajectory "chunks," retaining partial demonstrations when useful.
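The snippet below is a hypothetical illustration of chunk-level retention; the fixed window size and the assumption that the classifier maps a single state to a success probability are choices made for this sketch.

```python
import numpy as np

def filter_chunks(demo_states, classifier, tau, chunk_len=50):
    """Score fixed-length chunks of one demonstration and keep those whose
    mean classifier score clears the threshold tau (illustrative sketch)."""
    kept = []
    for start in range(0, len(demo_states), chunk_len):
        chunk = demo_states[start:start + chunk_len]
        score = float(np.mean([classifier(s) for s in chunk]))
        if score >= tau:
            kept.append(chunk)  # retain only reliable sub-trajectories
    return kept
```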
5. Experimental Evidence and Performance Gains
In both simulation and real-world robot manipulation tasks, Demo-SCORE significantly outperformed training with unfiltered demonstration sets:
- Success Rate Improvement: Policies trained after Demo-SCORE filtering achieved an absolute success rate 15–35% higher than the baseline.
- Simulated Results: In tasks such as square peg insertion, success reached 94%, well above policies trained on all original demos and simple baselines such as Auto-IL.
- Real-World Results: On bimanual robot tasks (ALOHA platform), substantial improvements were observed (e.g., on the Jellybean task, aggregated subtask score rose from 27% to 49%; full task success rate increased from 0% to 20%).
- Ablation: Classifier cross-validation was shown to prevent overfitting; omitting this step resulted in poor generalization to the demonstration domain.
6. Implications and Extensions
Demo-SCORE has several significant implications for robotic learning:
- Automated, Experience-Driven Curation: The method removes the need for manual data curation or complex scripted filters, delegating this judgement to the agent’s own policy outcome distribution.
- Policy Robustness: Filtering based on observed success-rate ensures the final policy over-represents reliable, reproducible strategies and de-emphasizes brittle or rare ones.
- Adaptive and Scalable: The procedure scales to very large datasets and can be tuned for different levels of selectivity or granularity, including mixing demonstration retention with chunk-level retention.
- Integration with Broader Learning Pipelines: Quality scores from Demo-SCORE can be used as weights in the policy loss, feeding into curriculum learning or adaptive imitation learning schemes.
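For example, per-demonstration scores could weight a standard behavior-cloning loss. The snippet below is a speculative sketch of such an integration under assumed tensor layouts, not part of Demo-SCORE as published.

```python
import torch

def weighted_bc_loss(policy, batch, demo_scores):
    """Behavior-cloning loss weighted by per-demo quality scores (sketch).

    batch: dict with 'obs', 'actions', and 'demo_idx' tensors.
    demo_scores: tensor of per-demonstration classifier scores in [0, 1].
    """
    pred = policy(batch["obs"])                       # predicted actions
    per_sample = ((pred - batch["actions"]) ** 2).mean(dim=-1)
    weights = demo_scores[batch["demo_idx"]]          # look up each sample's demo score
    return (weights * per_sample).mean()
```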
A plausible implication is that this paradigm will become critical as demonstration datasets grow larger, more heterogeneous, and less manually curated, especially in industrial or dynamic learning settings.
7. Limitations and Future Directions
Demo-SCORE's effectiveness is currently contingent upon the quality of rollout execution and the specificity of task-success signals. Extending the success classifier to richer reward functions, using multimodal policy feedback, or integrating with more sophisticated meta-learning/curriculum mechanisms may further refine its filtering ability. Future research can also explore joint optimization of the imitation policy and demonstration filtration, closing the feedback loop between data curation and policy improvement. Additionally, qualitative assessment of strategy diversity after filtering may help balance robustness with exploration of novel behaviors.
Demo-SCORE represents a principled, empirically validated approach for filtering demonstration datasets by learning—directly from the robot’s performance—what constitutes a reproducibly successful strategy, dramatically enhancing final policy reliability in imitation learning (Chen et al., 5 Mar 2025).