Curating Demonstrations using Online Experience (2503.03707v2)

Published 5 Mar 2025 in cs.RO, cs.AI, and cs.LG

Abstract: Many robot demonstration datasets contain heterogeneous demonstrations of varying quality. This heterogeneity may benefit policy pre-training, but can hinder robot performance when used with a final imitation learning objective. In particular, some strategies in the data may be less reliable than others or may be underrepresented in the data, leading to poor performance when such strategies are sampled at test time. Moreover, such unreliable or underrepresented strategies can be difficult even for people to discern, and sifting through demonstration datasets is time-consuming and costly. On the other hand, policy performance when trained on such demonstrations can reflect the reliability of different strategies. We thus propose for robots to self-curate based on online robot experience (Demo-SCORE). More specifically, we train and cross-validate a classifier to discern successful policy roll-outs from unsuccessful ones and use the classifier to filter heterogeneous demonstration datasets. Our experiments in simulation and the real world show that Demo-SCORE can effectively identify suboptimal demonstrations without manual curation. Notably, Demo-SCORE achieves over 15-35% higher absolute success rate in the resulting policy compared to the base policy trained with all original demonstrations.

Summary

  • The paper presents Demo-SCORE, a method that automates filtering of robot demonstration datasets using online self-curation to enhance imitation learning.
  • It employs a four-step process of initial policy training, rollout evaluation, classifier development, and dataset filtering to identify and retain reliable demonstrations.
  • Experimental results show absolute success-rate gains of 15-35% on simulated tasks and nearly 30% on real-world tasks relative to policies trained on the unfiltered data.

Curating Demonstrations Using Online Experience

The paper "Curating Demonstrations using Online Experience" presents Demo-SCORE, an approach to automate the filtering of robot demonstration datasets through online self-curation by the robot. This method aims to improve policy performance by identifying and discarding suboptimal demonstrations that are not beneficial for learning reliable imitation strategies.

Introduction

Demo-SCORE addresses the challenge of leveraging heterogeneous demonstration datasets that contain a mix of high- and low-quality demonstrations. As robot learning environments become more complex, they call for diverse demonstration strategies that are not always reliable for robots to imitate. Human demonstrations often include suboptimal strategies due to variance in human execution, and imitating them indiscriminately leads to poor robot performance (Figure 1).

Figure 1: Human demonstrations can involve unreliable strategies, such as picking up a spoon by its edge, which may not translate effectively to robotic execution.

Methodology

Demo-SCORE automates the curation of demonstration datasets in four key steps (sketched in code below):

  1. Initial Policy Training: A policy is trained on the full set of demonstrations, capturing the variability in strategies.
  2. Policy Evaluation through Rollouts: Rollouts are conducted at various checkpoints to evaluate policy performance.
  3. Classifier Development: A classifier is trained using the rollout data to distinguish between successful and failed rollouts, identifying reliable strategies.
  4. Dataset Filtering: Using the classifier, unreliable demonstrations are filtered out to refine the dataset for further policy training (Figure 2).

    Figure 2: Illustration of the Demo-SCORE workflow, including policy training, evaluation, classification, and demonstration filtering.
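
To make these steps concrete, below is a minimal sketch of such a self-curation loop. This is not the authors' released implementation: the scikit-learn MLP, the per-state success labels, the demo dictionary layout, and the 0.5 keep-threshold are all illustrative assumptions; only the overall structure (fit a success classifier on policy rollouts, cross-validate it across checkpoints, then score and filter demonstrations) follows the paper's description.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_quality_classifier(states_per_rollout, rollout_success):
    """Fit a classifier mapping individual states to the probability that
    the rollout they came from succeeds. Labeling every state with the
    rollout's final outcome is a simple, assumed credit-assignment choice."""
    X = np.concatenate(states_per_rollout)  # (total_steps, state_dim)
    y = np.concatenate([np.full(len(s), ok)
                        for s, ok in zip(states_per_rollout, rollout_success)])
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
    clf.fit(X, y)
    return clf

def score_trajectory(clf, states):
    """Average predicted success probability over a trajectory's states."""
    return clf.predict_proba(states)[:, 1].mean()

def select_classifier(rollouts_by_checkpoint):
    """Cross-validate across policy checkpoints: train on one checkpoint's
    rollouts, validate on the others, and keep the classifier that
    generalizes best. This guards against memorizing the quirks of a
    single policy rather than properties of the underlying strategies.
    Assumes rollouts from at least two checkpoints."""
    best_clf, best_acc = None, -1.0
    for i, (states, success) in enumerate(rollouts_by_checkpoint):
        clf = train_quality_classifier(states, success)
        accs = []
        for j, (s_val, y_val) in enumerate(rollouts_by_checkpoint):
            if j == i:
                continue
            preds = [score_trajectory(clf, s) >= 0.5 for s in s_val]
            accs.append(np.mean([p == y for p, y in zip(preds, y_val)]))
        if np.mean(accs) > best_acc:
            best_clf, best_acc = clf, np.mean(accs)
    return best_clf

def filter_demos(clf, demos, threshold=0.5):
    """Keep demonstrations whose states the classifier scores as likely to
    succeed. The dict layout and fixed threshold are assumptions."""
    return [d for d in demos if score_trajectory(clf, d["states"]) >= threshold]
```

The retained demonstrations would then be used to retrain the policy (step 1 again, on the filtered set). The trajectory score here is a plain mean over per-state probabilities; the threshold trades off how aggressively unreliable strategies are pruned against how much data survives for retraining.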

Experimental Evaluation

The effectiveness of Demo-SCORE was validated through experiments on both simulated and real-world tasks.

Simulated Experiments

Two simulated tasks were evaluated: bimanual peg insertion in the ALOHA environment and the square peg task in Robosuite.

  • In the square peg task, the datasets combined human and scripted demonstrations with varying strategies, in two mixtures referred to as 'SquareA' and 'SquareB'.
  • Results indicate that Demo-SCORE outperforms traditional curation and rollout-based methods, with absolute success-rate increases of 15-35% from filtering out inferior demonstrations (Figure 3).

    Figure 3: Success rates of policies, showing superior performance of Demo-SCORE across different demonstration mixtures.

Real-World Tasks

The real-world evaluation was conducted using a multi-task ALOHA setup, which included tasks like spoon placement in a rack and strawberry picking from a cluttered scene.

  • Demo-SCORE improved the average success rate of policies by nearly 30% compared to the baseline trained on unfiltered demonstrations (Figure 4).

    Figure 4: Enhanced success rates observed with real-world ALOHA tasks when applying Demo-SCORE filtering techniques.

Robustness and Generalization

Demo-SCORE also remains robust under out-of-distribution (OOD) initial conditions. Experiments showed that even with significant changes to initial task setups, or when challenged with OOD environments, the datasets curated by Demo-SCORE supported effective policy generalization without loss of state coverage.

Conclusion

Demo-SCORE presents a scalable and effective method for enhancing the performance of robots learning from demonstrations by leveraging online experiences to curate datasets. It effectively discerns and eliminates unreliable strategies that human curation may overlook, thereby improving policy success rates across various tasks. Future work includes extending this approach to broader robotic scenarios and refining classifier techniques for more sophisticated demonstration datasets.
