DataMIL: Selecting Data for Robot Imitation Learning with Datamodels (2505.09603v1)

Published 14 May 2025 in cs.RO and cs.LG

Abstract: Recently, the robotics community has amassed ever larger and more diverse datasets to train generalist robot policies. However, while these policies achieve strong mean performance across a variety of tasks, they often underperform on individual, specialized tasks and require further tuning on newly acquired task-specific data. Combining task-specific data with carefully curated subsets of large prior datasets via co-training can produce better specialized policies, but selecting data naively may actually harm downstream performance. To address this, we introduce DataMIL, a policy-driven data selection framework built on the datamodels paradigm that reasons about data selection in an end-to-end manner, using the policy itself to identify which data points will most improve performance. Unlike standard practices that filter data using human notions of quality (e.g., based on semantic or visual similarity), DataMIL directly optimizes data selection for task success, allowing us to select data that enhance the policy while dropping data that degrade it. To avoid performing expensive rollouts in the environment during selection, we use a novel surrogate loss function on task-specific data, allowing us to use DataMIL in the real world without degrading performance. We validate our approach on a suite of more than 60 simulation and real-world manipulation tasks - most notably showing successful data selection from the Open X-Embodiment datasets-demonstrating consistent gains in success rates and superior performance over multiple baselines. Our results underscore the importance of end-to-end, performance-aware data selection for unlocking the potential of large prior datasets in robotics. More information at https://robin-lab.cs.utexas.edu/datamodels4imitation/

Summary

Overview of DataMIL: Selecting Data for Robot Imitation Learning with Datamodels

DataMIL presents a nuanced approach to data selection for robot imitation learning, addressing specific challenges that arise when training policies using diverse robotic datasets. This paper proposes a policy-driven data selection framework built on the datamodels paradigm, which predicts the influence of specific data points on the performance of a trained policy. Unlike conventional methods that rely on heuristics or human-centric notions of data quality, DataMIL optimizes data selection directly for task success by evaluating how each piece of data impacts desired outcomes.

Core Contributions

  1. Datamodels Framework Extension: The methodology leverages datamodels to estimate the influence of individual data points without expensive policy rollouts, a significant advancement given the high cost associated with such evaluations in robotics. The framework is adapted specifically for robotic settings, considering the unique challenges of heterogeneous, large-scale datasets and varying embodiments.
  2. Proxy Metrics and Surrogates: To address the impracticality of real-world rollouts for evaluating data impact, DataMIL utilizes a novel proxy metric that estimates validation loss as a surrogate for eventual task success. This proxy metric ensures that data selection is feasible without compromising accuracy.
  3. Empirical Validation: DataMIL has been tested on more than 60 tasks across simulation and real-world robotic manipulation scenarios. The approach has shown substantial improvements in success rates compared to state-of-the-art baselines, demonstrating its effectiveness in curating datasets that enhance policy performance.
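Once per-point influence estimates are available (from the surrogate validation-loss proxy rather than rollouts), selection reduces to ranking and filtering: keep the prior-data points estimated to help and drop those estimated to hurt. The sketch below is a hypothetical helper illustrating that step; the function name, the `top_frac` parameter, and the example scores are assumptions, not APIs from the paper.

```python
import numpy as np

def select_cotraining_data(influence, prior_data, top_frac=0.3):
    """Keep the prior-data points whose estimated influence on the
    surrogate objective (e.g., validation loss on task-specific demos)
    is most beneficial, and drop points estimated to degrade the policy."""
    k = max(1, int(len(prior_data) * top_frac))
    order = np.argsort(influence)[::-1]          # most helpful first
    keep = [i for i in order[:k] if influence[i] > 0.0]
    return [prior_data[i] for i in keep]

# Hypothetical influence scores for five prior-dataset episodes.
scores = np.array([0.8, -0.5, 0.1, 0.9, -0.2])
data = ["ep0", "ep1", "ep2", "ep3", "ep4"]
chosen = select_cotraining_data(scores, data, top_frac=0.6)
print(chosen)  # ['ep3', 'ep0', 'ep2']
```

Note that the positivity filter matters: unlike similarity-based heuristics, this scheme can discard superficially relevant data whose estimated effect on task success is negative.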

Implications and Future Directions

DataMIL has significant implications for training specialized robotic policies from large prior datasets. By enabling more precise data selection, it helps bridge the gap between broad generalization and task-specific competence, potentially reducing the need for extensive post-training fine-tuning. Future work may explore alternative datamodel estimators or more scalable selection procedures for even larger datasets and models. The datamodels methodology could also be combined with other emerging AI techniques to further improve the adaptability and optimization of robotic tasks.

Numerical Results and Claims

The reported success rates, up to seven times higher than those of policies trained on non-curated datasets, underscore DataMIL's efficacy in optimizing data selection for robot imitation learning. These strong empirical results support the authors' claim that performance-aware, end-to-end data selection is crucial for leveraging large prior datasets effectively.

Conclusions

DataMIL offers a practical and theoretically grounded approach to improving imitation learning outcomes for robots by combining task-specific data with carefully curated subsets of large prior datasets. It pioneers the efficient use of datamodels in this domain and sets the stage for future advances in data-driven robotics. Through robust validation across simulated and real-world tasks, DataMIL has demonstrated its potential to significantly refine the data curation process and produce more competent specialized policies.

