Efficient Data Collection for Robotic Manipulation via Compositional Generalization (2403.05110v2)

Published 8 Mar 2024 in cs.RO, cs.AI, and cs.LG

Abstract: Data collection has become an increasingly important problem in robotic manipulation, yet there still lacks much understanding of how to effectively collect data to facilitate broad generalization. Recent works on large-scale robotic data collection typically vary many environmental factors of variation (e.g., object types, table textures) during data collection, to cover a diverse range of scenarios. However, they do not explicitly account for the possible compositional abilities of policies trained on the data. If robot policies can compose environmental factors from their data to succeed when encountering unseen factor combinations, we can exploit this to avoid collecting data for situations that composition would address. To investigate this possibility, we conduct thorough empirical studies both in simulation and on a real robot that compare data collection strategies and assess whether visual imitation learning policies can compose environmental factors. We find that policies do exhibit composition, although leveraging prior robotic datasets is critical for this on a real robot. We use these insights to propose better in-domain data collection strategies that exploit composition, which can induce better generalization than naive approaches for the same amount of effort during data collection. We further demonstrate that a real robot policy trained on data from such a strategy achieves a success rate of 77.5% when transferred to entirely new environments that encompass unseen combinations of environmental factors, whereas policies trained using data collected without accounting for environmental variation fail to transfer effectively, with a success rate of only 2.5%. We provide videos at http://iliad.stanford.edu/robot-data-comp/.

References (51)
  1. C-vqa: A compositional split of the visual question answering (VQA) v1.0 dataset. arXiv preprint arXiv:1704.08243, 2017.
  2. Autort: Embodied foundation models for large scale orchestration of robotic agents, 2024.
  3. Data quality in imitation learning. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
  4. Covr: A test-bed for visually grounded compositional generalization with real images. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9824–9846, 2021.
  5. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
  6. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  8. What makes pre-trained visual representations successful for robust manipulation? arXiv preprint arXiv:2312.12444, 2023.
  9. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
  10. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, pages 885–897. PMLR, 2020.
  11. An unbiased look at datasets for visuo-motor pre-training. In Conference on Robot Learning, pages 1183–1198. PMLR, 2023.
  12. Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE international conference on robotics and automation (ICRA), pages 2169–2176. IEEE, 2017.
  13. Compositional semantic parsing with large language models. In The Eleventh International Conference on Learning Representations, 2023.
  14. Compositional visual generation with energy based models. Advances in Neural Information Processing Systems, 33:6637–6647, 2020.
  15. Self-supervised visual planning with temporal skip connections. CoRL, 12:16, 2017.
  16. Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets. In Proceedings of Robotics: Science and Systems, New York City, NY, USA, June 2022. doi: 10.15607/RSS.2022.XVIII.063.
  17. Rh20t: A robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023.
  18. Do as i can, not as i say: Grounding language in robotic affordances. In 6th Annual Conference on Robot Learning, 2022.
  19. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017.
  20. Language-driven representation learning for robotics. In Robotics: Science and Systems (RSS), 2023.
  21. Hg-dagger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019.
  22. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations, 2019.
  23. Cogs: A compositional generalization challenge based on semantic interpretation. In 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, pages 9087–9105. Association for Computational Linguistics (ACL), 2020.
  24. Deep compositional robotic planners that follow natural language commands. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 4906–4912. IEEE, 2020.
  25. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pages 2873–2882. PMLR, 2018.
  26. Dart: Noise injection for robust imitation learning. In Conference on robot learning, pages 143–156. PMLR, 2017.
  27. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International journal of robotics research, 37(4-5):421–436, 2018.
  28. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
  29. Learning latent plans from play. In Conference on robot learning, pages 1113–1132. PMLR, 2020.
  30. Vip: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations, 2022.
  31. Where are we in the search for an artificial visual cortex for embodied intelligence? In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  32. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893. PMLR, 2018.
  33. R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning, pages 892–909. PMLR, 2022.
  34. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023.
  35. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023.
  36. The unsurprising effectiveness of pre-trained vision models for control. In International Conference on Machine Learning, pages 17359–17371. PMLR, 2022.
  37. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE, 2016.
  38. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  39. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2022.
  40. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
  41. Visual representation learning does not generalize strongly within the same domain. In International Conference on Learning Representations, 2021.
  42. Multiple interactions made easy (mime): Large scale demonstrations data for imitation. In Conference on robot learning, pages 906–915. PMLR, 2018.
  43. Open-world object manipulation using pre-trained vision-language models. In 7th Annual Conference on Robot Learning, 2023.
  44. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023.
  45. Programmatically grounded, compositionally generalizable robotic manipulation. In The Eleventh International Conference on Learning Representations, 2023.
  46. Decomposing the generalization gap in imitation learning for visual robotic manipulation. arXiv preprint arXiv:2307.03659, 2023.
  47. Kitchenshift: Evaluating zero-shot generalization of imitation-based policy learning under domain shifts. In NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications, 2021.
  48. Neural task programming: Learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3795–3802. IEEE, 2018.
  49. Compositional generalization in unsupervised compositional representation learning: A study on disentanglement and emergent language. Advances in Neural Information Processing Systems, 35:25074–25087, 2022.
  50. Policy architectures for compositional generalization in control. arXiv preprint arXiv:2203.05960, 2022.
  51. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023.
Authors (5)
  1. Jensen Gao (9 papers)
  2. Annie Xie (21 papers)
  3. Ted Xiao (40 papers)
  4. Chelsea Finn (264 papers)
  5. Dorsa Sadigh (162 papers)
Citations (9)

Summary

This paper investigates how to efficiently collect data for training robotic manipulation policies that can generalize to new, unseen scenarios. The core idea is to exploit "compositional generalization," where a policy trained on data covering individual environmental factors (e.g., different object types, various table heights) can successfully operate in situations with unseen combinations of these factors.

The authors hypothesize that if policies can compose learned skills across different environmental variations, data collection can be made significantly more efficient. Instead of collecting data for every possible combination of factors (which scales exponentially, $\mathcal{O}(k^N)$ for $N$ factors with $k$ values each), one could focus on covering individual factor values ($\mathcal{O}(kN)$).
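
For concreteness, a small sketch of this scaling gap, using the simulated multi-factor setting described later ($N=5$ factors, $k=10$ values each); the numbers are illustrative:

```python
# Illustrative comparison of the two coverage regimes.
N = 5   # number of environmental factors (object texture, table texture, ...)
k = 10  # values per factor

complete = k ** N    # every combination of factor values
per_factor = k * N   # every individual factor value seen at least once

print(f"Complete coverage:   {complete:,} combinations")   # 100,000
print(f"Per-factor coverage: {per_factor} combinations")   # 50
```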

Key Questions Addressed:

  1. When do robotic imitation learning policies exhibit compositional generalization?
  2. What are effective data collection strategies to exploit composition for broad generalization while reducing effort?

Data Collection Strategies Proposed and Compared:

The paper defines and evaluates several data collection strategies, visualized for $N=2$ factors:

  • No Variation: Data collected for only a single combination of factor values.
  • Single Factor: Varies only one factor at a time, keeping others at base values.
  • Random: Periodically resamples an entirely new random combination of all factor values.
  • Diagonal: Samples new combinations in which every factor takes a previously unseen value. Covers all values with $\mathcal{O}(kN)$ changes.
  • L: Varies one factor at a time from a base combination of factor values. Covers all values with $\mathcal{O}(kN)$ changes.
  • Stair: Cyclically varies one factor at a time, preserving the other values. Covers all values with $\mathcal{O}(kN)$ changes but captures more diverse combinations than Diagonal or L for the same number of factor changes.
  • Complete: Covers all possible combinations of factor values (often infeasible).

The strategies "Stair," "L," and "Diagonal" are designed to exploit compositional generalization by prioritizing coverage of individual factor values efficiently.
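
A minimal sketch of how these coverage patterns could be enumerated as lists of factor-value combinations; the function names and the convention that index 0 is the base value are illustrative assumptions, not the authors' implementation:

```python
import random
from typing import List, Tuple

Combo = Tuple[int, ...]  # one value index per factor

def l_strategy(N: int, k: int) -> List[Combo]:
    """L: vary one factor at a time away from a base combination (all zeros)."""
    combos = [tuple([0] * N)]
    for factor in range(N):
        for value in range(1, k):
            combo = [0] * N
            combo[factor] = value
            combos.append(tuple(combo))
    return combos

def diagonal_strategy(N: int, k: int) -> List[Combo]:
    """Diagonal: every new combination uses a previously unseen value for each factor."""
    return [tuple([v] * N) for v in range(k)]

def stair_strategy(N: int, k: int) -> List[Combo]:
    """Stair: cyclically advance one factor at a time, keeping the others fixed."""
    combo = [0] * N
    combos = [tuple(combo)]
    for step in range((k - 1) * N):  # (k-1)*N single-factor changes cover all values
        combo[step % N] += 1
        combos.append(tuple(combo))
    return combos

def random_strategy(N: int, k: int, num_combos: int, seed: int = 0) -> List[Combo]:
    """Random: periodically resample an entirely new combination of all factors."""
    rng = random.Random(seed)
    return [tuple(rng.randrange(k) for _ in range(N)) for _ in range(num_combos)]

if __name__ == "__main__":
    for name, combos in [("L", l_strategy(2, 10)),
                         ("Diagonal", diagonal_strategy(2, 10)),
                         ("Stair", stair_strategy(2, 10))]:
        print(f"{name:8s} {len(combos):3d} combinations, first few: {combos[:3]}")
```

Under this enumeration, L, Stair, and Diagonal all cover every individual factor value with on the order of $kN$ factor changes, whereas Complete would require all $k^N$ combinations.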

Experiments:

The authors conduct extensive experiments in both simulation and on a real robot:

  1. Simulation Experiments:
    • Platform: Factor World, a simulation environment supporting variations in environmental factors.
    • Tasks: Pick Place and Door Open.
    • Factors Varied (up to 5): Object position, object texture, table texture, camera position, and distractor objects.
    • Evaluation: Policies are trained with behavior cloning on datasets collected under each strategy; performance is measured by success rate on unseen combinations of factor values (see the evaluation sketch after this list).
    • Settings:
      • Pairwise composition ($N=2$ factors, $k=10$ values each).
      • Multi-factor composition ($N=5$ factors, $k=10$ values each).
  2. Real Robot Experiments:
    • Platform: WidowX 250 6DOF robot arm in a real office kitchen. Task: putting a fork into a container.
    • Factors Varied (primarily 5 physical/visual factors): Object type (forks), container type, table height, table texture, object position.
    • Data Collection: Human demonstrations (160 total for "L" and "Stair" strategies, covering 16 combinations each).
    • Policy: Goal-conditioned behavior cloning with a diffusion policy, with and without pre-training/co-fine-tuning on BridgeData V2 (a large prior robotic dataset).
    • Evaluations:
      • Pairwise Composition: Assessed in the "BaseKitch" environment on 9 unseen combinations for each of 10 factor pairs.
      • Out-of-Domain (OOD) Transfer: Policies trained in "BaseKitch" are tested in two new kitchens ("CompKitch," "TileKitch") with inherent differences (e.g., table texture, lighting, distractors) and additional factor shifts.
      • Unaccounted Factors: Robustness to distractor objects (a held-out factor) in BaseKitch.
      • Camera Position Composition: Composition of camera position (main vs. secondary camera) with table texture.
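
To make the evaluation protocol concrete, here is a schematic sketch (hypothetical helper names, not the authors' code) of measuring pairwise composition: a trained policy is rolled out on every combination of two factor values that never appeared together in training, and the success rate over those unseen combinations is reported. `rollout_fn` stands in for running one episode with the environment configured to the given factor values.

```python
from itertools import product
from typing import Callable, Sequence, Set, Tuple

def pairwise_composition_success(
    rollout_fn: Callable[[object, object], bool],  # hypothetical: runs one episode, returns success
    factor_a_values: Sequence,
    factor_b_values: Sequence,
    seen_combos: Set[Tuple[object, object]],       # combinations present in the training data
    episodes_per_combo: int = 10,
) -> float:
    """Success rate of a trained policy on factor combinations never seen in training."""
    unseen = [(a, b) for a, b in product(factor_a_values, factor_b_values)
              if (a, b) not in seen_combos]
    successes, total = 0, 0
    for a, b in unseen:
        for _ in range(episodes_per_combo):
            successes += int(rollout_fn(a, b))
            total += 1
    return successes / max(total, 1)
```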

Key Findings:

  • Compositional Generalization Exists:
    • In simulation, policies showed strong pairwise compositional abilities. Strategies like Stair, L, and Diagonal outperformed Random and approached Complete with fewer factor changes.
    • On the real robot, policies also exhibited composition, particularly when leveraging prior data (BridgeData V2). The "L" strategy with prior data succeeded in 59/90 unseen pairwise combinations, compared to 28/90 without prior data and 22/90 for "No Variation" with prior data.
  • Prior Data is Crucial for Real Robots:
    • Leveraging prior datasets like BridgeData V2 significantly enhanced compositional abilities on the real robot. This was less critical in the cleaner simulation environment.
    • Co-fine-tuning on a mix of in-domain and prior data was generally more effective than just fine-tuning a pre-trained model.
    • Prior data also helped maintain robustness to unaccounted factors (e.g., distractors) that might be negatively impacted by sparse in-domain data collection strategies.
  • Effective Data Collection Strategies:
    • Stair generally performed best across simulation (especially for $N=5$ factors) and real-robot experiments (best OOD transfer: 31/40 success rate with co-fine-tuning). It balances efficient coverage of individual factors with exposure to a greater diversity of combinations compared to L or Diagonal.
    • L also showed strong performance, particularly for pairwise composition analysis and OOD transfer (24/40 success with co-fine-tuning). It can be practically easier if varying factors separately is more convenient.
    • Strategies exploiting composition (Stair, L, Diagonal) significantly outperformed "No Variation" and often "Random" for the same data collection effort (measured by factor changes).
  • Challenges in Composition:
    • Composition was generally weaker for pairs of physical factors that interact in complex ways (e.g., object position and table height, both affecting grasp motion). Visual factors or factors with less physical interaction composed more easily.
  • Out-of-Domain Transfer:
    • The best policy (Stair + BridgeData V2 co-fine-tuning) achieved a 77.5% (31/40) success rate in entirely new kitchens with unseen combinations of factors.
    • Policies trained without prior data or without variation in in-domain data failed to transfer effectively (e.g., 0/40 and 1/40 respectively).

Practical Implications and Contributions:

The paper provides actionable insights for roboticists collecting in-domain data:

  1. Prioritize Factor Variation: Even if not all combinations can be covered, varying individual factors is crucial.
  2. Use Efficient Strategies: Strategies like "Stair" or "L" can achieve good generalization with significantly less data collection effort than trying to cover all combinations or relying on purely random variations.
  3. Leverage Prior Datasets: Incorporating large, diverse prior datasets (like BridgeData V2) through pre-training and co-fine-tuning is critical for robust compositional generalization, especially on real robots.
  4. Consider Factor Interactions: Be aware that composition might be harder for factors that have complex physical interactions. More data might be needed for such combinations.
  5. Co-fine-tuning is Preferred: When using prior data, co-fine-tuning on a mix of prior and new in-domain data appears more effective than fine-tuning alone; a minimal data-mixing sketch follows below.
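
As a rough illustration of co-fine-tuning, a minimal sketch of mixing in-domain demonstrations with a large prior dataset in each training batch; the 50/50 mixing ratio and trajectory-level sampling are assumptions for illustration, not values reported in the paper:

```python
import random
from typing import Iterator, List, Sequence

def cofinetune_batches(
    in_domain: Sequence,          # newly collected in-domain demonstrations
    prior: Sequence,              # large prior dataset, e.g. BridgeData V2 trajectories
    batch_size: int = 64,
    prior_fraction: float = 0.5,  # assumed mixing ratio
    seed: int = 0,
) -> Iterator[List]:
    """Yield training batches that mix prior-dataset and in-domain demonstrations."""
    rng = random.Random(seed)
    n_prior = int(batch_size * prior_fraction)
    n_new = batch_size - n_prior
    while True:
        batch = rng.choices(prior, k=n_prior) + rng.choices(in_domain, k=n_new)
        rng.shuffle(batch)
        yield batch
```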

The research demonstrates that by understanding and exploiting the compositional generalization capabilities of imitation learning policies, data collection for robotic manipulation can be made more systematic and efficient, leading to policies that generalize better to novel environments and tasks.