RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents (2403.19622v1)

Published 28 Mar 2024 in cs.RO and cs.CV

Abstract: The ultimate goal of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments. Recent progress in utilizing LLMs as high-level planners has demonstrated that the complexity of tasks can be reduced by decomposing them into primitive-level plans, making it possible to generalize to novel robotic tasks in a composable manner. Despite this promising direction, the community is not yet adequately prepared for composable generalization agents, particularly due to the lack of primitive-level real-world robotic datasets. In this paper, we propose a primitive-level robotic dataset, namely RH20T-P, which contains about 33,000 video clips covering 44 diverse and complicated robotic tasks. Each clip is manually annotated according to a set of meticulously designed primitive skills, facilitating the future development of composable generalization agents. To validate the effectiveness of RH20T-P, we also construct RA-P, a scalable agent built on RH20T-P. Equipped with two planners specialized in task decomposition and motion planning, RA-P can adapt to novel physical skills through composable generalization. Our website and videos can be found at https://sites.google.com/view/rh20t-primitive/main. The dataset and code will be made available soon.
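The abstract describes two ingredients: clips annotated at the level of primitive skills, and an agent (RA-P) that chains a task-decomposition planner with a motion planner. As a rough illustration only, the sketch below shows one way such an annotation and a two-stage planning pipeline could be represented in Python. The class names, primitive vocabulary, and planner stubs are invented for this sketch and do not reflect the released RH20T-P schema or the actual RA-P implementation.

```python
# Hypothetical sketch (not the RH20T-P schema or RA-P code): a primitive-level
# clip annotation plus a toy two-stage planner, assuming primitives such as
# "move_to", "grasp", and "lift" with simple string targets.
from dataclasses import dataclass, field
from typing import List


@dataclass
class PrimitiveStep:
    skill: str          # e.g. "move_to", "grasp", "lift"
    target: str         # object or location the skill acts on
    start_frame: int    # clip frame where the primitive begins
    end_frame: int      # clip frame where the primitive ends


@dataclass
class ClipAnnotation:
    task: str                                   # high-level task instruction
    steps: List[PrimitiveStep] = field(default_factory=list)


def plan_task(instruction: str) -> List[str]:
    """Stand-in for the high-level task-decomposition planner:
    maps an instruction to an ordered list of primitive skills."""
    # A real planner would query an LLM/MLLM; here one case is hard-coded.
    if "pick up" in instruction:
        return ["move_to", "grasp", "lift"]
    return []


def plan_motion(primitive: str, target: str) -> dict:
    """Stand-in for the low-level motion planner: turns one primitive
    into a (toy) motion command a controller could execute."""
    return {"primitive": primitive, "target": target, "trajectory": []}


if __name__ == "__main__":
    # Example annotation of a single clip, primitive by primitive.
    clip = ClipAnnotation(
        task="pick up the red cup",
        steps=[
            PrimitiveStep("move_to", "red cup", 0, 40),
            PrimitiveStep("grasp", "red cup", 41, 70),
            PrimitiveStep("lift", "red cup", 71, 110),
        ],
    )
    # Two-stage planning: decompose the task, then plan each primitive's motion.
    commands = [plan_motion(p, "red cup") for p in plan_task(clip.task)]
    print(commands)
```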

Authors (12)
  1. Zeren Chen (8 papers)
  2. Zhelun Shi (9 papers)
  3. Xiaoya Lu (4 papers)
  4. Lehan He (3 papers)
  5. Sucheng Qian (4 papers)
  6. Hao Shu Fang (1 paper)
  7. Zhenfei Yin (41 papers)
  8. Wanli Ouyang (358 papers)
  9. Jing Shao (109 papers)
  10. Yu Qiao (563 papers)
  11. Cewu Lu (203 papers)
  12. Lu Sheng (63 papers)
Citations (2)