
BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation (2403.09227v1)

Published 14 Mar 2024 in cs.RO and cs.AI

Abstract: We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with rich physical and semantic properties. The second is OMNIGIBSON, a novel simulation environment that supports these activities via realistic physics simulation and rendering of rigid bodies, deformable bodies, and liquids. Our experiments indicate that the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions. To calibrate the simulation-to-reality gap of BEHAVIOR-1K, we provide an initial study on transferring solutions learned with a mobile manipulator in a simulated apartment to its real-world counterpart. We hope that BEHAVIOR-1K's human-grounded nature, diversity, and realism make it valuable for embodied AI and robot learning research. Project website: https://behavior.stanford.edu.

Authors (35)
  1. Chengshu Li
  2. Ruohan Zhang
  3. Josiah Wong
  4. Cem Gokmen
  5. Sanjana Srivastava
  6. Roberto Martín-Martín
  7. Chen Wang
  8. Gabrael Levine
  9. Wensi Ai
  10. Hang Yin
  11. Michael Lingelbach
  12. Minjune Hwang
  13. Ayano Hiranaka
  14. Sujay Garlanka
  15. Arman Aydin
  16. Sharon Lee
  17. Jiankai Sun
  18. Mona Anvari
  19. Manasi Sharma
  20. Dhruva Bansal
Citations (21)

Summary

  • The paper introduces BEHAVIOR-1K, a human-centered benchmark based on survey insights to define 1,000 everyday activities for evaluating robotic assistance.
  • It employs OmniGibson, an advanced simulation environment that realistically models physical dynamics and complex object interactions.
  • Evaluations reveal significant challenges requiring long-horizon planning and precise manipulation, highlighting the simulation-to-reality gap in current robotic learning.

A Comprehensive Exploration of BEHAVIOR-1K: Challenges in Human-Centered Robotics

Introduction to BEHAVIOR-1K

Recent advances in robotics and embodied AI have underscored the need for more diverse and more realistic simulation environments and benchmarks. BEHAVIOR-1K is an ambitious effort to build a benchmark around everyday activities grounded directly in human needs and preferences. Informed by a survey of 1,461 participants, it defines 1,000 everyday activities and pairs them with OmniGibson, a new simulation environment that realizes these activities in virtual, interactive, and ecologically realistic settings. Initial evaluations show that BEHAVIOR-1K poses a substantial challenge, stretching the capabilities of even state-of-the-art robot learning algorithms.

Survey Foundation

BEHAVIOR-1K began with an extensive survey designed to capture what people actually want from robotic assistance. Requested activities ranged from typical household chores to more specific cleaning and cooking tasks, and their distribution showed considerable variance, reinforcing the need for a benchmark that covers many activities rather than a handful. The key insights pointed toward prioritizing diversity in scene types and objects, and realism in the physical processes and interactions the activities involve.

The BEHAVIOR-1K Dataset

Central to BEHAVIOR-1K is its dataset, which defines the 1,000 activities across 50 diverse scenes containing more than 9,000 objects. Each object and scene is annotated in detail so that simulations can come as close to real-world scenarios as possible. Activities are specified in the BEHAVIOR Domain Definition Language (BDDL), which captures initial and goal conditions along with the object properties involved, providing a symbolic framework for building and evaluating robotic models; a schematic example is sketched below.
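
As a concrete illustration, the sketch below shows the general shape of a BDDL activity definition: a PDDL-style problem whose object categories are WordNet synsets and whose initial and goal conditions are symbolic predicates that the simulator samples and checks. The activity name and the specific predicates here are illustrative stand-ins, not an entry copied from the dataset.

```
(define (problem cleaning_kitchen_cupboard-0)
    (:domain omnigibson)
    ;; Object instances are typed by WordNet synsets, e.g. cupboard.n.01.
    (:objects
        cupboard.n.01_1 - cupboard.n.01
        rag.n.01_1 - rag.n.01
        floor.n.01_1 - floor.n.01
        agent.n.01_1 - agent.n.01
    )
    ;; Symbolic initial conditions; a sampler instantiates a concrete
    ;; scene configuration that satisfies them.
    (:init
        (dusty cupboard.n.01_1)
        (inside rag.n.01_1 cupboard.n.01_1)
        (onfloor agent.n.01_1 floor.n.01_1)
    )
    ;; Goal conditions are checked against simulator state to score success.
    (:goal
        (and
            (not (dusty cupboard.n.01_1))
        )
    )
)
```

Because the conditions are symbolic rather than tied to one scene layout, the same activity definition can be instantiated in many scenes and with many object instances.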

OmniGibson Simulation Environment

OmniGibson represents a qualitative leap in simulation realism and functionality. Built on NVIDIA's Omniverse and PhysX 5, it simulates rigid bodies, deformable bodies, and fluids. It also tracks extended object states such as temperature, wetness, and whether an appliance is toggled on, enabling faithful depiction of multi-step physical processes like cooking. By combining realistic physics with high-quality rendering, OmniGibson sets a new standard for embodied AI simulation.
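
Conceptually, each extended state can be pictured as a per-object variable updated every simulation step from the object's physical context, with unary predicates such as "cooked" derived from thresholds on those variables. The following minimal Python sketch illustrates the idea under assumed dynamics; the class, thresholds, and update rules are hypothetical and do not reproduce OmniGibson's actual API.

```python
from dataclasses import dataclass

@dataclass
class ObjectStates:
    """Hypothetical extended states attached to one simulated object."""
    temperature: float = 23.0  # degrees Celsius (ambient)
    wetness: float = 0.0       # 0.0 (dry) to 1.0 (soaked)
    cooked: bool = False

    def step(self, near_heat_source: bool, in_fluid: bool, dt: float) -> None:
        """Advance the extended states by one simulation step of length dt."""
        if near_heat_source:
            self.temperature += 10.0 * dt            # assumed heating rate
        else:
            # Relax back toward ambient temperature when away from heat.
            self.temperature += (23.0 - self.temperature) * 0.05 * dt
        if in_fluid:
            self.wetness = min(1.0, self.wetness + 0.5 * dt)
        else:
            self.wetness = max(0.0, self.wetness - 0.02 * dt)  # slow drying
        if self.temperature >= 70.0:                 # assumed cooking threshold
            self.cooked = True                       # latches once crossed

# Example: an object left on an active heat source crosses the threshold.
steak = ObjectStates()
for _ in range(60):                                  # 60 steps of 0.1 s
    steak.step(near_heat_source=True, in_fluid=False, dt=0.1)
print(f"{steak.temperature:.1f} C, cooked={steak.cooked}")  # ~83.0 C, cooked=True
```

In the real system, state variables like these back the symbolic BDDL predicates, so a goal condition such as (cooked steak.n.01_1) can be evaluated directly from simulation state.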

Challenges and Initial Evaluations

Preliminary evaluations using BEHAVIOR-1K demonstrate how demanding the benchmark is: its activities require long-horizon planning and sophisticated manipulation skills, both areas where current methods falter. An initial study transferring policies learned with a mobile manipulator in a simulated apartment to its real-world counterpart also quantifies the simulation-to-reality gap, providing a useful calibration point for future research in robot learning.

Implications and Future Directions

With its human-centered design and realistic simulation capabilities, BEHAVIOR-1K represents a significant step toward advanced robotic assistance. It presents the research community with an expansive set of challenges while providing a framework built for continual refinement. Its adaptability and extensibility make it a promising foundation for future work on robotic applications that serve human needs more effectively.

Project Website: https://behavior.stanford.edu