BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation (2405.09546v1)

Published 15 May 2024 in cs.CV

Abstract: The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: https://behavior-vision-suite.github.io/

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Introduction

Developing and evaluating computer vision models requires large datasets tailored to specific research questions. Real-world datasets rarely meet this need: they are expensive to collect, their labels can be noisy or incomplete, and their capture conditions are fixed once recorded. Synthetic data generation offers an alternative, but existing generators often fall short in asset quality, rendering fidelity, diversity, and physical realism. The BEHAVIOR Vision Suite (BVS) tackles these challenges with a toolkit for generating fully customized synthetic datasets.

What is BEHAVIOR Vision Suite (BVS)?

BVS is built on the BEHAVIOR-1K benchmark and consists of two main components:

  1. Extended BEHAVIOR-1K Assets: A diverse collection of over 8,000 object models and 1,000 scene instances. These assets cover a wide range of categories and include features like articulated joints and fluid dynamics for realistic simulations.
  2. Customizable Dataset Generator: A robust software tool that uses these assets to create tailored datasets. The generator supports a wide variety of parameters at the scene, object, and camera levels, ensuring physical plausibility through a physics engine.
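
To make the generator's parameter surface concrete, the sketch below models scene-, object-, and camera-level knobs as a plain configuration object. This is a minimal illustration only; the field names are hypothetical and do not reflect the actual BVS generator API (which builds on the BEHAVIOR-1K / OmniGibson stack).

```python
# Hypothetical configuration sketch; field names are illustrative,
# not the real BVS generator schema.
from dataclasses import dataclass, field

@dataclass
class SceneParams:
    scene_id: str = "house_01"        # which of the ~1,000 scene instances to load
    light_intensity: float = 1.0      # relative lighting level
    object_density: float = 0.5       # how cluttered the scene is

@dataclass
class ObjectParams:
    joint_openness: float = 0.0       # 0 = closed, 1 = fully open (articulated objects)
    semantic_states: dict = field(default_factory=lambda: {"filled": False, "folded": False})

@dataclass
class CameraParams:
    focal_length_mm: float = 24.0
    fov_deg: float = 90.0
    height_m: float = 1.2             # camera height above the floor

@dataclass
class GenerationConfig:
    scene: SceneParams = field(default_factory=SceneParams)
    objects: ObjectParams = field(default_factory=ObjectParams)
    camera: CameraParams = field(default_factory=CameraParams)
    num_frames: int = 100             # frames to render for this configuration

config = GenerationConfig()
print(config)
```

A researcher running a controlled experiment would vary exactly one field (for example, light_intensity) while holding the rest fixed.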

Key Features

Here's what makes BVS special:

  • Comprehensive Labels: Generates labels at the image, object, and pixel levels (e.g., scene graphs, point clouds, segmentation masks); see the sketch after this list for a typical per-frame annotation record.
  • Diverse and Photorealistic: Covers a wide array of indoor scenes and objects with high visual and physical fidelity.
  • Customizability: Users can adjust parameters like object poses, semantic states, lighting conditions, and camera settings.
  • User-friendly Tooling: Includes utilities for generating data tailored to specific research needs.
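
The per-frame annotation record below sketches how the label types listed above might be bundled together. The field names and array shapes are assumptions for illustration, not the actual BVS output format.

```python
# Hypothetical per-frame annotation record; names and shapes are
# illustrative, not the actual BVS output format.
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameAnnotations:
    rgb: np.ndarray            # (H, W, 3) rendered image
    depth: np.ndarray          # (H, W) per-pixel depth in meters
    semantic_seg: np.ndarray   # (H, W) category ID per pixel
    instance_seg: np.ndarray   # (H, W) instance ID per pixel
    boxes_2d: np.ndarray       # (N, 4) object bounding boxes (x1, y1, x2, y2)
    point_cloud: np.ndarray    # (P, 3) back-projected points in the camera frame
    scene_graph: list          # [(subject_id, relation, object_id), ...]

H, W = 480, 640
frame = FrameAnnotations(
    rgb=np.zeros((H, W, 3), dtype=np.uint8),
    depth=np.zeros((H, W), dtype=np.float32),
    semantic_seg=np.zeros((H, W), dtype=np.int32),
    instance_seg=np.zeros((H, W), dtype=np.int32),
    boxes_2d=np.zeros((0, 4), dtype=np.float32),
    point_cloud=np.zeros((0, 3), dtype=np.float32),
    scene_graph=[],
)
```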

Applications and Experiments

BVS's utility is demonstrated through three primary applications, showcasing its robustness and versatility:

1. Parametric Model Evaluation

In this application, BVS is used to test model robustness against continuous variations in parameters such as lighting, occlusion, and object articulation. The dataset includes up to 500 video clips per parameter axis, and the resulting evaluations reveal significant performance differences among current state-of-the-art (SOTA) models. For instance, models generally struggled to detect objects under low-light conditions or when objects were heavily occluded. This kind of controlled, systematic evaluation is difficult to achieve with real-world datasets but straightforward with BVS.
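
The evaluation protocol amounts to sweeping one axis while holding everything else fixed, then scoring a detector at each setting. The sketch below outlines that loop; generate_clip and evaluate_detector are hypothetical stand-ins, not BVS or detector APIs.

```python
# Sketch of a parametric robustness sweep. generate_clip() and
# evaluate_detector() are hypothetical stand-ins, not real BVS/detector APIs.
import numpy as np

def generate_clip(light_intensity: float, seed: int) -> list:
    """Stand-in for the dataset generator: returns a list of synthetic frames."""
    rng = np.random.default_rng(seed)
    return [{"frame": i, "light": light_intensity, "noise": rng.normal()} for i in range(30)]

def evaluate_detector(clip: list) -> float:
    """Stand-in for running a detector on a clip and computing its mAP."""
    light = clip[0]["light"]
    return float(np.clip(0.8 * light + 0.05 * clip[0]["noise"], 0.0, 1.0))  # fake score

light_levels = np.linspace(0.1, 1.0, 10)   # one continuous axis of domain shift
for light in light_levels:
    scores = [evaluate_detector(generate_clip(light, seed=s)) for s in range(5)]
    print(f"light={light:.2f}  mean mAP={np.mean(scores):.3f}")
```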

2. Holistic Scene Understanding

BVS generated a large-scale dataset containing over 266,000 frames, each annotated with various labels like segmentation masks and depth maps. This comprehensive dataset was used to benchmark SOTA models on several tasks, including object detection, segmentation, depth estimation, and point cloud reconstruction. Interestingly, the relative performance of these models on the synthetic dataset closely matched their performance on real-world datasets, validating the photorealism and utility of BVS-generated data.
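
One way to quantify "relative performance closely matched" is a rank-correlation check between model scores on the synthetic and real benchmarks. The sketch below uses illustrative numbers, not the paper's actual results.

```python
# Rank-correlation check between synthetic and real benchmark scores.
# The scores below are illustrative placeholders, not the paper's results.
from scipy.stats import spearmanr

models        = ["detector_a", "detector_b", "detector_c", "detector_d"]
synthetic_map = [0.61, 0.55, 0.48, 0.42]   # mAP on BVS-generated images
real_map      = [0.58, 0.51, 0.47, 0.40]   # mAP on a real-world benchmark

rho, p_value = spearmanr(synthetic_map, real_map)
print(f"Spearman rank correlation: {rho:.2f} (p={p_value:.3f})")
```

A correlation near 1 indicates the synthetic benchmark preserves the model ranking observed on real data.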

3. Object States and Relations Prediction

This application targets a novel vision task: predicting unary object states and binary relations. BVS generated 12,500 images with labels such as "open," "closed," "on top of," and "inside." A model trained solely on this synthetic data transferred well to real-world images, highlighting BVS's potential for sim-to-real transfer, and it outperformed zero-shot baselines such as CLIP, demonstrating the value of task-specific training data.
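
For reference, the kind of zero-shot baseline the trained model is compared against can be probed by scoring state descriptions against an image crop with CLIP. The snippet below follows standard openai/CLIP usage; the prompts and image path are illustrative placeholders.

```python
# Zero-shot state prediction with CLIP (standard openai/CLIP usage).
# Prompts and the image path are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cabinet_crop.png")).unsqueeze(0).to(device)
prompts = ["a photo of an open cabinet", "a photo of a closed cabinet"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

for prompt, p in zip(prompts, probs):
    print(f"{p:.2f}  {prompt}")
```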

Implications and Future Directions

The capabilities of BVS offer practical and theoretical benefits:

  • Practical: Researchers can create large-scale, customized datasets for specific tasks, reducing the reliance on costly and inflexible real-world data.
  • Theoretical: The ability to systematically vary parameters and observe model performance can help identify weaknesses and guide improvements in computer vision models.

Future developments could include expanding the range of customizable parameters and enhancing the photorealism of generated datasets. This would make BVS even more valuable for diverse applications in computer vision research and beyond.

This overview of the BEHAVIOR Vision Suite highlights its potential to revolutionize how computer vision datasets are created and utilized. With its extensive customization options and high-quality outputs, BVS stands as a powerful tool for advancing computer vision research.

Authors (23)
  1. Yunhao Ge
  2. Yihe Tang
  3. Jiashu Xu
  4. Cem Gokmen
  5. Chengshu Li
  6. Wensi Ai
  7. Benjamin Jose Martinez
  8. Arman Aydin
  9. Mona Anvari
  10. Ayush K Chakravarthy
  11. Hong-Xing Yu
  12. Josiah Wong
  13. Sanjana Srivastava
  14. Sharon Lee
  15. Shengxin Zha
  16. Laurent Itti
  17. Yunzhu Li
  18. Roberto Martín-Martín
  19. Miao Liu
  20. Pengchuan Zhang