- The paper presents a new dataset of over 20,000 real RGB-D images of indoor environments with more than 50,000 bounding-box annotations of object instances, designed to simulate active vision tasks for robots.
- The dataset allows researchers to simulate robotic navigation and investigate how viewpoint changes and occlusions affect object detection performance.
- This resource provides a valuable benchmark for developing and comparing active vision algorithms, enabling research without the need for physical robots or synthetic models.
Overview of "A Dataset for Developing and Benchmarking Active Vision"
The paper makes a notable contribution to robotic vision by introducing a dataset explicitly curated to simulate active vision tasks in typical indoor environments using real RGB-D imagery. With over 20,000 images and 50,000 bounding boxes capturing object instances across nine distinct scenes, the dataset provides a significant resource for investigating challenges inherent in robotic perception, particularly how object detection is affected by scale, occlusion, and viewing angle.
Dataset Characteristics and Collection Methodology
The dataset comprises densely sampled RGB-D images of everyday indoor environments, such as kitchens, living rooms, and offices, allowing robotic navigation through these scenes to be simulated virtually. During collection, the camera was placed at points on a grid spaced roughly 30 centimeters apart and rotated in 30-degree increments at each point, yielding twelve views per location. This dense sampling gives the dataset rich geometric diversity and supports robust simulation of dynamic robotic vision tasks.
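To make the navigation simulation concrete, the following minimal sketch (in Python, with hypothetical names not taken from the paper's released tooling) models each captured image as a view indexed by a grid position and a heading in 30-degree increments; moving or rotating the virtual robot then amounts to selecting the neighboring view.

```python
# Minimal sketch of simulating robot motion over the densely sampled capture grid.
# Assumptions (not from the paper's code release): each real RGB-D frame can be
# looked up by a grid coordinate and a heading in 30-degree increments.
import math
from dataclasses import dataclass

GRID_STEP_M = 0.30      # ~30 cm between adjacent capture points
HEADING_STEP_DEG = 30   # camera rotated in 30-degree increments (12 views per point)

@dataclass(frozen=True)
class View:
    x: int          # grid column index
    y: int          # grid row index
    heading: int    # degrees, one of {0, 30, ..., 330}

def rotate(view: View, clockwise: bool = True) -> View:
    """Turn in place by one 30-degree increment."""
    delta = HEADING_STEP_DEG if clockwise else -HEADING_STEP_DEG
    return View(view.x, view.y, (view.heading + delta) % 360)

def step_forward(view: View) -> View:
    """Advance one grid cell (~30 cm) along the current heading.

    Only axis-aligned headings translate cleanly on a square grid; other
    headings are rounded to the nearest axis here for simplicity.
    """
    rad = math.radians(view.heading)
    dx = round(math.cos(rad))
    dy = round(math.sin(rad))
    return View(view.x + dx, view.y + dy, view.heading)

# A virtual trajectory is then a sequence of views, each indexing a real
# RGB-D image captured at (approximately) that pose.
path = [View(0, 0, 0)]
path.append(rotate(path[-1]))        # turn 30 degrees clockwise
path.append(step_forward(path[-1]))  # move ~30 cm forward
```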
Object Detection and Active Vision Simulation
Addressing the task of object detection, the authors adapt a state-of-the-art object category detector, SSD, to instance detection, demonstrating the applicability of deep CNNs to this domain despite the challenges posed by small object scales and occlusions. The paper highlights the detector's sensitivity to viewpoint: recognition performance tends to degrade as viewing conditions change, a central observation motivating active vision in robotic applications.
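As an illustration of the kind of viewpoint-sensitivity analysis described here (this is not the paper's evaluation code, and the record fields are assumptions), one can bin results by a scale proxy such as bounding-box height and compute recall per bin:

```python
# Illustrative sketch: bin annotated ground-truth instances by bounding-box
# height (a proxy for object scale) and compute recall per bin, exposing how
# performance degrades for small, distant, or heavily occluded instances.
# Each record in `instances` is hypothetical: one ground-truth instance with a
# flag indicating whether the detector recovered it.
from collections import defaultdict

def recall_by_scale(instances, bin_edges=(0, 50, 100, 200, 10_000)):
    """Return recall within each [lo, hi) box-height bin (in pixels)."""
    hits = defaultdict(int)     # ground-truth instances recovered by the detector
    totals = defaultdict(int)   # all ground-truth instances in the bin
    for inst in instances:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= inst["box_height_px"] < hi:
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += int(inst["matched_by_detector"])
                break
    return {b: hits[b] / totals[b] for b in totals}

# Example with toy records:
print(recall_by_scale([
    {"box_height_px": 40, "matched_by_detector": False},
    {"box_height_px": 120, "matched_by_detector": True},
]))
```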
Furthermore, the paper validates the use of the densely sampled dataset for simulating robotic vision systems that actively select viewpoints to improve recognition accuracy. Through reinforcement learning, a deep network is trained to predict the next best move for object classification, exploiting the dataset's dense geometric coverage to improve classifier performance across virtual navigations.
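A conceptual sketch of this active-recognition loop is shown below; the environment, policy, and classifier interfaces are hypothetical stand-ins, not the paper's trained models or released API.

```python
# Conceptual sketch of the active-vision loop: a learned policy picks the next
# move (rotate or translate on the capture grid) until the classifier is
# confident about the target instance. All interfaces here are assumptions.
ACTIONS = ("rotate_cw", "rotate_ccw", "forward", "backward", "left", "right")

def active_recognition_episode(env, policy, classifier, max_steps=5, threshold=0.9):
    """Roll out up to `max_steps` moves, stopping early once confidence is high.

    `env` is assumed to expose reset() -> image and step(action) -> image, where
    each "image" is a real frame fetched from the dense grid of captured views.
    `policy(image)` returns one of ACTIONS; `classifier(image)` returns
    (label, confidence).
    """
    image = env.reset()
    label, confidence = classifier(image)
    for _ in range(max_steps):
        if confidence >= threshold:
            break
        action = policy(image)            # e.g. argmax over predicted action values
        image = env.step(action)          # "move" by retrieving the neighboring view
        label, confidence = classifier(image)
    return label, confidence
```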
Contributions and Implications
This research contributes significantly to bridging gaps identified in traditional computer vision datasets, which are often limited by biases stemming from human photographers and online image sources. In contrast, the proposed dataset reflects more realistic conditions pertinent to robotic operations, offering insights into variations in scale, occlusion levels, and non-frontal object views.
The implications of this work are twofold. Practically, the dataset provides a robust platform for developing advanced vision systems for robotic applications without requiring access to physical robots or synthetic models. Theoretically, it offers benchmarks for comparing approaches in active vision scenarios, contributing to a deeper understanding of how perception systems can adapt dynamically to changing environmental stimuli.
Future Directions
Looking forward, this dataset opens avenues for further exploration into multi-view object classification, the integration of recurrent models to account for motion history, and potential applications in more complex real-world settings extending beyond indoor environments. The paper sets a foundation for continued research into novel active vision algorithms, fostering innovation in AI-driven robotic perception and interactivity.