
Learning to Detect Objects with a 1 Megapixel Event Camera (2009.13436v2)

Published 28 Sep 2020 in cs.CV and cs.LG

Abstract: Event cameras encode visual information with high temporal precision, low data rate, and high dynamic range. Thanks to these characteristics, event cameras are particularly suited to scenarios with high-speed motion, challenging lighting conditions, and low-latency requirements. However, due to the novelty of the field, the performance of event-based systems on many vision tasks is still lower than that of conventional frame-based solutions. The main reasons for this performance gap are: the lower spatial resolution of event sensors compared to frame cameras; the lack of large-scale training datasets; and the absence of well-established deep learning architectures for event-based processing. In this paper, we address all of these problems in the context of an event-based object detection task. First, we publicly release the first high-resolution large-scale dataset for object detection. The dataset contains more than 14 hours of recordings from a 1 megapixel event camera in automotive scenarios, together with 25M bounding boxes of cars, pedestrians, and two-wheelers, labeled at high frequency. Second, we introduce a novel recurrent architecture for event-based detection and a temporal consistency loss for better-behaved training. The ability to compactly represent the sequence of events in the internal memory of the model is essential to achieving high accuracy. Our model outperforms feed-forward event-based architectures by a large margin. Moreover, our method does not require any reconstruction of intensity images from events, showing that training directly from raw events is possible, more efficient, and more accurate than passing through an intermediate intensity image. Experiments on the dataset introduced in this work, for which both events and gray-level images are available, show performance on par with that of highly tuned and well-studied frame-based detectors.

Citations (212)

Summary

  • The paper introduces the first public 1MP event camera dataset with 25M labeled bounding boxes for automotive object detection.
  • The paper employs a novel recurrent neural network architecture with temporal consistency loss to efficiently process asynchronous event data.
  • The paper demonstrates that processing raw events directly is more efficient and more accurate than reconstructing intensity images first, matching frame-based methods in challenging real-world conditions.

An Evaluation of Object Detection Using High-Resolution Event Cameras

The paper "Learning to Detect Objects with a 1 Megapixel Event Camera" addresses the challenges and opportunities associated with deploying high-resolution event cameras for object detection, particularly within the context of automotive applications. Event cameras are a burgeoning technology in the field of computer vision, offering unique advantages such as high temporal resolution, low data rates, and expansive dynamic range. However, practical deployment has been lagging behind conventional frame-based systems due to limitations such as lower spatial resolution, insufficient large-scale datasets, and the lack of established deep learning architectures for event-based data.

Contributions and Methodology

The paper makes several significant contributions to the development of event-based vision systems:

  1. High-Resolution Dataset Release: This research introduces the first public 1 megapixel dataset for event-camera-based object detection. The dataset comprises more than 14 hours of automotive recordings and provides 25 million high-frequency labeled bounding boxes covering cars, pedestrians, and two-wheelers. Such a dataset constitutes a critical resource for training and benchmarking algorithms in the field.
  2. Novel Recurrent Architecture: The authors propose a recurrent neural network architecture tailored to the asynchronous data stream produced by event cameras. A temporal consistency loss proves pivotal during training, helping the network compactly encode sequences of events in its internal memory (see the sketch after this list).
  3. Direct Event Processing Demonstration: Through extensive experiments, the authors show that their model operates without reconstructing intensity images, a step that normally incurs additional computational cost. Training directly on raw event data is thus not only more efficient but also more accurate than passing through intermediate intensity images.
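
To make these contributions concrete, the sketch below shows one way such a pipeline could fit together: raw events are binned into a dense two-channel histogram, a small convolutional encoder feeds a ConvLSTM-style memory cell, and an auxiliary loss penalizes inconsistent predictions at consecutive time steps. This is a minimal illustration, not the authors' released implementation; the two-bin histogram representation, the layer sizes, the single memory cell, and the smooth-L1 consistency penalty are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def events_to_histogram(events, height, width):
    """Bin raw events (x, y, t, polarity) into a 2-channel count image,
    one channel per polarity. `events` is an (N, 4) float tensor."""
    hist = torch.zeros(2, height, width)
    x = events[:, 0].long()
    y = events[:, 1].long()
    p = (events[:, 3] > 0).long()  # channel 0: OFF events, channel 1: ON events
    hist.index_put_((p, y, x), torch.ones(len(events)), accumulate=True)
    return hist


class RecurrentEventDetector(nn.Module):
    """Conv encoder feeding a single ConvLSTM cell, topped by a toy
    box-regression head. All sizes are illustrative."""

    def __init__(self, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        # One ConvLSTM cell: a single convolution produces all four gates.
        self.gates = nn.Conv2d(2 * hidden, 4 * hidden, 3, padding=1)
        self.head = nn.Conv2d(hidden, 4, 1)  # (cx, cy, w, h) per grid cell

    def step(self, frame, state=None):
        feat = self.encoder(frame)
        if state is None:
            state = (torch.zeros_like(feat), torch.zeros_like(feat))
        h, c = state
        gates = self.gates(torch.cat([feat, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return self.head(h), (h, c)


def temporal_consistency_loss(preds):
    """Penalize large jumps between consecutive predictions; one plausible
    reading of a temporal consistency term, not the paper's exact formula."""
    return sum(F.smooth_l1_loss(p, q) for p, q in zip(preds, preds[1:]))


# Streaming usage: the recurrent state persists across event slices, so the
# detector accumulates evidence over time instead of seeing each slice alone.
model = RecurrentEventDetector()
state, preds = None, []
for _ in range(3):  # three consecutive slices of a synthetic event stream
    chunk = torch.rand(500, 4) * torch.tensor([320.0, 240.0, 1.0, 2.0])
    frame = events_to_histogram(chunk, 240, 320).unsqueeze(0)
    out, state = model.step(frame, state)
    preds.append(out)
aux_loss = temporal_consistency_loss(preds)  # added to a standard detection loss
```

The recurrent state is what lets the detector accumulate evidence across event slices, which is why the paper's recurrent model outperforms feed-forward baselines; the consistency term above is one simple way to encourage stable predictions over time.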

Key Findings

The proposed approach outperforms existing feed-forward event-based architectures by a substantial margin and matches the accuracy traditionally associated with frame-based systems. On the introduced dataset, where both events and grayscale images are available, the event-based detector achieves performance on par with highly tuned frame-based detectors, marking a critical advancement in the adoption of event cameras for real-world applications.

Implications

The ability to process event data directly opens new avenues for high-speed, dynamic visual systems, particularly in environments with challenging lighting conditions where conventional sensors struggle. The release of a high-resolution, large-scale dataset substantially lowers barriers for further research and development, potentially leading to breakthroughs in neuromorphic vision systems. This has implications in various industries, from autonomous vehicles needing rapid reaction times to robotics operating in dynamic environments.

Future Directions

There are promising paths for future exploration and improvement. One potential area of advancement is the integration of event-based processing with neuromorphic hardware architectures, aiming to reduce computational overhead and further exploit the inherent sparsity of event data. Additionally, extending these methodologies to color event cameras could broaden the applicability and enhance the functionality of event-based detection systems.

This paper stands as a significant contribution to event-camera research, providing a foundational dataset, a novel method, and convincing empirical demonstrations that pave the way for future innovations in event-based vision systems.