RGB-Event Fusion with Self-Attention for Collision Prediction
This paper presents a framework for collision prediction that fuses RGB and event-based vision sensors in a neural network architecture enhanced with self-attention. The work addresses the need for robust, real-time obstacle avoidance in unmanned aerial vehicles (UAVs) operating in dynamic environments. By harnessing the complementary strengths of RGB and event-based cameras, the proposed method aims to predict both the time and the spatial location of a potential collision more accurately than either modality alone.
Methodology and Key Findings
The architecture employs two separate encoder branches that process RGB images and event streams independently before merging them through a self-attention mechanism. This fusion strategy is designed to combine the temporal richness of event-based data with the spatial and color information of RGB data, thereby improving prediction accuracy.
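The fusion step can be illustrated with a minimal sketch: tokens from the two encoder branches are concatenated into one sequence and passed through single-head self-attention, so every token (RGB or event) can attend to every other. All names, shapes, and the random weights below are illustrative assumptions, not the paper's actual implementation, where the projections would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fusion(rgb_tokens, event_tokens, d_k=32, seed=0):
    """Fuse two modalities' feature tokens with single-head self-attention.

    rgb_tokens, event_tokens: (n_tokens, d_model) arrays standing in for
    the outputs of the two encoder branches. Projection weights are
    random here; in the real model they are learned.
    """
    rng = np.random.default_rng(seed)
    x = np.concatenate([rgb_tokens, event_tokens], axis=0)  # joint sequence
    d_model = x.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                  for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Cross-modal mixing: attention scores span both modalities' tokens.
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    fused = attn @ v            # (2 * n_tokens, d_k) fused token features
    return fused.mean(axis=0)   # pooled feature for the prediction heads
```

In practice the pooled feature would feed task heads that regress the time and image location of the predicted collision.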
Evaluated on the ABCD dataset at a prediction rate of 50 Hz, the fusion model improves prediction accuracy by an average of 1% over single-modality models. For collisions occurring at distances greater than 0.5 meters, however, fusion yields an average accuracy gain of 10%. These gains come at a significant computational cost: memory usage and FLOPs increase by 71% and 105%, respectively, relative to single-modality approaches.
In the single-modality comparison, the event-based model clearly outperforms the RGB-only model, achieving 4% better spatial precision and a 26% lower time prediction error at a similar computational footprint. This underscores the suitability of event cameras for applications demanding rapid motion analysis and adaptability to low-light conditions.
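The temporal advantage described above comes from the asynchronous event stream; to feed such a stream to a standard convolutional encoder, it is commonly binned into a frame-like tensor. The sketch below shows one such representation (a time-binned polarity histogram); the function name, input layout, and binning scheme are illustrative assumptions and may differ from the paper's encoding.

```python
import numpy as np

def events_to_histogram(events, height, width, bins=4):
    """Accumulate events into a (2 * bins, height, width) tensor.

    events: (N, 4) array of rows (t, x, y, polarity), a common raw
    event-camera format. Each event increments one cell selected by its
    time bin, polarity, and pixel coordinates.
    """
    t, x, y, p = (events[:, i] for i in range(4))
    # Normalize timestamps to [0, 1]; guard against a zero time span.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    b = np.minimum((t_norm * bins).astype(int), bins - 1)  # time bin index
    vol = np.zeros((2 * bins, height, width), dtype=np.float32)
    # Channel = 2 * time_bin + polarity_bit; np.add.at handles repeats.
    np.add.at(vol, (b * 2 + (p > 0).astype(int),
                    y.astype(int), x.astype(int)), 1.0)
    return vol
```

The resulting tensor preserves coarse timing and polarity information while remaining compatible with image-style encoder branches.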
Quantization experiments further assess the trade-offs between precision and efficiency. Applying low-bit (1- to 8-bit) quantization strategies to the event-based model illustrates how predictive accuracy can be preserved while significantly reducing computational demands, thus enabling deployment in resource-constrained environments.
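A generic example of such low-bit quantization is symmetric uniform rounding of a weight tensor, sketched below. This is a stand-in for the idea, not the paper's exact scheme; the 1-bit branch uses a simple sign-with-scale rule as one common choice.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization of a weight array to `bits` bits.

    Returns the dequantized weights, i.e. the values the quantized model
    would effectively use. Higher bit widths give smaller rounding error.
    """
    if bits == 1:
        # Binary case: keep only the sign, scaled by the mean magnitude.
        scale = np.abs(w).mean()
        return scale * np.sign(w)
    qmax = 2 ** (bits - 1) - 1          # largest representable level
    scale = np.abs(w).max() / qmax      # per-tensor scale factor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

Sweeping `bits` from 1 to 8 over a model's weights is one way to map out the accuracy-versus-footprint curve the experiments describe.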
Implications and Future Directions
By quantifying the relative strengths and computational demands of multi-modal perception, this work informs the design of sensor fusion strategies for robotic applications. The insights gained on fusing temporal and spatial modalities point to practical avenues for more intelligent and responsive collision avoidance systems in UAVs.
The research highlights the temporal-resolution benefits of event-based data while also pinpointing the increased resource requirements of multi-modal fusion. Future work could focus on computational efficiency, for example through spiking neural networks or dedicated hardware accelerators, to offset the extra cost of multi-modal processing.
Ultimately, this study lays the groundwork for further explorations into enriched sensory fusion methods and presents a compelling case for the expanded use of event-based cameras in real-world robotic vision systems.