Overview of "HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors"
The paper "HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors" introduces a significant contribution to the field of human activity recognition (HAR) through the development of a new large-scale benchmark dataset, termed HARDVS. This work emerges from the limitations and challenges faced by conventional RGB camera-based HAR systems, such as issues with illumination, privacy concerns, and high energy consumption. The authors propose an innovative approach by utilizing dynamic vision sensors, also known as event cameras, which offer advantages like low latency, high dynamic range, and reduced energy consumption.
Key Contributions
- HARDVS Dataset: The authors present the HARDVS dataset, comprising over 100,000 event sequences spanning 300 categories of human activities. The dataset captures event data under realistic conditions, covering multiple viewpoints, varying illumination, different motion speeds, dynamic backgrounds, and occlusions. It is among the first large-scale, realistic datasets tailored to event-based HAR.
- ESTF Framework: The authors introduce an Event-based Spatial-Temporal Feature learning and fusion framework, termed ESTF. It projects event streams into spatial and temporal embeddings using a stem network (StemNet), then encodes and fuses the resulting dual-view representations with Transformer networks. This design enables context-aware feature learning across both views, improving HAR performance (a simplified sketch follows this list).
- Exhaustive Evaluation: The authors benchmark existing state-of-the-art HAR algorithms on the HARDVS dataset and provide extensive baselines for future work, underscoring the dataset's practicality and relevance for advancing event-based HAR research.
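To make the dual-view design concrete, here is a heavily simplified PyTorch sketch of the ESTF idea: a shared stem turns event frames into patch tokens, separate Transformer encoders attend over space (patches within a frame) and time (frames at each patch location), and a fusion encoder combines the two views for classification. All module names, dimensions, and the pooling/fusion rule are assumptions for illustration; the paper's actual StemNet and fusion Transformer are more elaborate.

```python
import torch
import torch.nn as nn

class ESTFSketch(nn.Module):
    """Simplified dual-branch spatial-temporal Transformer (illustrative)."""

    def __init__(self, in_ch=2, dim=256, num_classes=300, depth=4, heads=8):
        super().__init__()
        # "StemNet" stand-in: a conv stem that turns each event frame
        # into a grid of patch tokens.
        self.stem = nn.Conv2d(in_ch, dim, kernel_size=16, stride=16)

        def encoder():
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            return nn.TransformerEncoder(layer, depth)

        self.spatial_encoder = encoder()   # attends over patches within a frame
        self.temporal_encoder = encoder()  # attends over frames per location
        self.fusion = encoder()            # joint encoder over both views
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):  # x: (B, T, C, H, W) stacked event frames
        b, t = x.shape[:2]
        tokens = self.stem(x.flatten(0, 1))          # (B*T, dim, h', w')
        tokens = tokens.flatten(2).transpose(1, 2)   # (B*T, N, dim)
        n = tokens.shape[1]

        spa = self.spatial_encoder(tokens)           # per-frame spatial view
        spa = spa.reshape(b, t * n, -1).mean(1, keepdim=True)

        tem = tokens.reshape(b, t, n, -1).transpose(1, 2).flatten(0, 1)
        tem = self.temporal_encoder(tem)             # per-location temporal view
        tem = tem.reshape(b, n * t, -1).mean(1, keepdim=True)

        fused = self.fusion(torch.cat([spa, tem], dim=1)).mean(1)
        return self.head(fused)

logits = ESTFSketch()(torch.randn(2, 8, 2, 128, 128))  # (2, 300)
```

Pooling each branch to a single token before fusion keeps the sketch short; the actual method fuses richer token sets across multiple stages.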
Experimental Insights
The experimental results reported in the paper support the effectiveness of the HARDVS dataset and the proposed ESTF model. Compared with existing models, ESTF better captures the spatial-temporal dynamics of event streams. ESTF also delivers strong recognition results on other event-based benchmarks such as N-Caltech101 and ASL-DVS, establishing competitive baselines for event-based recognition.
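Benchmarks of this kind are typically scored by top-1/top-5 accuracy over the test split. Below is a minimal sketch of such an evaluation loop; the model and data loader are placeholders, not artifacts released with the paper.

```python
import torch

@torch.no_grad()
def topk_accuracy(model, loader, device="cpu", ks=(1, 5)):
    """Compute top-k accuracy of `model` over `loader` (illustrative)."""
    model.eval().to(device)
    correct = {k: 0 for k in ks}
    total = 0
    for frames, labels in loader:  # frames: (B, T, C, H, W), labels: (B,)
        logits = model(frames.to(device))
        # Indices of the k highest-scoring classes for each sample.
        preds = logits.topk(max(ks), dim=1).indices.cpu()
        for k in ks:
            correct[k] += (preds[:, :k] == labels[:, None]).any(1).sum().item()
        total += labels.size(0)
    return {k: correct[k] / total for k in ks}
```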
Implications and Future Directions
The introduction of the HARDVS dataset marks a significant leap in addressing the shortcomings of traditional HAR methods reliant on RGB sensors. By harnessing the capabilities of event cameras, the paper opens up new possibilities in real-time, low-energy, privacy-preserving HAR applications. This aligns with the growing demand for autonomous systems within smart environments where computational efficiency and privacy compliance are paramount.
Practically, the implications of this work extend to enhanced human-computer interaction, surveillance, autonomous driving, and healthcare, where accurate activity recognition plays a pivotal role. The dataset and model together form a foundation upon which more complex event-based recognition systems can be developed, facilitating advancements in neuro-inspired sensing technologies.
The conclusions drawn from this work underscore the potential of Transformer-based architectures for fusing spatial and temporal information, paving the way for their broader application in event-based recognition tasks. As future work, the community might explore combining event data with other sensor modalities, such as LiDAR or acoustic data, to further enhance perception and decision-making in artificial intelligence systems. Additionally, exploring few-shot or unsupervised learning on the HARDVS dataset could offer insight into efficient model training in data-constrained settings.
In conclusion, this paper not only introduces a substantial new dataset but also provides solid methodologies and insights, potentially redefining the benchmarks for real-world HAR tasks.