- The paper introduces a two-step process that generates orthogonal heatmaps from event data and triangulates them to estimate accurate 3D human poses.
- It processes high-speed event streams with CNNs, exploiting the sensor's high temporal resolution to mitigate motion blur while demanding less computation than RGB-based methods.
- Extensive experiments on both real and synthetic datasets demonstrate competitive accuracy and highlight promising transfer learning opportunities.
Overview of "Lifting Monocular Events to 3D Human Poses"
The paper "Lifting Monocular Events to 3D Human Poses" presents a novel methodology for estimating 3D human poses using event cameras, which record asynchronous events instead of continuous video frames as in traditional RGB cameras. The authors address the challenges associated with detecting human movements in dynamic scenarios where conventional cameras might suffer from motion blur and high computational demands.
Methodology
The proposed method introduces a two-step process:
- Event Stream Processing: The first step processes the event-camera data to predict three orthogonal heatmaps for each body joint. Each heatmap encodes the probability of the joint's 2D projection onto one of three orthogonal planes, and the heatmaps are estimated by convolutional neural networks (CNNs), exploiting the high temporal resolution of event cameras.
- 3D Joint Localization: The three heatmaps are then combined to triangulate the final 3D position of each joint (see the sketch below). By working with orthogonal 2D projections rather than volumetric heatmaps, the method avoids the heavy memory and computation of 3D representations without significantly compromising accuracy.
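As a rough illustration of the decoding step, the minimal Python sketch below shows how three orthogonal 2D heatmaps for one joint could be turned into a 3D coordinate: a soft-argmax on each plane, then averaging the two redundant estimates of each axis. The function names and the averaging rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Return the expected (row, col) coordinate of a 2D probability heatmap."""
    prob = heatmap / (heatmap.sum() + 1e-8)
    row = prob.sum(axis=1) @ np.arange(heatmap.shape[0])  # expected row index
    col = prob.sum(axis=0) @ np.arange(heatmap.shape[1])  # expected column index
    return row, col

def lift_joint(hm_xy, hm_xz, hm_yz):
    """Combine three orthogonal heatmaps for one joint into a 3D point.

    Each plane yields two coordinates; the redundant estimates of each axis
    are averaged here, as a simple stand-in for the paper's triangulation.
    """
    y1, x1 = soft_argmax_2d(hm_xy)   # xy plane: rows ~ y, cols ~ x
    z1, x2 = soft_argmax_2d(hm_xz)   # xz plane: rows ~ z, cols ~ x
    z2, y2 = soft_argmax_2d(hm_yz)   # yz plane: rows ~ z, cols ~ y
    return np.array([(x1 + x2) / 2, (y1 + y2) / 2, (z1 + z2) / 2])

# Example: decode one joint from 64x64 heatmaps (e.g. CNN outputs after softmax)
heatmaps = [np.random.rand(64, 64) for _ in range(3)]
print(lift_joint(*heatmaps))  # -> 3D coordinate in heatmap-grid units
```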
Contributions
The paper makes several significant contributions:
- Dataset Creation: The authors introduce a new dataset for event-based human pose estimation by simulating asynchronous events from the videos of the well-known Human3.6m dataset (a simplified sketch of this frame-to-event idea appears after this list). The resulting data addresses the shortage of large-scale event-based pose benchmarks and provides challenging scenarios for evaluating event-based human pose estimation methods.
- Comprehensive Experiments: Extensive experiments show accuracy competitive with traditional RGB-based 3D pose estimation methods, narrowing the performance gap between the two approaches. Evaluations are conducted on both real and synthetic event data, namely the DHP19 dataset and the newly generated Event-Human3.6m dataset.
- Transfer Learning Insights: The paper explores transfer learning from RGB tasks to event-based vision, questioning how much pre-training on related tasks actually helps in the event-based setting.
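The authors' exact simulation pipeline is not reproduced here, but the minimal Python sketch below illustrates the idea common to frame-to-event simulators such as ESIM: emit an event whenever a pixel's log intensity changes by more than a contrast threshold. The function name, threshold value, and event format are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def frames_to_events(frames, timestamps, threshold=0.2):
    """Naive frame-to-event conversion by thresholded log-intensity change.

    frames: grayscale images in [0, 1]; timestamps: one float per frame.
    Returns events as (t, x, y, polarity) tuples. Real simulators also
    interpolate between frames and model sensor noise; this shows only
    the core thresholding idea.
    """
    events = []
    log_ref = np.log(frames[0] + 1e-4)          # per-pixel reference log intensity
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_cur = np.log(frame + 1e-4)
        diff = log_cur - log_ref
        for polarity, mask in ((+1, diff >= threshold), (-1, diff <= -threshold)):
            ys, xs = np.nonzero(mask)
            events.extend((t, x, y, polarity) for x, y in zip(xs, ys))
            log_ref[mask] = log_cur[mask]       # reset reference where an event fired
    events.sort(key=lambda e: e[0])
    return events

# Example: two 2x2 frames with one brightening pixel
f0 = np.full((2, 2), 0.2); f1 = f0.copy(); f1[0, 0] = 0.8
print(frames_to_events([f0, f1], [0.0, 0.01]))  # -> [(0.01, 0, 0, 1)]
```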
Implications and Future Directions
The implications of this research span both theory and practice. Theoretically, it advances our understanding of how asynchronous event streams can be lifted into 3D representations of complex human motion; practically, it demonstrates the feasibility and benefits of event-based approaches in settings that demand high-speed, dynamic vision processing, such as sports analytics and autonomous vehicle navigation.
Future research could explore multimodal approaches that fuse event and RGB data for better performance in complex environments, and could further investigate which kinds of pre-training transfer most effectively to event-based learning.
In conclusion, this paper is a significant step toward robust, resource-efficient human pose estimation with event cameras, contributing to the broader field of computer vision by pairing biologically inspired sensing with modern deep learning.