
Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation (2004.00329v1)

Published 1 Apr 2020 in cs.CV

Abstract: In this paper we present a novel approach for bottom-up multi-person 3D human pose estimation from monocular RGB images. We propose to use high resolution volumetric heatmaps to model joint locations, devising a simple and effective compression method to drastically reduce the size of this representation. At the core of the proposed method lies our Volumetric Heatmap Autoencoder, a fully-convolutional network tasked with the compression of ground-truth heatmaps into a dense intermediate representation. A second model, the Code Predictor, is then trained to predict these codes, which can be decompressed at test time to re-obtain the original representation. Our experimental evaluation shows that our method performs favorably when compared to state of the art on both multi-person and single-person 3D human pose estimation datasets and, thanks to our novel compression strategy, can process full-HD images at the constant runtime of 8 fps regardless of the number of subjects in the scene. Code and models available at https://github.com/fabbrimatteo/LoCO .

Citations (81)

Summary

  • The paper presents a novel approach that compresses high-resolution volumetric heatmaps to enable efficient multi-person 3D pose estimation.
  • A Volumetric Heatmap Autoencoder compresses ground-truth heatmaps into compact codes, and a Code Predictor learns to predict those codes from the image; the pipeline runs at a constant 8 fps on full-HD input.
  • The method performs favorably against state-of-the-art techniques on both multi-person and single-person datasets, which suggests potential for applications such as surveillance and sports analytics.

Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation

This paper presents a bottom-up approach to multi-person 3D human pose estimation from monocular RGB images, using high-resolution volumetric heatmaps to model joint locations. Because such heatmaps are costly in memory and computation, the authors introduce a compression method that drastically reduces the size of the representation. At the core of the method is the Volumetric Heatmap Autoencoder, a fully-convolutional network that compresses ground-truth heatmaps into a dense intermediate representation. A second model, the Code Predictor, is then trained to predict these codes, which are decompressed at test time to recover the original representation.

The paper reports that this technique performs favorably against existing state-of-the-art methods on both multi-person and single-person 3D human pose estimation datasets. Thanks to the compression strategy, the method maintains a constant runtime of 8 fps on full-HD images, irrespective of the number of subjects in the scene. Code and models are publicly available at https://github.com/fabbrimatteo/LoCO, enabling further experimentation and validation by the research community.

Human Pose Estimation (HPE), both in 2D and 3D, has long relied on heatmaps to predict body joint locations. Extending heatmaps to multi-person 3D HPE is difficult, however, because a volumetric heatmap at useful resolution demands prohibitive memory and computation. The proposed solution is to predict high-resolution volumetric heatmaps directly while keeping storage and computation requirements low. This makes it possible to tackle multi-person 3D HPE in a single-shot, bottom-up fashion, without the complex iterative or refinement strategies typically used to correct the quantization errors of low-resolution heatmaps.
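The trade-off described above can be made concrete with a little arithmetic. The sketch below is only an illustration: the joint count, grid sizes, and scene depth are hypothetical values chosen for round numbers, not the settings used in the paper. It shows how the size of a dense volumetric heatmap grows with resolution, while the worst-case quantization error along the depth axis shrinks.

```python
def heatmap_mebibytes(joints, depth, height, width, bytes_per_value=4):
    """Size in MiB of a dense volumetric heatmap with one volume per joint type."""
    return joints * depth * height * width * bytes_per_value / 2**20


def worst_case_quantization_error(extent_m, bins):
    """Largest distance along one axis between a point and the centre of its bin."""
    return extent_m / bins / 2


# Hypothetical settings, for illustration only (not taken from the paper).
JOINTS = 14
SCENE_DEPTH_M = 100.0

for depth, height, width in [(32, 64, 64), (256, 256, 256)]:
    size = heatmap_mebibytes(JOINTS, depth, height, width)
    error = worst_case_quantization_error(SCENE_DEPTH_M, depth)
    print(f"{depth}x{height}x{width}: {size:.1f} MiB, depth error up to {error:.3f} m")
```

Under these assumptions, raising the resolution cuts the depth error by a factor of eight but multiplies the memory footprint by 128, which is the cost that a compressed representation is meant to avoid.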

The approach, named LoCO (Learning on Compressed Output), is grounded in principles of compression and dimensionality reduction for sparse signals. Experiments on the JTA, CMU Panoptic, and Human3.6M datasets show that the method remains competitive even in 100-meter-wide scenes with more than 50 people. Its fine-grained predictions are also reported to set new state-of-the-art results on the single-person Human3.6M dataset.
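To see why volumetric heatmaps count as sparse signals, the following sketch builds a small dense volume with one Gaussian peak per joint and measures how many voxels are effectively zero. The grid size, joint centres, `sigma`, and threshold are all hypothetical, and taking the maximum over peaks is an assumption made for this illustration rather than a statement of the paper's exact formulation.

```python
import math
from itertools import product


def gaussian_heatmap(shape, centres, sigma=1.0):
    """Flat list of voxel values with a Gaussian peak at each joint centre."""
    depth, height, width = shape
    volume = []
    for d, h, w in product(range(depth), range(height), range(width)):
        # Each voxel keeps the strongest response among all peaks.
        value = max(
            math.exp(-((d - cd) ** 2 + (h - ch) ** 2 + (w - cw) ** 2) / (2 * sigma**2))
            for cd, ch, cw in centres
        )
        volume.append(value)
    return volume


def sparsity(volume, threshold=1e-3):
    """Fraction of voxels whose value falls below the threshold."""
    return sum(v < threshold for v in volume) / len(volume)


# Hypothetical grid and joint positions, for illustration only.
SHAPE = (16, 32, 32)
CENTRES = [(4, 8, 8), (8, 20, 12), (12, 10, 25)]

heatmap = gaussian_heatmap(SHAPE, CENTRES)
print(f"{sparsity(heatmap):.1%} of voxels are below the threshold")
```

Almost all of the volume carries no information beyond the few peak locations, which is what makes a much smaller code plausible.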

Methodologically, the approach consists of three main components:

  1. Volumetric Heatmap Autoencoder (VHA): This serves to compress the high-dimensional volumetric heatmaps into a compact representation while preserving critical information such as the location of Gaussian peaks corresponding to body joints.
  2. Code Predictor: A network that predicts the compressed codes obtained from the VHA, which can be subsequently decoded to recover the original volumetric heatmap representation.
  3. Joint Association Scheme: In multi-person contexts, joint association is performed using a straightforward distance-based heuristic, starting with detected heads and linking other joints based on Euclidean proximity, refined with anatomical constraints.
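The joint association step can be sketched as a greedy, head-anchored grouping. This is a simplified illustration and not the authors' implementation: the function name `associate_joints`, the data layout, and the use of a per-joint maximum head-to-joint distance as a stand-in for the anatomical constraints are all assumptions.

```python
import math


def associate_joints(heads, candidates, max_distance):
    """Greedily assign detected joints to the nearest detected head.

    heads: list of (x, y, z) head positions, one per person.
    candidates: dict mapping a joint name to a list of detected (x, y, z) points.
    max_distance: dict mapping a joint name to the largest plausible
        head-to-joint distance (a crude, hypothetical anatomical constraint).
    """
    skeletons = [{"head": head} for head in heads]
    for joint_name, points in candidates.items():
        # All plausible (distance, person, point) pairings, closest first.
        pairs = sorted(
            (math.dist(head, point), person, index)
            for person, head in enumerate(heads)
            for index, point in enumerate(points)
            if math.dist(head, point) <= max_distance[joint_name]
        )
        used_points = set()
        for _, person, index in pairs:
            if joint_name in skeletons[person] or index in used_points:
                continue  # this person or this detection is already matched
            skeletons[person][joint_name] = points[index]
            used_points.add(index)
    return skeletons


# Hypothetical detections in metres, for illustration only.
heads = [(0.0, 1.7, 5.0), (2.0, 1.6, 9.0)]
candidates = {"neck": [(2.0, 1.4, 9.0), (0.0, 1.5, 5.0), (10.0, 1.5, 30.0)]}
skeletons = associate_joints(heads, candidates, max_distance={"neck": 0.4})
for skeleton in skeletons:
    print(skeleton)
```

In this example the third neck candidate is farther than the allowed distance from every head, so it is left unassigned instead of being forced onto a skeleton.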

Experiments indicate that the approach surpasses alternative methods, including recent bottom-up methods that use Location Maps. Being bottom-up, it needs no separate person-detection stage, and its processing time stays constant regardless of the number of people in the scene. This gives it an efficiency advantage over top-down approaches, whose cost grows with the number of detected people.

This work has consequential implications for areas such as real-time surveillance or sports analytics applications that require rapid, accurate human motion tracking. The volumetric heatmaps, once computationally prohibitive for multi-person 3D pose estimation, are now tractable due to the proposed compression strategy. Moreover, the versatility of LoCO in performing well across both dense multi-person and single-person datasets implies its broad applicability.

Looking forward, further work could increase the spatial resolution of the heatmaps or improve the optimization to make the method more efficient. The compression paradigm introduced here may also influence how spatial representations are handled more broadly in computer vision, potentially extending beyond HPE to other high-dimensional inference tasks.