Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation (2311.12028v2)

Published 20 Nov 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. To effectively achieve this, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) to restore the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our method can achieve both high efficiency and estimation accuracy compared to the original VPT models. For instance, applying to MotionBERT and MixSTE on Human3.6M, our HoT can save nearly 50% FLOPs without sacrificing accuracy and nearly 40% FLOPs with only 0.2% accuracy drop, respectively. Code and models are available at https://github.com/NationalGAILab/HoT.

Authors (6)
  1. Wenhao Li
  2. Mengyuan Liu
  3. Hong Liu
  4. Pichao Wang
  5. Jialun Cai
  6. Nicu Sebe

Summary

Overview of the Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

This paper addresses a central challenge in video-based 3D human pose estimation (HPE) by proposing the Hourglass Tokenizer (HoT), a framework designed to improve the efficiency of transformer-based architectures. These architectures excel at modeling long-range dependencies and achieve state-of-the-art results in HPE, but their high computational cost makes them difficult to deploy on resource-constrained devices. HoT introduces a pruning-and-recovering strategy aimed at improving efficiency without sacrificing accuracy.

Key Components and Methodology

The framework is built around two key components: the Token Pruning Cluster (TPC) and the Token Recovering Attention (TRA).

  1. Token Pruning Cluster (TPC): TPC dynamically eliminates redundancy by selecting a small number of representative tokens. It does so with a clustering procedure that favors tokens with high semantic diversity, preserving the spatio-temporal information needed for accurate HPE. The selected cluster centers retain the most informative content while sharply reducing the number of tokens the transformer's intermediate blocks must process.
  2. Token Recovering Attention (TRA): After pruning, the TRA module restores the full-length token sequence from the selected tokens, so that a 3D pose is produced for every frame of the input sequence at the original temporal resolution. This is essential for seq2seq-style inference in applied settings, where per-frame outputs are required and must be delivered quickly. A minimal sketch of both modules follows this list.
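
The following is a minimal, illustrative PyTorch sketch of the pruning-and-recovering idea, not the authors' implementation: a greedy farthest-point selection stands in for TPC's clustering, and a single cross-attention layer with learnable full-length queries stands in for TRA. The module names, selection rule, and hyperparameters below are assumptions made for illustration; the official code is at https://github.com/NationalGAILab/HoT.

```python
import torch
import torch.nn as nn


class SimpleTPC(nn.Module):
    """Select k representative frame tokens from a full-length sequence."""

    def __init__(self, k: int):
        super().__init__()
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) pose tokens, one per frame. Token 0 is always kept.
        B, T, C = x.shape
        idx = torch.zeros(B, self.k, dtype=torch.long, device=x.device)
        dist = torch.full((B, T), float("inf"), device=x.device)
        # Greedy farthest-point selection: repeatedly pick the token that is
        # farthest from all tokens chosen so far, encouraging diversity.
        for i in range(1, self.k):
            last = x.gather(1, idx[:, i - 1 : i].unsqueeze(-1).expand(-1, -1, C))
            dist = torch.minimum(dist, (x - last).pow(2).sum(-1))
            idx[:, i] = dist.argmax(dim=1)
        return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))  # (B, k, C)


class SimpleTRA(nn.Module):
    """Recover full-length tokens from the pruned ones via cross-attention."""

    def __init__(self, dim: int, num_frames: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_frames, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pruned: torch.Tensor) -> torch.Tensor:
        # pruned: (B, k, C) -> recovered: (B, T, C)
        q = self.queries.expand(pruned.size(0), -1, -1)
        out, _ = self.attn(q, pruned, pruned)
        return out


if __name__ == "__main__":
    B, T, C, k = 2, 243, 256, 16
    tokens = torch.randn(B, T, C)
    pruned = SimpleTPC(k)(tokens)        # torch.Size([2, 16, 256])
    recovered = SimpleTRA(C, T)(pruned)  # torch.Size([2, 243, 256])
    print(pruned.shape, recovered.shape)
```

In the actual framework, the pruned tokens pass through the remaining transformer blocks before TRA expands them back to full length, which is what produces the hourglass-shaped token count and the FLOPs savings reported below.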

Innovations and Contributions

The proposed framework is plug-and-play and efficient, designed to reduce the computational burden of existing video pose transformers (VPTs) without degrading performance. Its central insight is that full-length pose sequences are not required in the intermediate transformer blocks: a few representative tokens suffice to maintain accuracy.

Notably, the authors validate the framework on two benchmark datasets, Human3.6M and MPI-INF-3DHP, and report substantial efficiency gains. On Human3.6M, integrating HoT with MotionBERT saves nearly 50% of FLOPs with no loss in accuracy, and integrating it with MixSTE saves nearly 40% of FLOPs with only a 0.2% drop in accuracy, showing the approach's potential for resource-efficient deployment in real-world scenarios.
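
To see where such savings come from, the back-of-envelope estimate below uses a standard per-layer transformer cost model (attention plus a 4x-expansion MLP). All depths, widths, sequence lengths, and pruning choices here are hypothetical assumptions, not the paper's measured configurations; the point is only to illustrate how keeping a few tokens in most blocks can roughly halve the compute.

```python
# Back-of-envelope FLOPs estimate: all numbers below are assumptions chosen
# for illustration, not the configurations evaluated in the paper.

def layer_flops(tokens: int, dim: int) -> int:
    attn = 4 * tokens * dim**2 + 2 * tokens**2 * dim  # QKV/output projections + attention map
    mlp = 8 * tokens * dim**2                         # two linear layers, 4x expansion
    return attn + mlp

def hourglass_flops(depth: int, full_tokens: int, kept_tokens: int,
                    pruned_blocks: int, dim: int) -> int:
    full_blocks = depth - pruned_blocks
    return (full_blocks * layer_flops(full_tokens, dim)
            + pruned_blocks * layer_flops(kept_tokens, dim))

baseline = 16 * layer_flops(243, 512)                      # hypothetical full-length VPT
hot = hourglass_flops(depth=16, full_tokens=243, kept_tokens=16,
                      pruned_blocks=8, dim=512)
print(f"relative cost with pruning: {hot / baseline:.2f}")  # ~0.53 with these assumptions
```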

Practical and Theoretical Implications

Practically, the findings hold significant potential for deploying sophisticated HPE models in constrained environments such as mobile devices and embedded systems. The reduced computational and memory requirements can enable real-time processing across fields such as human-computer interaction, sports analytics, and surveillance.

From a theoretical perspective, the paper challenges the assumption that VPTs must process the full input sequence at every stage and identifies token selection and dynamic token management as a promising research direction. Since many domains rely on transformer architectures, these ideas could extend well beyond HPE.

Conclusion

The Hourglass Tokenizer marks a clear advance in the computational efficiency of transformer-based 3D HPE. By rethinking token processing through intelligent pruning and recovering, it delivers high accuracy at substantially reduced computational cost, opening avenues for further exploration and optimization in applications where efficiency must be achieved without compromising accuracy.