Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation (2311.12028v2)

Published 20 Nov 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. To effectively achieve this, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) to restore the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our method can achieve both high efficiency and estimation accuracy compared to the original VPT models. For instance, applying to MotionBERT and MixSTE on Human3.6M, our HoT can save nearly 50% FLOPs without sacrificing accuracy and nearly 40% FLOPs with only 0.2% accuracy drop, respectively. Code and models are available at https://github.com/NationalGAILab/HoT.

Authors (6)
  1. Wenhao Li
  2. Mengyuan Liu
  3. Hong Liu
  4. Pichao Wang
  5. Jialun Cai
  6. Nicu Sebe

Summary

Overview of the Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

This paper addresses a central challenge in video-based 3D human pose estimation (HPE) by proposing the Hourglass Tokenizer (HoT), a framework designed to improve the efficiency of transformer-based architectures. These architectures excel at modeling long-range dependencies and achieve state-of-the-art results in HPE, but their high computational cost makes them difficult to deploy on resource-constrained devices. HoT introduces a pruning-and-recovering strategy aimed at improving efficiency without sacrificing accuracy.

Key Components and Methodology

The framework is built around two key components: the Token Pruning Cluster (TPC) and the Token Recovering Attention (TRA).

  1. Token Pruning Cluster (TPC): TPC dynamically eliminates redundancy by selecting a small number of representative tokens. It does so with a clustering procedure that favors tokens with high semantic diversity, preserving the spatio-temporal information needed for accurate HPE. The selected cluster centers retain the most informative content while sharply reducing the number of tokens the transformer's intermediate blocks must process.
  2. Token Recovering Attention (TRA): After pruning, the TRA module restores the full-length token sequence from the selected tokens, so that a 3D pose is produced for every frame of the input sequence at the original temporal resolution. This is essential for seq2seq-style inference in applied settings, where per-frame outputs are required and must be delivered quickly. A minimal sketch of both modules follows this list.
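
The following is a minimal, illustrative PyTorch sketch of the pruning-and-recovering idea, not the authors' implementation: a greedy farthest-point selection stands in for TPC's clustering, and a single cross-attention layer with learnable full-length queries stands in for TRA. The module names, selection rule, and hyperparameters below are assumptions made for illustration; the official code is at https://github.com/NationalGAILab/HoT.

```python
import torch
import torch.nn as nn


class SimpleTPC(nn.Module):
    """Select k representative frame tokens from a full-length sequence."""

    def __init__(self, k: int):
        super().__init__()
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) pose tokens, one per frame. Token 0 is always kept.
        B, T, C = x.shape
        idx = torch.zeros(B, self.k, dtype=torch.long, device=x.device)
        dist = torch.full((B, T), float("inf"), device=x.device)
        # Greedy farthest-point selection: repeatedly pick the token that is
        # farthest from all tokens chosen so far, encouraging diversity.
        for i in range(1, self.k):
            last = x.gather(1, idx[:, i - 1 : i].unsqueeze(-1).expand(-1, -1, C))
            dist = torch.minimum(dist, (x - last).pow(2).sum(-1))
            idx[:, i] = dist.argmax(dim=1)
        return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))  # (B, k, C)


class SimpleTRA(nn.Module):
    """Recover full-length tokens from the pruned ones via cross-attention."""

    def __init__(self, dim: int, num_frames: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_frames, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pruned: torch.Tensor) -> torch.Tensor:
        # pruned: (B, k, C) -> recovered: (B, T, C)
        q = self.queries.expand(pruned.size(0), -1, -1)
        out, _ = self.attn(q, pruned, pruned)
        return out


if __name__ == "__main__":
    B, T, C, k = 2, 243, 256, 16
    tokens = torch.randn(B, T, C)
    pruned = SimpleTPC(k)(tokens)        # torch.Size([2, 16, 256])
    recovered = SimpleTRA(C, T)(pruned)  # torch.Size([2, 243, 256])
    print(pruned.shape, recovered.shape)
```

In the actual framework, the pruned tokens pass through the remaining transformer blocks before TRA expands them back to full length, which is what produces the hourglass-shaped token count and the FLOPs savings reported below.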

Innovations and Contributions

The proposed framework is plug-and-play and efficient, designed to reduce the computational burden of existing video pose transformers (VPTs) without degrading performance. Its central insight is that full-length pose sequences are not required in the intermediate transformer blocks: a few representative tokens suffice to maintain accuracy.

Notably, the authors validate the framework on two benchmark datasets, Human3.6M and MPI-INF-3DHP, and report substantial efficiency gains. On Human3.6M, integrating HoT with MotionBERT saves nearly 50% of FLOPs with no loss in accuracy, and integrating it with MixSTE saves nearly 40% of FLOPs with only a 0.2% drop in accuracy, showing the approach's potential for resource-efficient deployment in real-world scenarios.
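
To see where such savings come from, the back-of-envelope estimate below uses a standard per-layer transformer cost model (attention plus a 4x-expansion MLP). All depths, widths, sequence lengths, and pruning choices here are hypothetical assumptions, not the paper's measured configurations; the point is only to illustrate how keeping a few tokens in most blocks can roughly halve the compute.

```python
# Back-of-envelope FLOPs estimate: all numbers below are assumptions chosen
# for illustration, not the configurations evaluated in the paper.

def layer_flops(tokens: int, dim: int) -> int:
    attn = 4 * tokens * dim**2 + 2 * tokens**2 * dim  # QKV/output projections + attention map
    mlp = 8 * tokens * dim**2                         # two linear layers, 4x expansion
    return attn + mlp

def hourglass_flops(depth: int, full_tokens: int, kept_tokens: int,
                    pruned_blocks: int, dim: int) -> int:
    full_blocks = depth - pruned_blocks
    return (full_blocks * layer_flops(full_tokens, dim)
            + pruned_blocks * layer_flops(kept_tokens, dim))

baseline = 16 * layer_flops(243, 512)                      # hypothetical full-length VPT
hot = hourglass_flops(depth=16, full_tokens=243, kept_tokens=16,
                      pruned_blocks=8, dim=512)
print(f"relative cost with pruning: {hot / baseline:.2f}")  # ~0.53 with these assumptions
```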

Practical and Theoretical Implications

Practically, the findings hold significant potential for deploying sophisticated HPE models in constrained environments such as mobile devices and embedded systems. The reduced computational and memory requirements can enable real-time processing across fields such as human-computer interaction, sports analytics, and surveillance.

From a theoretical perspective, the paper challenges the assumption that VPTs must process the full input sequence at every stage and identifies token selection and dynamic token management as a promising research direction. Since many domains rely on transformer architectures, these ideas could extend well beyond HPE.

Conclusion

The Hourglass Tokenizer marks a clear advance in the computational efficiency of transformer-based 3D HPE. By rethinking token processing through intelligent pruning and recovering, it delivers high accuracy at substantially reduced computational cost, opening avenues for further exploration and optimization in applications where efficiency must be achieved without compromising accuracy.