Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation (2311.12028v2)
Abstract: Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins by pruning the pose tokens of redundant frames and ends by recovering full-length tokens, so that only a few pose tokens pass through the intermediate transformer blocks, improving model efficiency. To achieve this effectively, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) that restores detailed spatio-temporal information from the selected tokens, expanding the network output back to the original full-length temporal resolution for fast inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our method achieves high efficiency while maintaining estimation accuracy comparable to the original VPT models. For instance, when applied to MotionBERT and MixSTE on Human3.6M, our HoT saves nearly 50% of FLOPs without sacrificing accuracy and nearly 40% of FLOPs with only a 0.2% accuracy drop, respectively. Code and models are available at https://github.com/NationalGAILab/HoT.
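The prune-and-recover idea described above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: `tpc_select` approximates the token pruning cluster with a simplified density-peaks score (density times separation, following the spirit of kNN-based density peaks clustering), and `tra_recover` stands in for token recovering attention, with randomly initialized queries playing the role of the learnable full-length tokens. All function and variable names here are assumptions for illustration.

```python
import numpy as np

def tpc_select(tokens, k_keep, k_nn=5):
    """Simplified Token Pruning Cluster (TPC) sketch.

    Picks k_keep representative frame tokens out of T by a
    density-peaks-style score: tokens that are both locally dense
    and well separated from denser tokens are kept.
    tokens: (T, C) array of per-frame pose tokens.
    Returns sorted indices of the kept frames.
    """
    T = tokens.shape[0]
    # pairwise Euclidean distances between frame tokens
    d = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    # local density: inverse mean distance to the k nearest neighbours
    knn = np.sort(d, axis=1)[:, 1:k_nn + 1]  # skip self-distance (0)
    rho = 1.0 / (knn.mean(axis=1) + 1e-8)
    # separation: distance to the nearest token of higher density
    delta = np.empty(T)
    for i in range(T):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i, higher].min() if len(higher) else d[i].max()
    score = rho * delta
    # keep the top-k_keep tokens, preserving temporal order
    return np.sort(np.argsort(score)[-k_keep:])

def tra_recover(selected, t_full, rng=None):
    """Token Recovering Attention (TRA) sketch.

    Full-length query tokens (learnable in the real model; random here)
    cross-attend to the few selected tokens to expand the sequence back
    to t_full frames for per-frame pose output.
    selected: (k, C) pruned tokens.  Returns (t_full, C).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    c = selected.shape[1]
    queries = 0.02 * rng.standard_normal((t_full, c))
    attn = queries @ selected.T / np.sqrt(c)          # (t_full, k) logits
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # row-wise softmax
    return attn @ selected                            # (t_full, C)

# Usage: prune an 81-frame sequence to 9 tokens, then recover all 81.
rng = np.random.default_rng(1)
tokens = rng.standard_normal((81, 32))   # stand-in for encoder tokens
keep = tpc_select(tokens, k_keep=9)
recovered = tra_recover(tokens[keep], t_full=81)
```

In the real framework the intermediate transformer blocks operate only on the 9 kept tokens, which is where the FLOPs savings come from; the recovery step runs once at the end.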
- Wenhao Li
- Mengyuan Liu
- Hong Liu
- Pichao Wang
- Jialun Cai
- Nicu Sebe