
Video-Based Human Pose Regression via Decoupled Space-Time Aggregation (2403.19926v2)

Published 29 Mar 2024 in cs.CV

Abstract: By leveraging temporal dependency in video sequences, multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations, such as occlusion, motion blur, and video defocus. These algorithms are predominantly based on heatmaps, resulting in high computation and storage requirements per frame, which limits their flexibility and real-time application in video scenarios, particularly on edge devices. In this paper, we develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose, the temporal trajectory of each individual joint exhibits relative independence. In light of this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint, thereby avoiding the conflation of spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism, our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods, DSTA significantly enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore, our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page: https://github.com/zgspose/DSTA.
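The decoupling the abstract describes — spatial attention among adjacent joints within a frame, and temporal attention along each joint's own trajectory across frames — can be sketched as below. This is a minimal illustration, not the paper's exact architecture: the tensor shapes, the boolean adjacency mask standing in for the joint-wise local-awareness mechanism, and the additive fusion of the two branches are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Plain scaled dot-product self-attention over the first axis of x: (N, D).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def decoupled_space_time_aggregation(tokens, adjacency):
    """Sketch of decoupled space-time aggregation over per-joint tokens.

    tokens:    (T, J, D) — one feature token per joint per frame (assumed layout).
    adjacency: (J, J) bool — True where two joints are spatially adjacent;
               must include self-loops (diagonal True) so every row attends
               to at least one joint.
    """
    T, J, D = tokens.shape

    # Spatial branch: within each frame, a joint attends only to adjacent joints.
    spatial = np.empty_like(tokens)
    for t in range(T):
        x = tokens[t]                                   # (J, D)
        scores = x @ x.T / np.sqrt(D)                   # (J, J)
        scores = np.where(adjacency, scores, -np.inf)   # mask non-adjacent joints
        spatial[t] = softmax(scores, axis=-1) @ x

    # Temporal branch: each joint attends only to its own trajectory over time,
    # keeping joint trajectories independent of one another.
    temporal = np.empty_like(tokens)
    for j in range(J):
        temporal[:, j] = self_attention(tokens[:, j])   # (T, D)

    # Fuse the two decoupled branches (a simple sum here, for illustration).
    return spatial + temporal
```

Because the two branches never mix spatial and temporal axes in one attention map, the cost per joint stays linear in the number of frames rather than scaling with the full space-time token grid.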

