
STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion (2401.01730v1)

Published 3 Jan 2024 in cs.CV

Abstract: The recovery of 3D human mesh from monocular images has advanced significantly in recent years. However, existing models usually ignore spatial and temporal information, which may lead to mesh-image misalignment and temporal discontinuity. For this reason, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence clues from human motion via an attention-based Temporal Coherence Fusion Module (TCFM). As for spatial mesh-alignment evidence, we extract fine-grained local information through predicted mesh projection on the feature maps. Based on the spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition, we propose an Average Pooling Module (APM) to let the model focus on the entire input sequence rather than just the target frame, which remarkably improves the smoothness of recovery results from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF, which achieves a state-of-the-art trade-off between precision and smoothness. Our code and more video results are available on the project page: https://yw0208.github.io/staf/


Summary

  • The paper demonstrates that the STAF model significantly improves video-based 3D human mesh recovery by integrating spatio-temporal alignment.
  • The methodology employs three key modules—TCFM, SAFM, and APM—to address spatial misalignment and temporal discontinuity in mesh recovery.
  • Experimental results on benchmarks including 3DPW, MPII3D, and Human3.6M show superior performance compared to prior state-of-the-art models.

STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion

In recent years, the recovery of 3D human mesh from monocular images has advanced significantly. The paper "STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion" targets two persistent weaknesses of existing models: misalignment between the recovered mesh and the image, and temporal discontinuity across frames. The proposed Spatio-Temporal Alignment Fusion (STAF) model leverages attention-based mechanisms to enhance coherence and alignment across video frames, achieving a state-of-the-art trade-off between precision and smoothness.

Problem Statement and Motivation

Video-based human mesh recovery holds considerable promise for applications such as motion monitoring, virtual try-on, and VR. Despite recent progress, existing models often suffer from misalignment between the recovered mesh and the underlying image, as well as temporal discontinuity between frames. These shortcomings limit the practical usability of such models, particularly in time-sensitive applications. The paper addresses both problems by embedding spatio-temporal coherence into the recovery pipeline.

Methodology

The core contributions of this paper are encapsulated in the Spatio-Temporal Alignment Fusion (STAF) model. The methodology comprises three key components: the Temporal Coherence Fusion Module (TCFM), the Spatial Alignment Fusion Module (SAFM), and the Average Pooling Module (APM).

Temporal Coherence Fusion Module (TCFM): This module enhances the model's ability to capture long-range temporal dependencies without sacrificing the spatial coherence of the features. Unlike conventional approaches that struggle with long-range dependencies, TCFM employs a self-attention mechanism, supplemented by an additional self-similarity matrix. This matrix guides the encoding process, preserving more accurate temporal correlations.
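
The paper's exact formulation is more involved; the following PyTorch fragment is only a minimal sketch of the underlying idea, namely temporal self-attention whose logits are biased by a self-similarity matrix over frame features. The function name and the cosine-similarity form of the bias are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tcfm_attention(feats):
    # feats: (T, C) per-frame features for a clip of T frames.
    q = k = v = feats                              # shared projections for brevity
    logits = q @ k.t() / feats.shape[-1] ** 0.5    # (T, T) scaled dot-product logits
    # Self-similarity matrix (illustrative): cosine similarity between frames.
    sim = F.cosine_similarity(feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)
    weights = F.softmax(logits + sim, dim=-1)      # similarity-guided attention weights
    return weights @ v                             # temporally fused features, (T, C)

fused = tcfm_attention(torch.randn(16, 2048))      # e.g. 16 frames of 2048-d features
```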

Spatial Alignment Fusion Module (SAFM): The SAFM enhances the spatial feature representation of each target frame through a multi-stage adjacent feature fusion mechanism. By incorporating human spatial information extracted via projection sampling of the initial meshes onto the feature maps, the module effectively refines the mesh-alignment cues.
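
The projection-sampling step can be pictured with PyTorch's grid_sample: project the current mesh estimate into the image plane and read off features at the projected vertex locations, yielding per-vertex alignment evidence. This sketch assumes normalized 2D vertex coordinates and illustrates the mechanics rather than reproducing SAFM itself.

```python
import torch
import torch.nn.functional as F

def sample_mesh_features(feat_map, verts_2d):
    # feat_map: (1, C, H, W) spatial feature map of one frame.
    # verts_2d: (N, 2) projected mesh vertices, normalized to [-1, 1]
    #           as grid_sample expects.
    grid = verts_2d.view(1, 1, -1, 2)                  # (1, 1, N, 2) sampling grid
    sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (1, C, 1, N)
    return sampled.view(feat_map.shape[1], -1).t()     # (N, C) per-vertex features

feats = sample_mesh_features(torch.randn(1, 256, 56, 56),
                             torch.rand(6890, 2) * 2 - 1)  # 6890 SMPL vertices
```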

Average Pooling Module (APM): To address temporal discontinuity, the APM reduces the target frame's over-reliance on positional information by pooling features across the entire input sequence. This not only significantly enhances smoothness, but also improves the overall robustness and precision of the recovered meshes.
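
In its simplest reading, the APM averages frame features over the whole clip and fuses the result with the target frame's own feature. The sketch below captures that idea; the concatenation-based fusion is a hypothetical choice for illustration.

```python
import torch

def apm_fuse(frame_feats, target_idx):
    # frame_feats: (T, C) features for every frame of the input clip.
    context = frame_feats.mean(dim=0)            # (C,) clip-level average-pooled context
    target = frame_feats[target_idx]             # (C,) target-frame feature
    return torch.cat([target, context], dim=-1)  # (2C,) fused representation

fused = apm_fuse(torch.randn(16, 2048), target_idx=8)
```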

Experimental Evaluation

The experimental validation of STAF was conducted on three standard benchmark datasets: 3DPW, MPII3D, and Human3.6M. Compared to state-of-the-art models such as VIBE, TCMR, and MPS-Net, STAF demonstrated superior performance in terms of PA-MPJPE (Procrustes-aligned mean per-joint position error), MPJPE, and PVE (per-vertex error), while achieving a better trade-off between precision and smoothness.
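
For reference, the reported metrics follow standard definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, PA-MPJPE applies a rigid Procrustes alignment before measuring that distance, and the acceleration error compares second-order finite differences of the joint trajectories. A minimal NumPy sketch of two of these:

```python
import numpy as np

def mpjpe(pred, gt):
    # pred, gt: (T, J, 3) 3D joint positions in mm.
    # Mean per-joint position error: average Euclidean joint distance.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def accel_error(pred, gt):
    # Second-order finite differences approximate per-joint acceleration;
    # their mean discrepancy is the standard temporal-jitter metric.
    a_pred = pred[:-2] - 2 * pred[1:-1] + pred[2:]
    a_gt = gt[:-2] - 2 * gt[1:-1] + gt[2:]
    return np.linalg.norm(a_pred - a_gt, axis=-1).mean()
```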

Results on 3DPW and MPII3D: On 3DPW, STAF achieved a PA-MPJPE of 48.0 mm, an MPJPE of 80.6 mm, and a PVE of 95.3 mm, improving on previous models such as MPS-Net. Additionally, STAF's acceleration error of 8.2 mm/s² reflects a significant reduction in temporal jitter.

Results on Human3.6M: Evaluations on Human3.6M confirmed the robustness of STAF, with a PA-MPJPE of 44.5 mm and an MPJPE of 70.4 mm. Although the acceleration error was slightly higher than that of models like TCMR and MPS-Net, the precision metrics highlight the benefit of incorporating spatio-temporal alignment.

Implications and Future Work

The development of STAF provides a critical stepping stone in video-based human mesh recovery, addressing long-standing issues of temporal and spatial coherence. Practically, this can benefit applications requiring high precision and smoothness in human motion, such as VR, gaming, and surveillance systems.

Theoretically, the introduction of mechanisms like TCFM and SAFM paves the way for further research in integrating temporal and spatial data effectively. Future developments may explore the refinement of these modules or their application to other domains requiring spatio-temporal data processing. Exploring larger datasets and more diverse scenarios will also help generalize the approach and validate its applicability across various environments.

In conclusion, the STAF model presents a sophisticated and effective solution to the challenges in 3D human mesh recovery from video, demonstrating notable improvements in both precision and temporal smoothness. This work not only contributes to the immediate goals of human-centered computer vision but also opens avenues for future innovations in the field.
