STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion (2401.01730v1)
Abstract: The recovery of 3D human mesh from monocular images has advanced significantly in recent years. However, existing models usually ignore spatial and temporal information, which can lead to misalignment between the mesh and the image as well as temporal discontinuity. To address this, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence cues from human motion via an attention-based Temporal Coherence Fusion Module (TCFM). For spatial mesh-alignment evidence, it extracts fine-grained local information by projecting the predicted mesh onto the feature maps. Building on these spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition, we propose an Average Pooling Module (APM) that lets the model attend to the entire input sequence rather than only the target frame, which markedly improves the smoothness of results recovered from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF: it achieves a state-of-the-art trade-off between precision and smoothness. Our code and additional video results are available on the project page: https://yw0208.github.io/staf/
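To make the two main ideas concrete, here is a minimal, hypothetical PyTorch sketch of attention-based temporal fusion over a sliding window of per-frame features, followed by average pooling over the whole window. This is an illustration in the spirit of the TCFM and APM described above, not the authors' implementation; all module and variable names are assumptions.

```python
# Hypothetical sketch of temporal fusion + sequence-level average pooling.
# Not the STAF code: names, dimensions, and layer choices are illustrative.
import torch
import torch.nn as nn


class TemporalFusionSketch(nn.Module):
    """Fuses per-frame features (B, T, C) via self-attention (cf. TCFM),
    then average-pools over the window (cf. APM)."""

    def __init__(self, dim: int = 2048, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C) static features from a per-frame backbone
        fused, _ = self.attn(feats, feats, feats)  # temporal coherence cues
        fused = self.norm(fused + feats)           # residual connection
        # Pooling over T keeps the model attentive to the entire input
        # sequence rather than only the target (e.g. middle) frame.
        return fused.mean(dim=1)                   # (B, C)


if __name__ == "__main__":
    window = torch.randn(2, 9, 2048)               # 9-frame window, batch of 2
    print(TemporalFusionSketch()(window).shape)    # torch.Size([2, 2048])
```

The spatial mesh-alignment evidence can likewise be sketched: one common way to gather fine-grained local features is to project predicted mesh vertices into the image plane and bilinearly sample the feature map at those locations. Again, this is only a plausible reading of "predicted mesh projection on the feature maps"; the function name and coordinate convention are assumptions.

```python
import torch
import torch.nn.functional as F


def sample_mesh_aligned(feat_map: torch.Tensor, verts_2d: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample per-vertex features at projected mesh vertices.

    feat_map: (B, C, H, W) backbone feature map
    verts_2d: (B, N, 2) projected vertices in normalized [-1, 1] coordinates
    returns:  (B, N, C) mesh-aligned local features
    """
    grid = verts_2d.unsqueeze(2)                          # (B, N, 1, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (B, C, N, 1)
    return sampled.squeeze(-1).transpose(1, 2)            # (B, N, C)
```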