SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation (2404.15276v1)
Abstract: Existing Transformers for monocular 3D human shape and pose estimation typically have quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules, including a multi-scale attention module and a joint-aware attention module, to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by more than 10% with fewer than one-third of the parameters. Code and pretrained models are available at https://github.com/xuxy09/SMPLer.
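The complexity argument in the abstract can be made concrete with a small sketch (not the authors' code). Standard self-attention over an H×W feature map builds an N×N attention matrix with N = H·W, so cost grows quadratically with feature resolution; attending to the features from a small, fixed set of K learned target tokens (in the spirit of the paper's SMPL-based target representation, where targets correspond to a handful of model parameters rather than thousands of pixels) builds only a K×N matrix, linear in N. The token count K = 24 below is an illustrative choice (SMPL has 24 joints), not the paper's exact configuration.

```python
# Illustrative sketch: quadratic self-attention vs. linear query-based
# attention over a high-resolution feature map. Shapes and K are
# assumptions for illustration, not the SMPLer implementation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats):
    # feats: (N, C). Builds an (N, N) score matrix -> O(N^2) memory,
    # which is what makes full-resolution self-attention expensive.
    scores = feats @ feats.T / np.sqrt(feats.shape[1])
    return softmax(scores) @ feats

def query_attention(queries, feats):
    # queries: (K, C) with K << N. Score matrix is only (K, N),
    # so cost grows linearly with the feature length N.
    scores = queries @ feats.T / np.sqrt(feats.shape[1])
    return softmax(scores) @ feats

rng = np.random.default_rng(0)
N, K, C = 56 * 56, 24, 32            # high-res feature length vs. few target tokens
feats = rng.standard_normal((N, C))
queries = rng.standard_normal((K, C))

out = query_attention(queries, feats)
print(out.shape)                      # (24, 32): one refined token per target
```

With N = 3136, the full self-attention score matrix would hold about 9.8 million entries, versus roughly 75 thousand for the 24 query tokens; this gap is what lets query-style designs exploit high-resolution features.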
- K. Lin, L. Wang, and Z. Liu, “Mesh Graphormer,” in ICCV, 2021.
- ——, “End-to-end human pose and mesh reconstruction with transformers,” in CVPR, 2021.
- F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, “Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image,” in ECCV, 2016.
- A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, “End-to-end recovery of human shape and pose,” in CVPR, 2018.
- M. Kocabas, N. Athanasiou, and M. J. Black, “VIBE: Video inference for human body pose and shape estimation,” in CVPR, 2020.
- N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis, “Learning to reconstruct 3D human pose and shape via model-fitting in the loop,” in ICCV, 2019.
- N. Kolotouros, G. Pavlakos, and K. Daniilidis, “Convolutional mesh regression for single-image human shape reconstruction,” in CVPR, 2019.
- J. Rajasegaran, G. Pavlakos, A. Kanazawa, and J. Malik, “Tracking people with 3d representations,” in NeurIPS, 2021.
- ——, “Tracking people by predicting 3d appearance, location and pose,” in CVPR, 2022.
- J. Wang, Y. Zhong, Y. Li, C. Zhang, and Y. Wei, “Re-identification supervised texture generation,” in CVPR, 2019.
- X. Xu and C. C. Loy, “3d human texture estimation from a single image with transformers,” in ICCV, 2021.
- Z. Zheng, T. Yu, Y. Liu, and Q. Dai, “PaMIR: Parametric model-conditioned implicit representation for image-based human reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- A. Mir, T. Alldieck, and G. Pons-Moll, “Learning to transfer texture from clothing images to 3D humans,” in CVPR, 2020.
- Q. Ma, J. Yang, S. Tang, and M. J. Black, “The power of points for modeling humans in clothing,” in ICCV, 2021.
- T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll, “Learning to reconstruct people in clothing from a single RGB camera,” in CVPR, 2019.
- T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll, “Video based reconstruction of 3d people models,” in CVPR, 2018.
- T. Alldieck, G. Pons-Moll, C. Theobalt, and M. Magnor, “Tex2shape: Detailed full human body geometry from a single image,” in ICCV, 2019.
- T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll, “Detailed human avatars from monocular video,” in 3DV, 2018.
- R. Li, S. Yang, D. A. Ross, and A. Kanazawa, “AI Choreographer: Music conditioned 3d dance generation with AIST++,” in ICCV, 2021.
- F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, and Z. Liu, “AvatarCLIP: Zero-shot text-driven generation and animation of 3d avatars,” ACM Transactions on Graphics (SIGGRAPH), vol. 41, no. 4, pp. 1–19, 2022.
- S. Sanyal, A. Vorobiov, T. Bolkart, M. Loper, B. Mohler, L. S. Davis, J. Romero, and M. J. Black, “Learning realistic human reposing using cyclic self-supervision with 3d shape, pose, and appearance consistency,” in ICCV, 2021.
- A. Grigorev, K. Iskakov, A. Ianina, R. Bashirov, I. Zakharkin, A. Vakhitov, and V. Lempitsky, “StylePeople: A generative model of fullbody human avatars,” in CVPR, 2021.
- S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou, “Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans,” in CVPR, 2021.
- Y. Kwon, D. Kim, D. Ceylan, and H. Fuchs, “Neural human performer: Learning generalizable radiance fields for human performance rendering,” in NeurIPS, 2021.
- M. Chen, J. Zhang, X. Xu, L. Liu, Y. Cai, J. Feng, and S. Yan, “Geometry-guided progressive NeRF for generalizable and efficient neural human rendering,” in ECCV, 2022.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
- B. Bosquet, M. Mucientes, and V. M. Brea, “STDnet: Exploiting high resolution feature maps for small object detection,” Engineering Applications of Artificial Intelligence, vol. 91, p. 103615, 2020.
- K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in CVPR, 2019.
- J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
- M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “SMPL: A skinned multi-person linear model,” ACM Transactions on Graphics (SIGGRAPH Asia), vol. 34, no. 6, p. 248, 2015.
- C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2014.
- H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki, “Self-supervised learning of motion capture,” in NeurIPS, 2017.
- G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis, “Learning to estimate 3d human pose and shape from a single color image,” in CVPR, 2018.
- C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler, “Unite the people: Closing the loop between 3d and 2d human representations,” in CVPR, 2017.
- M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele, “Neural body fitting: Unifying deep learning and model based human pose and shape estimation,” in 3DV, 2018.
- R. A. Guler and I. Kokkinos, “HoloPose: Holistic 3d human reconstruction in-the-wild,” in CVPR, 2019.
- Y. Xu, S.-C. Zhu, and T. Tung, “DenseRaC: Joint 3d pose and shape estimation by dense render-and-compare,” in ICCV, 2019.
- W. Jiang, N. Kolotouros, G. Pavlakos, X. Zhou, and K. Daniilidis, “Coherent reconstruction of multiple humans from a single image,” in CVPR, 2020.
- T. Zhang, B. Huang, and Y. Wang, “Object-occluded human shape and pose estimation from a single color image,” in CVPR, 2020.
- W. Zeng, W. Ouyang, P. Luo, W. Liu, and X. Wang, “3d human mesh regression with dense correspondence,” in CVPR, 2020.
- A. Zanfir, E. G. Bazavan, H. Xu, W. T. Freeman, R. Sukthankar, and C. Sminchisescu, “Weakly supervised 3d human pose and shape reconstruction with normalizing flows,” in ECCV, 2020.
- J. Song, X. Chen, and O. Hilliges, “Human body model fitting by learned gradient descent,” in ECCV, 2020.
- H. Choi, G. Moon, and K. M. Lee, “Pose2Mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose,” in ECCV, 2020.
- G. Georgakis, R. Li, S. Karanam, T. Chen, J. Košecká, and Z. Wu, “Hierarchical kinematic human mesh recovery,” in ECCV, 2020.
- G. Moon and K. M. Lee, “I2L-MeshNet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single RGB image,” in ECCV, 2020.
- H. Zhang, J. Cao, G. Lu, W. Ouyang, and Z. Sun, “Learning 3d human shape and pose from dense body parts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- G. Moon, H. Choi, and K. M. Lee, “Accurate 3d hand pose estimation for whole-body 3d human mesh estimation,” in CVPR Workshops, 2022.
- J. Li, C. Xu, Z. Chen, S. Bian, L. Yang, and C. Lu, “HybrIK: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation,” in CVPR, 2021.
- A. Sengupta, I. Budvytis, and R. Cipolla, “Probabilistic 3d human shape and pose estimation from multiple unconstrained images in the wild,” in CVPR, 2021.
- I. Akhter and M. J. Black, “Pose-conditioned joint angle limits for 3d human pose reconstruction,” in CVPR, 2015.
- A. Zanfir, E. G. Bazavan, M. Zanfir, W. T. Freeman, R. Sukthankar, and C. Sminchisescu, “Neural descent for visual 3d human pose and shape,” in CVPR, 2021.
- H. Joo, N. Neverova, and A. Vedaldi, “Exemplar fine-tuning for 3d human model fitting towards in-the-wild 3d human pose estimation,” in 3DV, 2021.
- N. Kolotouros, G. Pavlakos, D. Jayaraman, and K. Daniilidis, “Probabilistic modeling for human mesh recovery,” in ICCV, 2021.
- S. K. Dwivedi, N. Athanasiou, M. Kocabas, and M. J. Black, “Learning to regress bodies from images using differentiable semantic rendering,” in ICCV, 2021.
- Y. Sun, Q. Bao, W. Liu, Y. Fu, M. J. Black, and T. Mei, “Monocular, one-stage, regression of multiple 3d people,” in ICCV, 2021.
- M. Zanfir, A. Zanfir, E. G. Bazavan, W. T. Freeman, R. Sukthankar, and C. Sminchisescu, “THUNDR: Transformer-based 3d human reconstruction with markers,” in ICCV, 2021.
- H. Zhang, Y. Tian, X. Zhou, W. Ouyang, Y. Liu, L. Wang, and Z. Sun, “PyMAF: 3d human pose and shape regression with pyramidal mesh alignment feedback loop,” in ICCV, 2021.
- M. Kocabas, C.-H. P. Huang, J. Tesch, L. Müller, O. Hilliges, and M. J. Black, “SPEC: Seeing people in the wild with an estimated camera,” in ICCV, 2021.
- M. Kocabas, C.-H. P. Huang, O. Hilliges, and M. J. Black, “PARE: Part attention regressor for 3d human body estimation,” in ICCV, 2021.
- A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik, “Learning 3d human dynamics from video,” in CVPR, 2019.
- A. Arnab, C. Doersch, and A. Zisserman, “Exploiting temporal context for 3d human pose estimation in the wild,” in CVPR, 2019.
- Y. Sun, Y. Ye, W. Liu, W. Gao, Y. Fu, and T. Mei, “Human mesh recovery from monocular images via a skeleton-disentangled representation,” in ICCV, 2019.
- C. Doersch and A. Zisserman, “Sim2real transfer learning for 3d human pose estimation: motion to the rescue,” in NeurIPS, 2019.
- Z. Luo, S. A. Golestaneh, and K. M. Kitani, “3d human motion estimation via motion compression and refinement,” in ACCV, 2020.
- H. Choi, G. Moon, J. Y. Chang, and K. M. Lee, “Beyond static features for temporally consistent 3d human pose and shape from a video,” in CVPR, 2021.
- G.-H. Lee and S.-W. Lee, “Uncertainty-aware human mesh recovery from video by learning part-based 3d dynamics,” in ICCV, 2021.
- Z. Wan, Z. Li, M. Tian, J. Liu, S. Yi, and H. Li, “Encoder-decoder with multi-level attention for 3d human shape and pose estimation,” in ICCV, 2021.
- X. Xu, H. Chen, F. Moreno-Noguer, L. A. Jeni, and F. De la Torre, “3D human shape and pose from a single low-resolution image with self-supervised learning,” in ECCV, 2020.
- ——, “3D human pose, shape and texture from low-resolution images and videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- F. Moreno-Noguer, “3d human pose estimation from a single image via distance matrix regression,” in CVPR, 2017.
- Y. Rong, T. Shiratori, and H. Joo, “FrankMocap: A monocular 3d whole-body pose estimation system via regression and integration,” in ICCV Workshops, 2021.
- A. Davydov, A. Remizova, V. Constantin, S. Honari, M. Salzmann, and P. Fua, “Adversarial parametric pose prior,” in CVPR, 2022.
- G. Tiwari, D. Antic, J. E. Lenssen, N. Sarafianos, T. Tung, and G. Pons-Moll, “Pose-ndf: Modeling human pose manifolds with neural distance fields,” in ECCV, 2022.
- “CMU graphics lab motion capture database,” http://mocap.cs.cmu.edu/, 2010.
- L. Liu, X. Xu, Z. Lin, J. Liang, and S. Yan, “Towards garment sewing pattern reconstruction from a single image,” ACM Transactions on Graphics (SIGGRAPH Asia), 2023.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV, 2020.
- H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, “Pre-trained image processing transformer,” in CVPR, 2021.
- Z. Shi, X. Xu, X. Liu, J. Chen, and M.-H. Yang, “Video frame interpolation transformer,” in CVPR, 2022.
- H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” in ICML, 2019.
- Y. Jiang, S. Chang, and Z. Wang, “TransGAN: Two pure transformers can make one strong GAN, and that can scale up,” in NeurIPS, 2021.
- J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016.
- J. S. Dai, “Euler–rodrigues formula variations, quaternion conjugation and intrinsic connections,” Mechanism and Machine Theory, vol. 92, pp. 144–152, 2015.
- I. T. Jolliffe and J. Cadima, “Principal component analysis: a review and recent developments,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2065, p. 20150202, 2016.
- Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” in CVPR, 2019.
- T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll, “Recovering accurate 3d human pose in the wild using imus and a moving camera,” in ECCV, 2018.
- C. Zhang, A. Gupta, and A. Zisserman, “Temporal query networks for fine-grained video understanding,” in CVPR, 2021.
- C. Zheng, X. Liu, G.-J. Qi, and C. Chen, “POTTER: Pooling attention transformer for efficient human mesh recovery,” in CVPR, 2023.
- D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll, and C. Theobalt, “Single-shot multi-person 3d pose estimation from monocular RGB,” in 3DV, 2018.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in ECCV, 2014.
- M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in CVPR, 2014.
- P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
- “Mixamo,” https://www.mixamo.com/.
- D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume,” in CVPR, 2018.
- S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in CVPR, 2017.
- B. Biggs, O. Boyne, J. Charles, A. Fitzgibbon, and R. Cipolla, “Who left the dogs out?: 3D animal reconstruction with expectation maximization in the loop,” in ECCV, 2020.
- G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3d hands, face, and body from a single image,” in CVPR, 2019.
- S. Zuffi, A. Kanazawa, D. Jacobs, and M. J. Black, “3D menagerie: Modeling the 3D shape and pose of animals,” in CVPR, 2017.
- J. Romero, D. Tzionas, and M. J. Black, “Embodied hands: Modeling and capturing hands and bodies together,” ACM Transactions on Graphics (SIGGRAPH Asia), vol. 36, no. 6, 2017.
- Y. Yuan, R. Fu, L. Huang, W. Lin, C. Zhang, X. Chen, and J. Wang, “HRFormer: High-resolution transformer for dense prediction,” in NeurIPS, 2021.