Recovering 3D Human Mesh from Monocular Images: A Survey (arXiv:2203.01923v6)
Abstract: Estimating human pose and shape from monocular images is a long-standing problem in computer vision. Since the release of statistical body models, 3D human mesh recovery has drawn broader attention. With the shared goal of obtaining well-aligned and physically plausible mesh results, two paradigms have been developed to overcome challenges in the 2D-to-3D lifting process: i) an optimization-based paradigm, where different data terms and regularization terms are exploited as optimization objectives; and ii) a regression-based paradigm, where deep learning techniques are embraced to solve the problem in an end-to-end fashion. Meanwhile, continuous efforts have been devoted to improving the quality of 3D mesh labels for a wide range of datasets. Though remarkable progress has been achieved in the past decade, the task remains challenging due to flexible body motions, diverse appearances, complex environments, and insufficient in-the-wild annotations. To the best of our knowledge, this is the first survey focused on the task of monocular 3D human mesh recovery. We start with an introduction to body models and then elaborate on recovery frameworks and training objectives, providing in-depth analyses of their strengths and weaknesses. We also summarize datasets, evaluation metrics, and benchmark results. Open issues and future directions are discussed at the end, in the hope of motivating researchers and facilitating their research in this area. A regularly updated project page can be found at https://github.com/tinatiansjz/hmr-survey.
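To make the optimization-based paradigm concrete, below is a minimal sketch of a SMPLify-style fitting loop (in the spirit of Bogo et al., ECCV 2016): a confidence-weighted 2D reprojection data term plus pose- and shape-regularization terms, minimized by gradient descent over body model parameters. The helpers `smpl_forward`, `project`, and `pose_prior`, together with the initializations and loss weights, are illustrative assumptions for this sketch rather than any library's actual API. The regression-based paradigm removes this per-image inner loop and instead trains a network to predict the same parameters directly from pixels.

```python
# Hedged sketch of the optimization-based paradigm (SMPLify-style fitting),
# not the survey's reference implementation. smpl_forward / project /
# pose_prior are assumed user-supplied callables.
import torch

def fit_smpl(keypoints_2d, conf, smpl_forward, project, pose_prior,
             n_iters=100, w_pose=4.0, w_shape=5.0):
    """Fit body model parameters to detected 2D joints.

    keypoints_2d: (J, 2) detected 2D joints; conf: (J,) detection confidences.
    smpl_forward: (pose, shape) -> (J, 3) 3D joints      [assumed helper]
    project:      (joints_3d, cam) -> (J, 2) pixel coords [assumed helper]
    pose_prior:   scalar penalty on unlikely poses, e.g. a GMM prior
                                                         [assumed helper]
    """
    pose = torch.zeros(72, requires_grad=True)   # SMPL axis-angle pose
    shape = torch.zeros(10, requires_grad=True)  # SMPL shape coefficients
    cam = torch.tensor([0.0, 0.0, 5.0],          # rough camera translation
                       requires_grad=True)

    opt = torch.optim.Adam([pose, shape, cam], lr=0.01)
    for _ in range(n_iters):
        opt.zero_grad()
        joints_2d = project(smpl_forward(pose, shape), cam)
        # Data term: confidence-weighted 2D reprojection error.
        e_data = (conf[:, None] * (joints_2d - keypoints_2d) ** 2).sum()
        # Regularization terms: pose prior and shape magnitude penalty.
        e_reg = w_pose * pose_prior(pose) + w_shape * (shape ** 2).sum()
        (e_data + e_reg).backward()
        opt.step()
    return pose.detach(), shape.detach(), cam.detach()
```

In practice, methods of this family differ mainly in which data terms (2D joints, silhouettes, dense correspondences) and which priors (pose, shape, interpenetration) enter the objective, and in how the camera and initialization are handled.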