XHand: Real-time Expressive Hand Avatar (2407.21002v1)
Abstract: Hand avatars play a pivotal role in a wide array of digital interfaces, enhancing user immersion and facilitating natural interaction within virtual environments. While previous studies have focused on photo-realistic hand rendering, little attention has been paid to reconstructing hand geometry with fine details, which is essential to rendering quality. In the realms of extended reality and gaming, on-the-fly rendering is imperative. To this end, we introduce an expressive hand avatar, named XHand, designed to comprehensively generate hand shape, appearance, and deformations in real time. To obtain fine-grained hand meshes, we employ three feature embedding modules that predict hand deformation displacements, albedo, and linear blend skinning weights, respectively. To achieve photo-realistic rendering on these fine-grained meshes, our method uses a mesh-based neural renderer that leverages mesh topological consistency and the latent codes from the embedding modules. During training, a part-aware Laplace smoothing strategy applies distinct levels of regularization to each hand part, effectively preserving necessary details while eliminating undesired artifacts. Experimental evaluations on the InterHand2.6M and DeepHandMesh datasets demonstrate the efficacy of XHand, which recovers high-fidelity geometry and texture for hand animations across diverse poses in real time. To facilitate reproduction of our results, the full implementation will be made publicly available at https://github.com/agnJason/XHand.
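To make the part-aware smoothing idea concrete, below is a minimal PyTorch sketch of a Laplacian (umbrella-operator) smoothing penalty whose strength varies per hand part. The function name, tensor layout, neighbor padding scheme, and the specific weight values are assumptions made for illustration, not the paper's actual implementation; the only idea taken from the abstract is that different mesh regions receive different levels of regularization.

```python
import torch

def part_aware_laplacian_loss(verts, neighbors, part_weights):
    """Per-vertex Laplacian smoothing penalty scaled by a part-dependent weight.

    verts:        (V, 3) float tensor of mesh vertex positions
    neighbors:    (V, K) long tensor of one-ring neighbor indices,
                  padded with -1 where a vertex has fewer than K neighbors
    part_weights: (V,) float tensor; larger values smooth a region harder
    """
    valid = (neighbors >= 0).float().unsqueeze(-1)             # (V, K, 1) mask for padded slots
    neighbor_pos = verts[neighbors.clamp(min=0)]                # (V, K, 3) gathered neighbor positions
    centroid = (neighbor_pos * valid).sum(1) / valid.sum(1).clamp(min=1.0)
    laplacian = verts - centroid                                # umbrella-operator residual per vertex
    return (part_weights * laplacian.pow(2).sum(dim=-1)).mean()


# Toy usage (all values hypothetical): smooth "palm" vertices harder than the rest.
verts = torch.rand(778, 3, requires_grad=True)                  # MANO-sized hand mesh
neighbors = torch.randint(0, 778, (778, 6))                     # placeholder connectivity
part_weights = torch.full((778,), 0.1)
part_weights[:200] = 1.0                                        # pretend the first 200 vertices are palm
loss = part_aware_laplacian_loss(verts, neighbors, part_weights)
loss.backward()
```

In practice, such a term would be added to the photometric and geometric training losses, with the per-part weights chosen so that detail-rich regions (e.g., knuckles, nails) keep their displacement detail while flatter regions stay artifact-free.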