XHand: Real-time Expressive Hand Avatar (2407.21002v1)

Published 30 Jul 2024 in cs.CV and cs.AI

Abstract: Hand avatars play a pivotal role in a wide array of digital interfaces, enhancing user immersion and facilitating natural interaction within virtual environments. While previous studies have focused on photo-realistic hand rendering, little attention has been paid to reconstruct the hand geometry with fine details, which is essential to rendering quality. In the realms of extended reality and gaming, on-the-fly rendering becomes imperative. To this end, we introduce an expressive hand avatar, named XHand, that is designed to comprehensively generate hand shape, appearance, and deformations in real-time. To obtain fine-grained hand meshes, we make use of three feature embedding modules to predict hand deformation displacements, albedo, and linear blending skinning weights, respectively. To achieve photo-realistic hand rendering on fine-grained meshes, our method employs a mesh-based neural renderer by leveraging mesh topological consistency and latent codes from embedding modules. During training, a part-aware Laplace smoothing strategy is proposed by incorporating the distinct levels of regularization to effectively maintain the necessary details and eliminate the undesired artifacts. The experimental evaluations on InterHand2.6M and DeepHandMesh datasets demonstrate the efficacy of XHand, which is able to recover high-fidelity geometry and texture for hand animations across diverse poses in real-time. To reproduce our results, we will make the full implementation publicly available at https://github.com/agnJason/XHand.

References (62)
  1. B. Doosti, S. Naha, M. Mirbagheri, and D. J. Crandall, “Hope-net: A graph-based model for hand-object pose estimation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 6607–6616.
  2. Y. Hasson, B. Tekin, F. Bogo, I. Laptev, M. Pollefeys, and C. Schmid, “Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 571–580.
  3. H. Fan, T. Zhuo, X. Yu, Y. Yang, and M. Kankanhalli, “Understanding atomic hand-object interaction with human intention,” IEEE Trans. Circuit Syst. Video Technol., vol. 32, no. 1, pp. 275–285, 2021.
  4. H. Cheng, L. Yang, and Z. Liu, “Survey on 3d hand gesture recognition,” IEEE Trans. Circuit Syst. Video Technol., vol. 26, no. 9, pp. 1659–1673, 2015.
  5. G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3d hands, face, and body from a single image,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 10 975–10 985.
  6. K. Karunratanakul, S. Prokudin, O. Hilliges, and S. Tang, “Harp: Personalized hand reconstruction from a monocular rgb video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 12 802–12 813.
  7. Y. Li, L. Zhang, Z. Qiu, Y. Jiang, N. Li, Y. Ma, Y. Zhang, L. Xu, and J. Yu, “NIMBLE: a non-rigid hand model with bones and muscles,” ACM Trans. on Graph., pp. 120:1–120:16, 2022.
  8. A. Mundra, J. Wang, M. Habermann, C. Theobalt, M. Elgharib et al., “Livehand: Real-time and photorealistic neural hand rendering,” in Int. Conf. Comput. Vis., 2023, pp. 18 035–18 045.
  9. J. Romero, D. Tzionas, and M. J. Black, “Embodied hands: Modeling and capturing hands and bodies together,” ACM Trans. on Graph., pp. 245:1–245:17, 2017.
  10. M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “SMPL: a skinned multi-person linear model,” ACM Trans. on Graph., pp. 248:1–248:16, 2015.
  11. Z. Cao, I. Radosavovic, A. Kanazawa, and J. Malik, “Reconstructing hand-object interactions in the wild,” in Int. Conf. Comput. Vis., 2021, pp. 12 397–12 406.
  12. G. M. Lim, P. Jatesiktat, and W. T. Ang, “Mobilehand: Real-time 3d hand shape and pose estimation from color image,” in International Conference on Neural Information Processing, 2020, pp. 450–459.
  13. T. Alldieck, H. Xu, and C. Sminchisescu, “imghum: Implicit generative models of 3d human shape and articulated pose,” in Int. Conf. Comput. Vis., 2021, pp. 5441–5450.
  14. J. Ren and J. Zhu, “Pyramid deep fusion network for two-hand reconstruction from rgb-d images,” IEEE Trans. Circuit Syst. Video Technol., 2024.
  15. S. Guo, E. Rigall, Y. Ju, and J. Dong, “3d hand pose estimation from monocular rgb with feature interaction module,” IEEE Trans. Circuit Syst. Video Technol., vol. 32, no. 8, pp. 5293–5306, 2022.
  16. E. Corona, T. Hodan, M. Vo, F. Moreno-Noguer, C. Sweeney, R. Newcombe, and L. Ma, “Lisa: Learning implicit shape and appearance of hands,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 20 501–20 511.
  17. H. Choi, G. Moon, and K. M. Lee, “Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose,” in Eur. Conf. Comput. Vis., 2020, pp. 769–787.
  18. P. Chen, Y. Chen, D. Yang, F. Wu, Q. Li, Q. Xia, and Y. Tan, “I2uv-handnet: Image-to-uv prediction network for accurate and high-fidelity 3d hand mesh modeling,” in Int. Conf. Comput. Vis., 2021, pp. 12 909–12 918.
  19. G. Moon, T. Shiratori, and K. M. Lee, “Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling,” in Eur. Conf. Comput. Vis., 2020, pp. 440–455.
  20. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, pp. 99–106, 2021.
  21. P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,” Adv. Neural Inform. Process. Syst., vol. 34, pp. 27 171–27 183, 2021.
  22. C.-Y. Weng, B. Curless, P. P. Srinivasan, J. T. Barron, and I. Kemelmacher-Shlizerman, “Humannerf: Free-viewpoint rendering of moving people from monocular video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 16 210–16 220.
  23. X. Chen, Y. Zheng, M. J. Black, O. Hilliges, and A. Geiger, “SNARF: differentiable forward skinning for animating non-rigid neural implicit shapes,” in Int. Conf. Comput. Vis., 2021, pp. 11 574–11 584.
  24. L. Liu, M. Habermann, V. Rudnev, K. Sarkar, J. Gu, and C. Theobalt, “Neural actor: Neural free-view synthesis of human actors with pose control,” ACM Trans. on Graph., pp. 1–16, 2021.
  25. S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou, “Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 9054–9063.
  26. Z. Guo, W. Zhou, M. Wang, L. Li, and H. Li, “Handnerf: Neural radiance fields for animatable interacting hands,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 21 078–21 087.
  27. X. Chen, B. Wang, and H.-Y. Shum, “Hand avatar: Free-pose hand animation and rendering from monocular video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 8683–8693.
  28. G. Yang, C. Wang, N. D. Reddy, and D. Ramanan, “Reconstructing animatable categories from videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 16 995–17 005.
  29. H. Luo, T. Xu, Y. Jiang, C. Zhou, Q. Qiu, Y. Zhang, W. Yang, L. Xu, and J. Yu, “Artemis: Articulated neural pets with appearance and motion synthesis,” ACM Trans. on Graph., pp. 164:1–164:19, 2022.
  30. S. Wu, R. Li, T. Jakab, C. Rupprecht, and A. Vedaldi, “Magicpony: Learning articulated 3d animals in the wild,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 8792–8802.
  31. C. Cao, T. Simon, J. K. Kim, G. Schwartz, M. Zollhöfer, S. Saito, S. Lombardi, S. Wei, D. Belko, S. Yu, Y. Sheikh, and J. M. Saragih, “Authentic volumetric avatars from a phone scan,” ACM Trans. on Graph., pp. 163:1–163:19, 2022.
  32. Y. Zheng, W. Yifan, G. Wetzstein, M. J. Black, and O. Hilliges, “Pointavatar: Deformable point-based head avatars from videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 21 057–21 067.
  33. Y. Zheng, V. F. Abrevaya, M. C. Bühler, X. Chen, M. J. Black, and O. Hilliges, “I M avatar: Implicit morphable head avatars from videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 13 535–13 545.
  34. P. Grassal, M. Prinzler, T. Leistner, C. Rother, M. Nießner, and J. Thies, “Neural head avatars from monocular RGB videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 18 632–18 643.
  35. X. Gao, C. Zhong, J. Xiang, Y. Hong, Y. Guo, and J. Zhang, “Reconstructing personalized semantic facial nerf models from monocular video,” ACM Trans. on Graph., pp. 200:1–200:12, 2022.
  36. G. Yang, M. Vo, N. Neverova, D. Ramanan, A. Vedaldi, and H. Joo, “Banmo: Building animatable 3d neural models from many casual videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 2853–2863.
  37. M. Habermann, L. Liu, W. Xu, M. Zollhöfer, G. Pons-Moll, and C. Theobalt, “Real-time deep dynamic characters,” ACM Trans. on Graph., pp. 94:1–94:16, 2021.
  38. F. Xu, Y. Liu, C. Stoll, J. Tompkin, G. Bharaj, Q. Dai, H. Seidel, J. Kautz, and C. Theobalt, “Video-based characters: Creating new human performances from a multi-view video database,” ACM Trans. on Graph., p. 32, 2011.
  39. S. Peng, S. Zhang, Z. Xu, C. Geng, B. Jiang, H. Bao, and X. Zhou, “Animatable neural implicit surfaces for creating avatars from videos,” CoRR, vol. abs/2203.08133, 2022.
  40. B. L. Bhatnagar, C. Sminchisescu, C. Theobalt, and G. Pons-Moll, “Loopreg: Self-supervised learning of implicit surface correspondences, pose and shape for 3d human mesh registration,” in Adv. Neural Inform. Process. Syst., 2020, pp. 12 909–12 922.
  41. G. Moon, S.-I. Yu, H. Wen, T. Shiratori, and K. M. Lee, “Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image,” in Eur. Conf. Comput. Vis., 2020, pp. 548–564.
  42. G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik, “Reconstructing hands in 3d with transformers,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 9826–9836.
  43. A. Boukhayma, R. de Bem, and P. H. Torr, “3d hand shape and pose from images in the wild,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 10 835–10 844.
  44. Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid, “Learning joint reconstruction of hands and manipulated objects,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 11 807–11 816.
  45. D. Kong, L. Zhang, L. Chen, H. Ma, X. Yan, S. Sun, X. Liu, K. Han, and X. Xie, “Identity-aware hand mesh estimation and personalization from rgb images,” in Eur. Conf. Comput. Vis., 2022, pp. 536–553.
  46. J. Ren, J. Zhu, and J. Zhang, “End-to-end weakly-supervised single-stage multiple 3d hand mesh reconstruction from a single rgb image,” Computer Vision and Image Understanding, p. 103706, 2023.
  47. H. Sun, X. Zheng, P. Ren, J. Wang, Q. Qi, and J. Liao, “Smr: Spatial-guided model-based regression for 3d hand pose and mesh reconstruction,” IEEE Trans. Circuit Syst. Video Technol., vol. 34, no. 1, pp. 299–314, 2023.
  48. M. Li, J. Wang, and N. Sang, “Latent distribution-based 3d hand pose estimation from monocular rgb images,” IEEE Trans. Circuit Syst. Video Technol., vol. 31, no. 12, pp. 4883–4894, 2021.
  49. M. Oren and S. K. Nayar, “Generalization of lambert’s reflectance model,” in Proc. Int. Conf. Comput. Graph. Intera. Tech., 1994, pp. 239–246.
  50. X. Chen, Y. Liu, Y. Dong, X. Zhang, C. Ma, Y. Xiong, Y. Zhang, and X. Guo, “Mobrecon: Mobile-friendly hand mesh reconstruction from monocular image,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 20 544–20 554.
  51. Q. Gan, W. Li, J. Ren, and J. Zhu, “Fine-grained multi-view hand reconstruction using inverse rendering,” in AAAI, 2024.
  52. T. Luan, Y. Zhai, J. Meng, Z. Li, Z. Chen, Y. Xu, and J. Yuan, “High fidelity 3d hand shape reconstruction via scalable graph frequency decomposition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 16 795–16 804.
  53. H. Zhu, Y. Liu, J. Fan, Q. Dai, and X. Cao, “Video-based outdoor human reconstruction,” IEEE Trans. Circuit Syst. Video Technol., vol. 27, no. 4, pp. 760–770, 2016.
  54. K. Shen, C. Guo, M. Kaufmann, J. J. Zarate, J. Valentin, J. Song, and O. Hilliges, “X-avatar: Expressive human avatars,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 16 911–16 921.
  55. B. K. P. Horn, “Shape from shading; a method for obtaining the shape of a smooth opaque object from one view,” Ph.D. dissertation, Massachusetts Institute of Technology, USA, 1970.
  56. S. Laine, J. Hellsten, T. Karras, Y. Seol, J. Lehtinen, and T. Aila, “Modular primitives for high-performance differentiable rendering,” ACM Trans. on Graph., pp. 194:1–194:14, 2020.
  57. K. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, and V. S. Lempitsky, “Neural point-based graphics,” in Eur. Conf. Comput. Vis., 2020, pp. 696–712.
  58. L. Lin, S. Peng, Q. Gan, and J. Zhu, “Fasthuman: Reconstructing high-quality clothed human in minutes,” in International Conference on 3D Vision, 2024.
  59. A. Nealen, T. Igarashi, O. Sorkine, and M. Alexa, “Laplacian mesh optimization,” in Proc. Int. Conf. Comput. Graph. Intera. Tech., 2006, pp. 381–389.
  60. B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Trans. on Graph., pp. 1–14, 2023.
  61. E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis et al., “Efficient geometry-aware 3d generative adversarial networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 16 123–16 133.
  62. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.

Summary

  • The paper's main contribution is the XHand framework that integrates feature embedding modules and a mesh-based neural renderer to capture fine hand details in real time.
  • It employs part-aware Laplace smoothing and demonstrates state-of-the-art performance with a PSNR of 34.32 dB on InterHand2.6M and a rendering speed of 56 fps.
  • The work sets a new benchmark for lifelike hand avatars, offering enhanced visual fidelity and practical applications in VR, gaming, and telepresence.

Overview of XHand: Real-time Expressive Hand Avatar

The paper "XHand: Real-time Expressive Hand Avatar" presents a novel framework designed to achieve both highly detailed hand geometry and photorealistic rendering in real-time. The authors, Gan et al., address crucial challenges in hand avatar modeling by introducing a comprehensive methodology that combines feature embedding modules and a mesh-based neural renderer, leveraging the MANO model for hand pose and shape parameters. The proposal advances the state of the art by emphasizing fine-grained geometry while maintaining real-time rendering capabilities.

Technical Contributions

The paper's key contribution is XHand, an animatable hand model that balances geometric detail with computational efficiency. This is accomplished by integrating several components:

  1. Feature Embedding Modules: The authors propose three feature embedding modules that predict per-vertex deformation displacements, albedo, and linear blend skinning weights, respectively. These modules separate pose-driven features from the average hand mesh features, simplifying the task of capturing dynamic hand geometry and texture under varying poses (a minimal sketch of one such module, together with the part-aware smoothing from item 3, follows this list).
  2. Mesh-based Neural Rendering: By employing a mesh-based neural renderer, XHand bypasses the heavy computational demands of volumetric approaches while preserving the mesh's topological consistency. This yields high visual fidelity and detail preservation without compromising rendering speed.
  3. Part-aware Laplace Smoothing: To suppress artifacts while retaining intricate mesh detail, a part-aware Laplace smoothing strategy is applied during training. It assigns hierarchical weights that adapt the regularization strength to the geometric complexity and pose-specific variation of each hand part, helping preserve fine detail where it matters.
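
To make items 1 and 3 concrete, here is a minimal PyTorch sketch of one feature embedding module and of a part-aware Laplacian regularizer. Layer sizes, the per-vertex latent parameterization, and the per-part weighting scheme are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class FeatureEmbeddingModule(nn.Module):
    """Predicts one per-vertex quantity (e.g. displacement, albedo, or LBS
    weights) from a learned per-vertex latent code plus the hand pose."""

    def __init__(self, num_verts, latent_dim=32, pose_dim=48, out_dim=3):
        super().__init__()
        # Per-vertex latent codes model the pose-independent "average" hand.
        self.vertex_latents = nn.Parameter(0.01 * torch.randn(num_verts, latent_dim))
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + pose_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, pose):
        # pose: (pose_dim,) MANO pose vector, broadcast to every vertex.
        pose_feat = pose.unsqueeze(0).expand(self.vertex_latents.shape[0], -1)
        return self.mlp(torch.cat([self.vertex_latents, pose_feat], dim=-1))


def part_aware_laplacian_loss(verts, laplacian, part_weights):
    """Laplacian smoothing with a different strength per hand part, e.g.
    weaker around knuckles and nails, stronger on smooth regions like the palm.

    verts:        (V, 3) current mesh vertices
    laplacian:    (V, V) sparse uniform Laplacian of the template mesh
    part_weights: (V,)   per-vertex regularization weight from part labels
    """
    residual = torch.sparse.mm(laplacian, verts)  # (V, 3) Laplacian of the vertices
    return (part_weights.unsqueeze(-1) * residual.pow(2)).sum(dim=-1).mean()
```

In this reading, three such modules share the pose input but keep separate latent codes and output heads, and the smoothing term is added to the training losses as a regularizer.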

Experimental Evaluation and Results

XHand is evaluated on the InterHand2.6M and DeepHandMesh datasets, where it delivers superior performance in both rendering and geometry reconstruction. Quantitatively, it achieves state-of-the-art results, including a PSNR of 34.32 dB on InterHand2.6M. The model remains robust across diverse poses and outperforms existing methods such as LiveHand and HandNeRF in both photorealism and computational efficiency, reaching a rendering speed of 56 frames per second.
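
For reference, PSNR compares a rendered image against the ground-truth photograph as 10 · log10(MAX² / MSE); a minimal implementation, assuming images normalized to [0, 1], is:

```python
import numpy as np

def psnr(rendered, ground_truth, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

On that scale, 34.32 dB corresponds to a mean squared error of roughly 3.7 × 10⁻⁴.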

The experiments also highlight XHand's ability to produce high-fidelity meshes with enhanced detail, validated against 3D ground truth from DeepHandMesh, where it reduces the average point-to-surface (P2S) error relative to other contemporary methods.
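
P2S error measures how far a set of ground-truth surface points lies from the reconstructed mesh (conventions on sampling direction and units vary between papers). A minimal sketch using the trimesh library, with hypothetical file names, is:

```python
import numpy as np
import trimesh

def point_to_surface_error(points, mesh):
    """Mean unsigned distance from each query point to its closest point on the mesh."""
    _, distances, _ = trimesh.proximity.closest_point(mesh, points)
    return float(np.mean(distances))

# Hypothetical usage: compare a reconstruction against points sampled from a ground-truth scan.
# recon = trimesh.load("xhand_prediction.obj")
# gt_points = trimesh.load("ground_truth_scan.obj").sample(10000)
# print(point_to_surface_error(gt_points, recon))
```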

Implications and Future Work

The implications of XHand are multifaceted, impacting various domains such as virtual reality, gaming, and telepresence, where accurate and expressive hand representations significantly enhance user experience. The proposed framework lays a foundation for future explorations into personalized hand avatar systems that can adapt to varying dynamic conditions, potentially extending beyond current applications to include more holistic human body representations.

Future work could explore the integration of advanced neural rendering techniques, such as those incorporating complex material properties or lighting conditions, to further refine the visual realism of hand models across disparate environments. Additionally, further efforts to optimize the feature embedding modules could provide an avenue for scaling these techniques to broader and more complex applications in virtual environments.

In conclusion, XHand sets a new benchmark in expressive hand avatar modeling, balancing detail, realism, and speed. It exemplifies a significant step forward in achieving lifelike digital hand representations, with promising potential for future advancements and applications in AI-driven interactive systems.
