
HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions (2403.18575v1)

Published 27 Mar 2024 in cs.CV

Abstract: Reconstructing 3D hand mesh robustly from a single image is very challenging, due to the lack of diversity in existing real-world datasets. While data synthesis helps relieve the issue, the syn-to-real gap still hinders its usage. In this work, we present HandBooster, a new approach to uplift the data diversity and boost the 3D hand-mesh reconstruction performance by training a conditional generative space on hand-object interactions and purposely sampling the space to synthesize effective data samples. First, we construct versatile content-aware conditions to guide a diffusion model to produce realistic images with diverse hand appearances, poses, views, and backgrounds; favorably, accurate 3D annotations are obtained for free. Then, we design a novel condition creator based on our similarity-aware distribution sampling strategies to deliberately find novel and realistic interaction poses that are distinctive from the training set. Equipped with our method, several baselines can be significantly improved beyond the SOTA on the HO3D and DexYCB benchmarks. Our code will be released on https://github.com/hxwork/HandBooster_Pytorch.
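
To make the idea of sampling poses that are "distinctive from the training set" yet still realistic more concrete, here is a minimal illustrative sketch. It is not the HandBooster implementation: the abstract does not specify the similarity measure or the sampling procedure, so this example assumes poses are flat numeric vectors, uses plain Euclidean distance, and applies a greedy farthest-point selection with a hypothetical `max_distance` plausibility cutoff. The function names `nearest_distance` and `select_novel_poses` are made up for illustration.

```python
import math


def nearest_distance(pose, reference_poses):
    """Euclidean distance from `pose` to its nearest neighbour in `reference_poses`."""
    return min(math.dist(pose, ref) for ref in reference_poses)


def select_novel_poses(candidates, training_poses, k, max_distance):
    """Greedily pick up to `k` candidate poses that are far from the training
    poses and from each other, while staying within `max_distance` of at
    least one training pose (an assumed proxy for "still realistic")."""
    # Discard candidates that drift too far from any real pose.
    plausible = [
        c for c in candidates
        if nearest_distance(c, training_poses) <= max_distance
    ]
    selected = []
    reference = list(training_poses)
    while plausible and len(selected) < k:
        # Farthest-point step: take the candidate least similar to everything
        # already covered (training poses plus earlier picks).
        best = max(plausible, key=lambda c: nearest_distance(c, reference))
        selected.append(best)
        reference.append(best)
        plausible.remove(best)
    return selected


# Toy 2-D "poses"; real hand poses would be much higher-dimensional.
training = [(0.0, 0.0), (1.0, 0.0)]
candidates = [(0.1, 0.0), (0.5, 1.0), (5.0, 5.0), (1.0, 0.5)]
picked = select_novel_poses(candidates, training, k=2, max_distance=1.5)
```

In this toy run the candidate `(5.0, 5.0)` is rejected as implausibly far from the data, and the near-duplicate `(0.1, 0.0)` loses out to candidates that cover new regions of the pose space.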
