
RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance (2404.13984v2)

Published 22 Apr 2024 in cs.CV

Abstract: Although diffusion models can generate high-quality human images, their applications are limited by the instability in generating hands with correct structures. In this paper, we introduce RHanDS, a conditional diffusion-based framework designed to refine malformed hands by utilizing decoupled structure and style guidance. The hand mesh reconstructed from the malformed hand offers structure guidance for correcting the structure of the hand, while the malformed hand itself provides style guidance for preserving the style of the hand. To alleviate the mutual interference between style and structure guidance, we introduce a two-stage training strategy and build a series of multi-style hand datasets. In the first stage, we use paired hand images for training to ensure stylistic consistency in hand refining. In the second stage, various hand images generated based on human meshes are used for training, enabling the model to gain control over the hand structure. Experimental results demonstrate that RHanDS can effectively refine hand structure while preserving consistency in hand style.
