HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances (2403.01693v3)

Published 4 Mar 2024 in cs.CV and cs.AI

Abstract: Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses and shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings into the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model that generates SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model that synthesizes images by conditioning on the prompts and the hand parameters generated by the first component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations, and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands.
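The two-stage design described in the abstract can be sketched as toy Python. Everything below is an illustrative placeholder, not the authors' code: the function names, the neutral parameter values, and the parameter dimensions (72 for SMPL body pose, 45 for MANO hand articulation are the standard sizes of those models, but how HanDiffuser packs them is an assumption).

```python
from dataclasses import dataclass, field

@dataclass
class HandParams:
    # SMPL body pose and MANO hand articulation parameters.
    # Dimensions follow the standard SMPL (72-D axis-angle pose) and
    # MANO (45-D articulation) conventions; packing is illustrative.
    smpl_pose: list = field(default_factory=lambda: [0.0] * 72)
    mano_pose: list = field(default_factory=lambda: [0.0] * 45)

def text_to_hand_params(prompt: str) -> HandParams:
    # Stage 1 (placeholder): the Text-to-Hand-Params diffusion model
    # would iteratively denoise SMPL/MANO parameters conditioned on the
    # prompt. Here we just return a neutral parameter set.
    return HandParams()

def hand_params_to_image(prompt: str, params: HandParams) -> dict:
    # Stage 2 (placeholder): the Text-Guided Hand-Params-to-Image
    # diffusion model conditions on BOTH the text prompt and the hand
    # parameters from stage 1. We return a dict standing in for an image.
    return {
        "prompt": prompt,
        "conditioning_dims": len(params.smpl_pose) + len(params.mano_pose),
    }

def handiffuser_pipeline(prompt: str) -> dict:
    # Chain the two stages: text -> hand parameters -> image.
    params = text_to_hand_params(prompt)
    return hand_params_to_image(prompt, params)

result = handiffuser_pipeline("a person waving with an open palm")
print(result["conditioning_dims"])  # 72 + 45 = 117
```

The point of the sketch is the data flow: stage 2 never sees raw text alone, it always receives explicit 3D hand parameters as an additional conditioning signal, which is what the paper credits for the improved hand realism.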
