- The paper presents HandRefiner, a novel technique that refines malformed hands in generated images by integrating hand mesh reconstruction with diffusion-based conditional inpainting.
- The approach uses MediaPipe for hand localization and Mesh Graphormer for depth map estimation to guide an inpainting process that preserves non-hand details.
- Experiments demonstrate significant improvements, including FID reductions of more than 3 and 10 points on the HAGRID and FreiHAND datasets, respectively, alongside higher user preference for the rectified images.
HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting
The paper under review addresses a significant issue in diffusion-based image generation: human hands, with their complex structure and diverse poses, often appear malformed in images produced by state-of-the-art models such as Stable Diffusion and SDXL. The authors present HandRefiner, a novel post-processing method that rectifies these malformed hands while preserving the integrity of the rest of the image.
The primary motivation stems from the limitations of current diffusion models in rendering realistic human hands. With 16 joints and 27 degrees of freedom, hands are difficult to depict accurately; articulated poses and frequent self-occlusions further complicate learning, leading to errors such as incorrect finger counts and irregular shapes. HandRefiner takes a conditional inpainting approach, using hand mesh reconstruction to guide the correction process.
Methodology
The proposed HandRefiner framework is composed of two primary stages:
- Hand Mesh Reconstruction:
- The initial image, which contains malformed hands, is processed with MediaPipe to automatically localize the hand regions (a localization sketch follows this list).
- A state-of-the-art hand mesh reconstruction model, Mesh Graphormer, then estimates the hand mesh and renders a corresponding depth map. Because the reconstructed mesh is anatomically plausible by construction, the depth map supplies correct hand shape and structure as guidance.
- Inpainting:
- The hand regions are masked out of the original image, and a Stable Diffusion inpainting model, integrated with ControlNet, conditions on the hand depth map to regenerate them.
- A masking strategy similar to RePaint keeps the hand consistent with the rest of the image: at each DDIM sampling step, the hand region is updated by the model while the non-hand regions are re-injected from the original image (see the masked-update sketch below).
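For concreteness, here is a minimal sketch of the localization step using MediaPipe's Python hands API. The bounding-box masking and the padding value are illustrative choices rather than the paper's exact procedure, and the depth map itself would come from running Mesh Graphormer on the detected hand crops, which is not reproduced here.

```python
import cv2
import mediapipe as mp
import numpy as np

def hand_mask(image_bgr: np.ndarray, pad: int = 30) -> np.ndarray:
    """Localize hands with MediaPipe and return a binary inpainting mask."""
    h, w = image_bgr.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
        result = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
        for hand in result.multi_hand_landmarks or []:
            xs = [lm.x * w for lm in hand.landmark]
            ys = [lm.y * h for lm in hand.landmark]
            # pad the landmark bounding box so the whole hand is masked
            x0, x1 = max(int(min(xs)) - pad, 0), min(int(max(xs)) + pad, w)
            y0, y1 = max(int(min(ys)) - pad, 0), min(int(max(ys)) + pad, h)
            mask[y0:y1, x0:x1] = 255
    return mask
```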
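The RePaint-style blending can likewise be written compactly. The sketch below shows a single deterministic DDIM step (eta = 0) in which the model's prediction is kept only inside the hand mask, while the rest of the latent is re-injected from a noised copy of the original image. Variable names are mine, and `eps` is assumed to come from the ControlNet-conditioned UNet; this is a sketch of the mechanism, not the authors' code.

```python
import torch

def masked_ddim_step(x_t, x0_known, mask, abar_t, abar_prev, eps):
    """One RePaint-style DDIM update restricted to the hand region.

    x_t      : current noisy latent
    x0_known : clean latent of the original image
    mask     : 1 inside the hand region to regenerate, 0 elsewhere
    abar_*   : cumulative alpha-bar products at the current/previous step
    eps      : noise predicted by the depth-conditioned UNet
    """
    # deterministic DDIM update (eta = 0) for the regenerated branch
    x0_pred = (x_t - (1 - abar_t) ** 0.5 * eps) / abar_t ** 0.5
    x_prev_gen = abar_prev ** 0.5 * x0_pred + (1 - abar_prev) ** 0.5 * eps

    # diffuse the known image to the same noise level for the kept branch
    x_prev_known = (abar_prev ** 0.5 * x0_known
                    + (1 - abar_prev) ** 0.5 * torch.randn_like(x0_known))

    # blend: regenerate the hands, preserve everything else
    return mask * x_prev_gen + (1.0 - mask) * x_prev_known
```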
Significantly, the paper uncovers a phase transition phenomenon in ControlNet as the control strength is varied. This finding is pivotal: it makes it possible to train on synthetic hand data without a domain gap emerging between synthetic and realistic hands, because the control strength can be set to balance structural accuracy against texture realism in the generated hands.
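Mechanically, the control strength is just a scalar that scales the ControlNet residuals before they are added to the UNet's features, which is what makes a continuous sweep (and hence the observed phase transition) possible. A one-line sketch of that scaling, with hypothetical feature lists:

```python
# s near 1.0 pushes the output toward the depth map's structure (at the
# risk of synthetic-looking texture); smaller s trades structure for realism.
def apply_control(unet_feats, control_residuals, s: float):
    """Scale ControlNet residuals by strength s before injecting them."""
    return [f + s * r for f, r in zip(unet_feats, control_residuals)]
```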
Results and Discussion
The authors present comprehensive experiments to validate the efficacy of HandRefiner, evaluating on the HAGRID and FreiHAND datasets. Key metrics include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Mean Per Joint Position Error (MPJPE; a sketch of this metric follows the findings below). The following key findings were observed:
- HandRefiner significantly improved the generation quality of hands, reducing FID by over 3 points and achieving a 1% improvement in keypoint detection confidence on the HAGRID dataset.
- On the FreiHAND dataset, HandRefiner achieved an improvement in FID by over 10 points, further emphasizing its effectiveness in hand-specific scenarios.
- Subjective evaluations through user surveys indicated that 87 out of 100 rectified images were preferred over the original images, validating the human-perceived improvements.
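Of the three metrics, MPJPE is simple enough to state directly: the mean Euclidean distance between corresponding predicted and ground-truth joints. A minimal implementation, assuming already-aligned joint arrays of shape (J, 2) or (J, 3):

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Per Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints (pixels in 2D, millimetres in 3D)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```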
Furthermore, the adaptive control strength strategy outperformed fixed-strength baselines by tuning the balance between structural conformity and texture realism per image. The trade-off is computational: each additional candidate strength requires another inpainting pass, which matters for practical deployment (one plausible realization is sketched below).
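One plausible way to realize the adaptive strategy is a simple search: run the inpainting at several candidate strengths, score each result for hand plausibility, and keep the best. The sketch below uses hypothetical `inpaint` and `score` callables (for example, `score` could measure keypoint agreement with the target mesh), and the candidate grid is illustrative rather than the paper's; it is meant only to make the cost trade-off concrete, since every candidate is a full inpainting pass.

```python
from typing import Callable, Sequence

def refine_adaptive(image, mask, depth,
                    inpaint: Callable, score: Callable,
                    strengths: Sequence[float] = (0.4, 0.55, 0.7, 0.85, 1.0)):
    """Try several ControlNet strengths and keep the best-scoring result.

    inpaint(image, mask, depth, s) -> candidate image (hypothetical)
    score(candidate, mask)         -> higher is better (hypothetical)
    """
    best, best_q = None, float("-inf")
    for s in strengths:
        candidate = inpaint(image, mask, depth, s)
        q = score(candidate, mask)
        if q > best_q:
            best, best_q = candidate, q
    return best
```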
Implications and Future Work
HandRefiner provides a robust post-processing solution for improving the realism of human hands in generated images, advancing the performance of diffusion-based models. It demonstrates that incorporating domain-specific priors, such as hand meshes, can significantly enhance generative outputs by providing more precise structural guidance.
From a theoretical perspective, uncovering the phase transition phenomenon within ControlNet opens new avenues for controlling generative models through adaptive control strengths. This insight can be generalized beyond hand generation to improve the quality of generative models conditioned on various control signals.
Future developments could extend HandRefiner's capability to generate interacting hands and more complex hand gestures, which are currently challenging due to reconstruction difficulties and data limitations. Additionally, adapting HandRefiner to work with newer models like DiT and exploring its application in other domains like animal image generation could further validate its versatility and impact.
In conclusion, HandRefiner marks a significant step toward refining specific object details in generated images, leveraging diffusion models' potential while addressing their current limitations. It lays a foundation for future work on improving generative AI's fidelity, encouraging the development of more sophisticated post-processing techniques in the field.