- The paper introduces the Human Artifact Dataset (HAD) of over 37,000 annotated images to detect local and global human structural errors.
- The proposed Human Artifact Detection Models (HADM) outperform state-of-the-art vision-language models and generalize to unseen text-to-image generators.
- Integrating artifact detection with iterative inpainting and finetuning diffusion models significantly improves image fidelity and human anatomical coherence.
Detecting Human Artifacts from Text-to-Image Models
The paper "Detecting Human Artifacts from Text-to-Image Models" by Kaihong Wang et al. extensively explores the challenge of identifying and minimizing human artifacts in images produced by text-to-image generative models. Despite significant strides in text-to-image models, issues persist particularly in generating human figures, which frequently exhibit distorted, missing, or extra body parts. These artifacts not only impair visual fidelity but also contradict typical human anatomical structures.
The authors tackle these issues by introducing the Human Artifact Dataset (HAD), the first large-scale dataset constructed specifically to identify and localize human artifacts. HAD comprises over 37,000 images generated by popular text-to-image models, including SDXL, DALLE-2, DALLE-3, and Midjourney, each annotated with artifact locations. The dataset distinguishes two primary categories of human artifacts: local artifacts, in which specific body parts are rendered poorly, and global artifacts, which involve broader anatomical inconsistencies such as additional or missing limbs or other features.
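To make the local/global distinction concrete, here is a minimal sketch of what a HAD-style annotation record could look like. The dataset's actual file format, field names, and label vocabulary are not specified in this summary, so everything below is an illustrative assumption.

```python
# Hypothetical sketch of a HAD-style annotation record; the dataset's real
# schema, field names, and label set are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ArtifactBox:
    """One annotated artifact region in a generated image."""
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    scope: str   # "local" (a poorly rendered body part) or "global" (extra/missing limbs, etc.)
    label: str   # e.g. "distorted_hand", "extra_limb" -- illustrative names only


@dataclass
class HADSample:
    """One generated image with its artifact annotations."""
    image_path: str
    generator: str                       # e.g. "SDXL", "DALLE-3", "Midjourney"
    prompt: str
    artifacts: List[ArtifactBox] = field(default_factory=list)


# Example record (all values are illustrative):
sample = HADSample(
    image_path="had/images/000123.png",
    generator="SDXL",
    prompt="a person waving at the camera",
    artifacts=[ArtifactBox(bbox=(412.0, 300.5, 470.0, 362.0),
                           scope="local", label="distorted_hand")],
)
print(len(sample.artifacts), "annotated artifact(s)")
```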
Using this dataset, the authors train Human Artifact Detection Models (HADM), which exhibit robust performance in detecting a wide range of human artifacts. Evaluation results indicate that HADM not only identify artifacts from different generative models, including those not seen during training, but also outperform existing state-of-the-art vision-language (VL) models on this specialized task. Notably, detection performance is consistently higher on generators with a greater prevalence of artifacts, such as SDXL, than on more advanced models like DALLE-3 and Midjourney, whose artifacts tend to be subtler.
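The following sketch shows what running such a detector over generated images might look like in practice. HADM's actual architecture and weights are not described in this summary, so a stock torchvision Faster R-CNN is used purely as a stand-in to illustrate the box/label/score interface an artifact detector would expose; the 0.5 confidence threshold is an arbitrary illustrative choice.

```python
# Minimal detection-inference sketch using a stand-in detector (not HADM itself).
import torch
from torchvision.io import read_image, ImageReadMode
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # stand-in weights, not HADM
model.eval()

# Load one generated image (hypothetical path) and convert to float in [0, 1].
image = read_image("generated_sample.png", mode=ImageReadMode.RGB)
image = convert_image_dtype(image, torch.float)

with torch.no_grad():
    (prediction,) = model([image])  # list of dicts with "boxes", "labels", "scores"

# Keep only confident detections.
keep = prediction["scores"] > 0.5
for box, label, score in zip(prediction["boxes"][keep],
                             prediction["labels"][keep],
                             prediction["scores"][keep]):
    print(f"label={label.item()} score={score.item():.2f} box={box.tolist()}")
```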
Furthermore, the paper uses these detection models to guide improvements in the generators themselves. The authors harness HADM predictions to finetune diffusion models, introducing special identifiers for detected artifacts during training. This finetuning step, validated through user preference studies and quantitative assessments, leads to a noticeable reduction in human artifacts, improving human structural coherence while maintaining image quality. In addition, HADM can drive an iterative inpainting framework that repeatedly detects structural inconsistencies and repairs the affected regions through guided inpainting, as sketched below.
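Here is a sketch of that detect-mask-inpaint loop. The `detect_artifacts` helper is a hypothetical stand-in for HADM inference, and the inpainting model, prompt handling, and stopping criterion below are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch of an iterative "detect -> mask -> inpaint" repair loop (assumptions noted above).
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline


def detect_artifacts(image: Image.Image) -> list[tuple[float, float, float, float]]:
    """Hypothetical stand-in for HADM inference: returns artifact boxes (x0, y0, x1, y1)."""
    raise NotImplementedError("plug in an artifact detector here")


def boxes_to_mask(size: tuple[int, int], boxes) -> Image.Image:
    """Rasterize detected boxes into a binary inpainting mask (white = region to repaint)."""
    mask = Image.new("L", size, 0)
    draw = ImageDraw.Draw(mask)
    for box in boxes:
        draw.rectangle(box, fill=255)
    return mask


pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("generated_sample.png").convert("RGB")  # hypothetical input image
prompt = "a person waving at the camera"                   # original generation prompt

for _ in range(3):  # cap the number of repair rounds (arbitrary choice)
    boxes = detect_artifacts(image)
    if not boxes:
        break  # no remaining artifacts detected
    mask = boxes_to_mask(image.size, boxes)
    image = pipe(prompt=prompt, image=image, mask_image=mask).images[0]

image.save("repaired_sample.png")
```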
The work by Wang et al. has significant implications. Practically, it improves the fidelity of content generated by text-to-image models, which matters for applications requiring high visual accuracy, such as virtual reality, digital art, and media content creation. Theoretically, the study contributes to the broader effort to improve the structural coherence of neural model outputs, addressing longstanding challenges in synthesizing human forms.
Looking ahead, detection models could be refined to cover more nuanced artifact categories or to integrate multi-scale detection architectures for finer granularity. Cross-modal training paradigms that leverage unsupervised or semi-supervised methods could also reduce dependence on large annotated datasets. Ultimately, advancing human artifact detection and mitigation will remain pivotal as AI-driven content generation becomes further integrated into creative and commercial workflows.