
Prompt-Propose-Verify: A Reliable Hand-Object-Interaction Data Generation Framework using Foundational Models (2312.15247v1)

Published 23 Dec 2023 in cs.CV and cs.AI

Abstract: Diffusion models, when conditioned on text prompts, generate realistic-looking images with intricate details. However, most of these pre-trained models fail to generate accurate images of human features such as hands and teeth. We hypothesize that this limitation can be overcome with well-annotated, good-quality data. In this paper, we focus specifically on improving hand-object-interaction image generation with diffusion models. We collect a well-annotated synthetic hand-object-interaction dataset curated using the Prompt-Propose-Verify framework and fine-tune a Stable Diffusion model on it. We evaluate the image-text dataset on qualitative and quantitative metrics such as CLIPScore, ImageReward, fidelity, and alignment, and show considerably better performance than current state-of-the-art benchmarks.

