Point and Instruct: Enabling Precise Image Editing by Unifying Direct Manipulation and Text Instructions (2402.07925v1)

Published 5 Feb 2024 in cs.AI and cs.HC

Abstract: Machine learning has enabled the development of powerful systems capable of editing images from natural language instructions. However, in many common scenarios it is difficult for users to specify precise image transformations with text alone. For example, in an image with several dogs, it is difficult to select a particular dog and move it to a precise location. Doing this with text alone would require a complex prompt that disambiguates the target dog and describes the destination. However, direct manipulation is well suited to visual tasks like selecting objects and specifying locations. We introduce Point and Instruct, a system for seamlessly combining familiar direct manipulation and textual instructions to enable precise image manipulation. With our system, a user can visually mark objects and locations, and reference them in textual instructions. This allows users to benefit from both the visual descriptiveness of natural language and the spatial precision of direct manipulation.
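To make the core idea concrete, below is a minimal, hypothetical sketch (not the paper's actual implementation) of how user-placed visual marks might be resolved into an unambiguous instruction for a downstream instruction-following image editor. The `Mark` class, the `@`-label reference syntax, and the `resolve_instruction` helper are all illustrative assumptions.

```python
# Hypothetical sketch: combining direct-manipulation marks with a text
# instruction. Not the paper's implementation; names and syntax are assumed.
from dataclasses import dataclass


@dataclass
class Mark:
    label: str  # name the user can reference in text, e.g. "A"
    x: float    # normalized image coordinates in [0, 1]
    y: float


def resolve_instruction(instruction: str, marks: list[Mark]) -> str:
    """Replace mark references like @A with explicit coordinates, so the
    instruction no longer depends on ambiguous textual description."""
    for mark in marks:
        instruction = instruction.replace(
            f"@{mark.label}", f"the object at ({mark.x:.2f}, {mark.y:.2f})"
        )
    return instruction


marks = [Mark("A", 0.21, 0.64), Mark("B", 0.78, 0.30)]
print(resolve_instruction("Move @A to @B", marks))
# -> Move the object at (0.21, 0.64) to the object at (0.78, 0.30)
```

The resolved string could then be handed to any instruction-following editor. The real system presumably grounds marks more deeply in the model's input, but the sketch captures the division of labor the abstract describes: direct manipulation supplies spatial precision, while text supplies descriptive intent.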

