- The paper introduces a language-driven method that iteratively creates semantically meaningful sketches using off-the-shelf multimodal LLMs without task-specific training.
- It employs a novel sketching language and grid-based canvas to translate textual inputs into vector graphics via cubic Bezier curves, achieving near human-level CLIP-based recognition accuracy.
- The study demonstrates a collaborative human-AI sketch editing process, highlighting its potential in creative applications while acknowledging challenges with detailed human figures.
SketchAgent: Language-Driven Sequential Sketch Generation
The paper introduces SketchAgent, an approach to language-driven, sequential sketch generation that requires no task-specific training or fine-tuning. Leveraging the capabilities of off-the-shelf multimodal LLMs, SketchAgent generates, refines, and edits sketches through dynamic interaction grounded in textual input.
SketchAgent addresses a distinctive gap in sketch generation: replicating the dynamic, sequential, and iterative nature of human sketching, which traditional static, dataset-dependent methods miss. Unlike existing methods that generate an image from text in a single attempt, SketchAgent constructs a sketch stroke by stroke, so each stroke carries semantic meaning derived from the language instructions. This iterative approach preserves visual coherence and supports meaningful real-time interaction with human users.
At its core, SketchAgent introduces a novel sketching language and a grid-based canvas that strengthen the agent's spatial reasoning, which is typically limited in existing multimodal LLMs. Strokes are specified as sequences of points on the grid and rendered as vector graphics by smoothly interpolating those points with cubic Bezier curves, bridging the gap between textual instruction and visual execution.
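To make the stroke representation concrete, here is a minimal sketch of how four grid coordinates could be rendered as a single cubic Bezier stroke. The grid resolution, coordinate format, and sampling density are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np
import matplotlib.pyplot as plt

GRID_SIZE = 50  # assumed canvas resolution; the paper's exact grid may differ

def cubic_bezier(p0, p1, p2, p3, n=100):
    """Sample a cubic Bezier curve defined by four control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

# Hypothetical stroke: four grid coordinates an agent might emit for one stroke.
stroke = np.array([[10, 10], [15, 35], [30, 40], [45, 20]], dtype=float)

curve = cubic_bezier(*stroke)  # shape (100, 2)
plt.plot(curve[:, 0], curve[:, 1])
plt.xlim(0, GRID_SIZE)
plt.ylim(0, GRID_SIZE)
plt.gca().set_aspect("equal")
plt.show()
```

A full sketch would simply be a sequence of such strokes rendered onto the same canvas in the order the agent emits them.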
Additionally, the paper thoroughly evaluates SketchAgent on text-conditioned sketch generation, including concepts beyond predefined sketch datasets, such as scientific diagrams and notable landmarks. Notably, SketchAgent outperforms several baseline models in CLIP-based recognition accuracy and approaches human-level performance, demonstrating its effectiveness at generating contextually accurate sketches.
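For readers unfamiliar with the metric, CLIP-based recognition accuracy is presumably computed along the following lines: each rendered sketch is scored against text prompts for the candidate categories, and it counts as recognized when its target category ranks first. The CLIP variant, prompt template, and category list below are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative setup: model choice and labels are assumptions, not the paper's.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

categories = ["cat", "house", "bicycle", "sailboat"]  # hypothetical label set
prompts = [f"a sketch of a {c}" for c in categories]

def recognized(sketch_path: str, target: str) -> bool:
    """Return True if CLIP ranks the target category first for this sketch."""
    image = Image.open(sketch_path).convert("RGB")
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(categories))
    return categories[logits.argmax().item()] == target

# Accuracy over a labeled set: sum(recognized(p, c) for p, c in pairs) / len(pairs)
```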
A further novel aspect of SketchAgent is collaborative sketching with human participants. The paper shows that sketches produced jointly by SketchAgent and human users achieve high recognition accuracy, combining the agent's prior knowledge with the user's creative intent. This form of human-machine interaction has practical implications in creative industries, education, and design, where iterative and collaborative sketching is valued.
Furthermore, the chat-based sketch editing feature demonstrates SketchAgent's ability to make fine adjustments and additions to a sketch in response to textual instructions, enabling an iterative refinement process that mirrors how human artists revise their work.
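A plausible shape for this editing loop is sketched below, assuming a hypothetical `llm_chat` wrapper around the multimodal LLM; the message format and one-stroke-per-line convention are illustrative assumptions, not the paper's interface.

```python
# Hedged sketch of a chat-based editing loop. `llm_chat` is a hypothetical
# wrapper around an off-the-shelf multimodal LLM, not a real library call.

def llm_chat(history: list[dict]) -> str:
    """Placeholder for a call to a multimodal LLM chat API."""
    raise NotImplementedError

def edit_sketch(strokes: list[str], instruction: str,
                history: list[dict]) -> list[str]:
    """Ask the model to revise the current sketch per a textual instruction."""
    stroke_text = "\n".join(strokes)
    history.append({
        "role": "user",
        "content": f"Current strokes:\n{stroke_text}\n\nEdit request: {instruction}",
    })
    reply = llm_chat(history)
    history.append({"role": "assistant", "content": reply})
    return reply.splitlines()  # assumed: one stroke per line in the sketching language

# Usage: strokes = edit_sketch(strokes, "make the roof taller", history)
```

Keeping the full chat history in the loop is what lets each edit build on the sketch's prior state rather than regenerating it from scratch.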
However, the paper acknowledges limitations, such as difficulty rendering detailed human figures and overfitting to the in-context learning (ICL) prompts. These limitations point to avenues for future work, including strengthening the visual reasoning capabilities of LLMs and extending the pipeline to a wider range of complex visual tasks.
In conclusion, SketchAgent exemplifies a significant step forward in leveraging multimodal LLMs for sketch generation. By embracing the dynamic nature of human sketching, it opens new avenues for AI applications in creative domains and sets a foundation for future advancements in AI-driven visual communication and interaction. The research thus holds potential to inspire further developments in user-centered AI tools that support creative expression and collaboration across diverse fields.