- The paper introduces prompt-adaptive T2I workflows through two methods—ComfyGen-IC and ComfyGen-FT—that tailor generation processes to user prompts.
- The paper demonstrates that ComfyGen-FT outperforms traditional monolithic models in both prompt alignment and image quality, validated by GenEval and human preference metrics.
- The paper outlines a scalable, modular approach using ComfyUI and JSON-based workflows, paving the way for diverse and flexible generative applications.
An Expert Overview of "ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation"
The paper "ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation" by Rinon Gal et al. confronts the dynamically evolving landscape of text-to-image (T2I) generation. It addresses a notable transition from utilizing simple, monolithic models to intricate, multi-component workflows. The necessity for this shift stems from the observation that prompt-specific adjustments to workflows can substantially improve the quality of generated images.
Summary
This research proposes prompt-adaptive workflow generation, which automates the synthesis of T2I workflows tailored to each user prompt. It introduces two distinct methods:
- ComfyGen-IC (In-Context): Utilizes a closed-source LLM to match prompts with suitable workflows derived from a precompiled table of flow performance across various categories.
- ComfyGen-FT (Fine-Tuned): Involves fine-tuning an open LLM to predict workflows based on input prompts and target scores derived from training data.
Methodology
Central to their approach is ComfyUI, a flexible and extensible framework for designing generative workflows in a modular, node-based fashion. ComfyUI workflows are saved as JSON files, a format that contemporary LLMs can readily read and emit; a minimal example follows.
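To make this concrete, here is a minimal sketch of a text-to-image graph in ComfyUI's API-style JSON format. The node types (CheckpointLoaderSimple, KSampler, and so on) are standard ComfyUI nodes, but the graph itself is illustrative; the workflows the paper considers are considerably more elaborate, combining extra models, LoRAs, and upscalers.

```python
import json

# Minimal illustrative ComfyUI graph in the API-style JSON format:
# each key is a node id, each value names a node class and wires its
# inputs to (source_node_id, output_index) pairs.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",            # positive prompt
          "inputs": {"text": "a watercolor fox in a forest", "clip": ["1", 1]}},
    "3": {"class_type": "CLIPTextEncode",            # negative prompt
          "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0],
                     "negative": ["3", 0], "latent_image": ["4", 0],
                     "seed": 42, "steps": 30, "cfg": 7.0,
                     "sampler_name": "dpmpp_2m", "scheduler": "karras",
                     "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "comfygen"}},
}
print(json.dumps(workflow, indent=2))
```

Because the whole graph is plain JSON, an LLM can both read existing workflows and emit new ones token by token.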
The authors assemble a dataset of workflows and prompts covering a diverse set of T2I generation scenarios, and score the images each workflow produces using a combination of aesthetic predictors and human-preference models. The result is a collection of (prompt, flow, score) triplets from which prompt-flow associations can be learned; a sketch of this collection step appears below.
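The paper does not spell out its data-collection code, so the following is a hedged sketch under stated assumptions: `run_workflow` and `ensemble_score` are hypothetical stand-ins for executing a ComfyUI graph and for the combination of aesthetic and human-preference predictors.

```python
# Hedged sketch of assembling the training triplets. `run_workflow` and
# `ensemble_score` are hypothetical stand-ins, not APIs from the paper.
from itertools import product

def build_triplets(prompts, flows, run_workflow, ensemble_score):
    triplets = []
    for prompt, flow in product(prompts, flows):
        image = run_workflow(flow, prompt)     # render the prompt through the flow
        score = ensemble_score(image, prompt)  # aesthetic + preference ensemble
        triplets.append({"prompt": prompt, "flow": flow, "score": score})
    return triplets
```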
ComfyGen-IC
This approach provides the LLM with a table of each flow's performance over a set of predefined categories. For a novel prompt, the model identifies the categories the prompt belongs to and selects the highest-scoring workflow accordingly, as in the sketch below.
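A minimal sketch of this selection step, assuming an OpenAI-style chat client. The table format, prompt template, and model name are all assumptions; the paper only states that a closed-source LLM is given per-category flow scores and asked to pick a workflow.

```python
# Sketch of in-context workflow selection; `client` is assumed to be an
# OpenAI-style chat client, and the prompt template is illustrative.
import json

def pick_flow_in_context(client, category_table, user_prompt):
    """category_table: {flow_id: {category: mean_score}}, precomputed offline."""
    system = ("You are given per-category scores for text-to-image workflows. "
              "Decide which categories the user's prompt belongs to, then "
              "reply with only the id of the best-scoring workflow.")
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": json.dumps(category_table)},
        {"role": "user", "content": f"Prompt: {user_prompt}"},
    ]
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    return reply.choices[0].message.content.strip()
```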
ComfyGen-FT
In contrast, ComfyGen-FT fine-tunes an open LLM to predict a suitable workflow given a prompt and a target score. During inference, a high target score is supplied to steer the model toward high-quality flows for new prompts; the input formatting might look like the sketch below.
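A minimal sketch of score-conditioned fine-tuning data, assuming a simple instruction-style format. The field names and the 0-1 score scale are assumptions, not the paper's exact scheme; the idea it illustrates is the paper's: the model learns to emit the workflow that achieved the stated score, so asking for a high score at test time elicits strong flows.

```python
# Sketch of score-conditioned training data; field names and score scale
# are assumptions for illustration.
def to_training_example(prompt, flow_json, score):
    return {
        "input": f"Prompt: {prompt}\nTarget score: {score:.2f}\nWorkflow:",
        "output": flow_json,  # the serialized ComfyUI graph as a JSON string
    }

def to_inference_input(prompt, target_score=0.95):
    # A high target score at inference steers the model toward better flows.
    return f"Prompt: {prompt}\nTarget score: {target_score:.2f}\nWorkflow:"
```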
Experimental Evaluation
The efficacy of ComfyGen was rigorously evaluated against multiple baselines:
- Monolithic Models: Baselines included popular models like SDXL, JuggernautXL, DreamShaperXL, and a DPO-optimized SDXL variant.
- Constant Workflows: The most popular pre-defined ComfyUI workflows, applied uniformly to every prompt.
The authors use the GenEval benchmark to evaluate prompt alignment and a selection of 500 prompts from CivitAI to assess image quality. Both human preference studies and an automatic scoring metric (HPS v2.0) validate the performance improvements.
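As a hedged sketch of the automatic scoring side, the HPSv2 package (https://github.com/tgxs002/HPSv2) exposes a `score` call along the lines below; the exact call signature follows its README, and the image/prompt pairs here stand in for the 500 CivitAI evaluation prompts.

```python
# Hedged sketch of automatic quality scoring with the HPSv2 package;
# the call signature is taken from the package README and may differ
# across versions.
import hpsv2

def score_generations(image_paths, prompts):
    return [hpsv2.score(path, prompt, hps_version="v2.0")
            for path, prompt in zip(image_paths, prompts)]
```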
Results
ComfyGen-FT outperformed both the monolithic models and the constant workflows on the prompt-alignment and image-quality benchmarks. Notably, it produced diverse, well-matched workflows, adapting robustly to prompts across different domains and styles.
Analysis and Implications
The analysis revealed several insights into the selection and optimization of workflows:
- Diversity and Originality: ComfyGen-FT showed greater variety in workflow selection compared to ComfyGen-IC, indicating a capacity for more nuanced prompt-adaptive synthesis.
- Performance Patterns: Using TF-IDF ranking, the authors identified human-interpretable patterns in workflow component selection that align with domain-specific attributes of the prompts (see the sketch after this list).
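A sketch of the kind of TF-IDF analysis described: treat the node and model names appearing in each domain's selected workflows as one "document," and rank which components are most distinctive per domain. The input structure is an assumption for illustration.

```python
# Sketch of TF-IDF ranking over workflow components per prompt domain.
# `domain_to_components` maps a domain name to the list of component
# names used by workflows chosen for that domain (illustrative input).
from sklearn.feature_extraction.text import TfidfVectorizer

def distinctive_components(domain_to_components, top_k=5):
    domains = list(domain_to_components)
    docs = [" ".join(domain_to_components[d]) for d in domains]
    tfidf = TfidfVectorizer()            # default tokenization; fine for a sketch
    matrix = tfidf.fit_transform(docs)
    vocab = tfidf.get_feature_names_out()
    return {
        d: [vocab[i] for i in matrix[j].toarray()[0].argsort()[::-1][:top_k]]
        for j, d in enumerate(domains)
    }
```

Components with high TF-IDF weight for a domain (e.g., a particular upscaler or LoRA) are those the model selects disproportionately often for that domain's prompts.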
Future Directions
The research outlines paths for future exploration:
- Scalability: Enhancing the approach to efficiently scale with larger prompt and workflow datasets could further refine prompt-specific generation.
- Generative Flexibility: Extending the scope beyond T2I generation to include image-to-image or video generation workflows.
- Collaborative Agents: Developing intermediary methods to enable real-time feedback and iterative enhancement of workflows through human-LLM interaction.
Conclusion
This paper marks a significant advance for T2I generative models through prompt-adaptive workflows. The two approaches, ComfyGen-IC and ComfyGen-FT, underscore the importance of adapting workflows to prompt-specific contexts, pointing the way toward more intelligent, quality-driven image generation systems in future AI research.