Dual-Process Image Generation (2506.01955v1)

Published 2 Jun 2025 in cs.CV, cs.CL, and cs.LG

Abstract: Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes. Project page: https://dual-process.github.io.

Summary

  • The paper introduces a dual-process distillation method that integrates feed-forward generators with deliberative vision-language models to tackle control tasks.
  • It employs gradient-based distillation with low-rank adaptation, achieving at least a 20% improvement in commonsense reasoning and spatial accuracy over baselines.
  • The approach leverages off-the-shelf models to quickly adapt to diverse control tasks, paving the way for more integrated and versatile AI systems.

Dual-Process Image Generation: Synthesis of Multimodal Control

The paper, "Dual-Process Image Generation," introduces an innovative method that leverages the strengths of Vision-LLMs (VLMs) to enhance the functionality and task adaptability of image generation models. This research is focused on overcoming the limitations of existing image generators, which struggle with learning new tasks and executing controllable outputs. The authors propose a dual-process distillation method that combines feed-forward image generators with deliberative VLMs. This approach enables image generators to learn a diverse array of control tasks seamlessly.

The dual-process architecture draws inspiration from cognitive science, adopting a structure analogous to "System 1" and "System 2" thinking: a fast, reflexive process and a slow, deliberative process, respectively. The VLM acts as the deliberative component, providing judgments that inform the image generator's reflexive outputs. This integration allows image generators to inherit the VLM's capacity to process multimodal inputs and make contextually informed decisions.

A significant aspect of this work is that it uses off-the-shelf models, eliminating the need for specialized retraining. The authors employ a gradient-based distillation scheme that backpropagates feedback from the VLM into the image generator. This is made practical through low-rank adaptation (LoRA), which updates a small set of the generator's weights based on the VLM's evaluations.
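To make this mechanism concrete, below is a minimal PyTorch-style sketch of a single distillation step, assuming hypothetical `generator` and `vlm_critic` objects with differentiable `generate` and `score` interfaces; the class names, method signatures, question wording, and hyperparameters are illustrative assumptions, not the authors' released code.

```python
import torch


def distillation_step(generator, vlm_critic, optimizer, prompt, control_question):
    """One dual-process distillation step (illustrative sketch).

    The feed-forward generator plays the fast "System 1" role; the VLM plays
    the slow "System 2" critic whose judgment is backpropagated into the
    generator's LoRA parameters. Both interfaces here are hypothetical.
    """
    # Fast pass: generate an image from the text prompt (kept differentiable).
    image = generator.generate(prompt)

    # Deliberative pass: the VLM rates how well the image satisfies the control
    # instruction; `score` is assumed to return the log-probability of "yes".
    rating = vlm_critic.score(image, control_question)

    # Maximize the VLM's rating by backpropagating through the VLM into the
    # generator. Only LoRA adapter weights are registered in the optimizer,
    # so the base generator stays frozen and adaptation remains lightweight.
    loss = -rating
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage sketch: optimize only the LoRA adapter parameters of the generator.
# lora_params = [p for n, p in generator.named_parameters() if "lora" in n]
# optimizer = torch.optim.AdamW(lora_params, lr=1e-4)
# for _ in range(200):
#     distillation_step(generator, vlm_critic, optimizer,
#                       prompt="a coastal landscape at dusk",
#                       control_question="Is the horizon in the upper third of the frame?")
```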

Key empirical results illustrate the versatility of the dual-process approach. The authors showcase various control tasks, such as altering color palettes, line weight, and horizon position, and enforcing relative depth, a task that exercises the model's spatial reasoning. These controls can be implemented rapidly, making them accessible even to users with limited computational resources. The method is validated against baselines and substantially improves the commonsense understanding and physical accuracy of generated images, with gains of at least 20% over baseline models.
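As an illustration of how such controls might be expressed through the paper's text-and-image interface, the snippet below pairs each property with a yes/no question a VLM critic could be asked about the generated image; the exact phrasing is an assumption made for illustration, not taken from the paper.

```python
# Hypothetical control tasks phrased as VLM questions about the generated
# image; each question's "yes" log-probability would serve as the reward in
# the distillation loop sketched above. The wording is illustrative only.
control_tasks = {
    "color_palette":  "Is the image dominated by warm, earthy tones?",
    "line_weight":    "Are the outlines drawn with thick, bold strokes?",
    "horizon":        "Is the horizon located in the upper third of the frame?",
    "relative_depth": "Is the tree closer to the camera than the house?",
}

# Example: adapt the generator for the horizon-placement control.
# for _ in range(200):
#     distillation_step(generator, vlm_critic, optimizer,
#                       prompt="a coastal landscape at dusk",
#                       control_question=control_tasks["horizon"])
```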

In terms of future implications, this research paves the way for more integrated and adaptable AI systems. It suggests the potential for LLMs and VLMs to work in tandem with other domain-specific models, broadening the scope of AI applications. Furthermore, the method's applicability to off-the-shelf models underscores its potential for widespread adoption.

The paper also highlights challenges for the dual-process model. A notable limitation is its susceptibility to misinterpreted control prompts, which can lead to unintended outputs. While such failures could be mitigated through improved VLM training or prompt engineering, they still require further refinement.

In conclusion, the authors present a compelling vision of leveraging the complementary strengths of VLMs and image generators. By distilling contextual understanding from VLMs into practical image generation tasks, they not only advance the capability of AI models to incorporate commonsense reasoning but also broaden the horizon for future developments in multimodal AI architectures. This work serves as a foundational step towards more intelligent and contextual image synthesis systems, with promising implications for both theoretical research and practical applications.
