Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models
This paper presents an approach to generative image synthesis at the intersection of art and AI. The authors use Retrieval-Augmented Diffusion Models (RDMs) for text-guided artistic image synthesis, conditioning the diffusion process on examples retrieved from an external image database.
Methodology
The methodology is built around conditional latent diffusion models coupled with retrieval from an external image database. At its core, the model uses CLIP, a contrastive vision-language model, to retrieve contextually relevant images during training: for each training instance, nearest neighbors are fetched based on similarity in CLIP's image embedding space, and their embeddings are passed to the generative model through a cross-attention mechanism.
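The sketch below illustrates this retrieval step under simplifying assumptions (it is not the authors' code): database embeddings are precomputed and held in memory, and a brute-force similarity search stands in for the approximate nearest-neighbor index a large database would require. The function and tensor names are illustrative only.

```python
# Minimal sketch: fetching CLIP nearest neighbors for a batch of training images.
# In practice an ANN index (e.g. FAISS) would replace the brute-force matmul.
import torch

def retrieve_neighbors(query_emb: torch.Tensor,
                       db_emb: torch.Tensor,
                       k: int = 4) -> torch.Tensor:
    """query_emb: (B, D) CLIP image embeddings of training images.
    db_emb: (N, D) CLIP image embeddings of the retrieval database.
    Returns (B, k, D): embeddings of the k nearest neighbors by cosine similarity."""
    query = torch.nn.functional.normalize(query_emb, dim=-1)
    db = torch.nn.functional.normalize(db_emb, dim=-1)
    sims = query @ db.T                      # (B, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices      # (B, k) neighbor indices
    return db[topk]                          # (B, k, D) neighbor embeddings

# During training, the neighbor embeddings serve as the conditioning sequence
# for the latent diffusion model's cross-attention layers, conceptually:
#   eps = unet(z_t, t, context=neighbor_embs)   # context: (B, k, D)
```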
After training, the retrieval database can be swapped for a style-specific one, such as a subset of WikiArt or ArtBench, enabling zero-shot stylization. Style control is thereby decoupled from training: new styles can be introduced without retraining, at little additional computational cost.
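The following sketch shows how such an inference-time swap might look. Details are assumptions rather than the paper's implementation: the text prompt is embedded with CLIP's text encoder (which shares an embedding space with the image encoder), its nearest neighbors are fetched from the swapped-in style database, and the combined embeddings form the conditioning sequence. The model name, pretrained tag, and the `rdm_sampler` call are placeholders.

```python
# Illustrative inference-time database swap for zero-shot stylization.
import torch
import open_clip  # assumed dependency; any CLIP implementation would do

model, *_ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def style_conditioning(prompt: str, style_db_emb: torch.Tensor, k: int = 4):
    """Return a (1, k+1, D) conditioning sequence: the CLIP text embedding of
    the prompt plus its k nearest neighbors from the style database."""
    text_emb = model.encode_text(tokenizer([prompt]))            # (1, D)
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    db = torch.nn.functional.normalize(style_db_emb, dim=-1)     # (N, D)
    idx = (text_emb @ db.T).topk(k, dim=-1).indices              # (1, k)
    return torch.cat([text_emb.unsqueeze(1), db[idx]], dim=1)    # (1, k+1, D)

# cond = style_conditioning("a lighthouse at dusk", wikiart_embeddings)
# samples = rdm_sampler(context=cond)   # hypothetical sampler call
```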
Experimental Results
The experiments use two models: one trained on ImageNet, which demonstrates general stylization capabilities, and a larger model trained on LAION-2B-en, which shows fine-grained artistic control. Both generate high-quality artistic images without being trained on paired text-image data, because the retrieval mechanism supplies the relevant style context.
The results show that the RDM approach, despite a smaller computational footprint, outperforms the common prompt-engineering practice of appending style descriptors to the text prompt. Quantitative evaluation with a style classifier confirms that retrieval-based stylization often surpasses postfix text-based stylization, with clear gains in style accuracy.
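A rough sketch of how such a style-accuracy comparison could be computed is shown below. The classifier and the sample tensors are placeholders, not artifacts from the paper: `style_classifier` stands for any network trained to predict the target art styles.

```python
# Hedged sketch of a style-accuracy metric over generated images.
import torch

@torch.no_grad()
def style_accuracy(generated_images: torch.Tensor,
                   target_style: int,
                   style_classifier: torch.nn.Module) -> float:
    """Fraction of generated images classified as the intended style."""
    logits = style_classifier(generated_images)   # (B, num_styles)
    preds = logits.argmax(dim=-1)                 # (B,)
    return (preds == target_style).float().mean().item()

# acc_retrieval = style_accuracy(samples_db_swap, style_id, clf)
# acc_postfix   = style_accuracy(samples_prompt_postfix, style_id, clf)
```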
Analysis and Implications
The implications are twofold. Practically, the method gives artists a versatile tool for steering generation without elaborate prompt crafting. Conceptually, it shows how retrieval-augmented models can shape a network's output through externally provided context, cleanly separating what the model learns from how its outputs are stylized.
The retrieval component also opens avenues for research into memory-augmented neural architectures. Future work could explore fine-tuning on paired text-image data to further streamline and expand stylistic control.
Conclusion
The paper presents a well-developed methodology and strong results for generating stylistically diverse images with retrieval-augmented diffusion models. By pairing diffusion with an external retrieval mechanism, it makes a substantive contribution to AI-driven art synthesis, emphasizing efficiency and control, and it points toward a productive combination of external data retrieval and neural synthesis in creative applications.