- The paper introduces a novel three-stage method—comprising Refiner, Retriever, and Composer—that efficiently selects and composes task-relevant adapters for image generation.
- It uses cosine similarity and language model refinements to significantly improve image quality, achieving up to 2x higher user preference ratings than baseline methods.
- Stylus offers cross-domain applicability, extending from text-to-image generation to complex tasks like inpainting and style translation while maintaining high visual fidelity.
Automatic Adapter Selection: Enhancing Image Generation with "Stylus"
Introduction to Stylus
The growth of adapter-based models has paved the way for custom image generation with significant storage and computational efficiency. Yet although open adapter repositories now exceed 100,000 entries, using them effectively remains difficult because most adapters are custom-trained and poorly documented. "Stylus" emerges as a solution, tailored to efficiently select and compose task-specific adapters for image generation based on the context and keywords of user-provided prompts.
Stylus: Methodology Overview
Summary of Approach
"Stylus" operates through a three-stage mechanism:
- Refiner: Enhances adapter descriptions and converts them into embeddings using a vision-language model (VLM) combined with a text encoder.
- Retriever: Scores adapters based on their relevance to the entire user prompt, efficiently fetching the most pertinent ones.
- Composer: Segments the prompt into discrete tasks, further pruning and categorizing the adapters, ensuring optimal relevance and minimal bias insertion.
These stages collectively improve the retrieval and application of adapters, enhancing image diversity and visual fidelity.
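The three-stage flow can be sketched end-to-end. This is a minimal, illustrative Python sketch, not the paper's code: the VLM refinement and text encoder are replaced with trivial stand-ins (a bag-of-characters "embedding"), and the names `Adapter`, `embed`, and `stylus` are assumptions for illustration.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Adapter:
    name: str
    raw_card: str                      # original, often sparse, model card
    description: str = ""              # improved description (Refiner output)
    embedding: list = field(default_factory=list)

def embed(text):
    # Stand-in for a real text encoder: a tiny bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def stylus(prompt, adapters, k=3):
    # 1. Refiner: rewrite each model card (here, a trivial placeholder
    #    for the VLM step) and embed the result.
    for a in adapters:
        a.description = a.raw_card.strip()
        a.embedding = embed(a.description)
    # 2. Retriever: score every adapter against the full prompt.
    p = embed(prompt)
    ranked = sorted(adapters, key=lambda a: cosine(p, a.embedding),
                    reverse=True)
    # 3. Composer would now segment the prompt into tasks and prune or
    #    assign the shortlisted adapters; here we just return the top-k.
    return ranked[:k]
```

In a real system the stand-in encoder would be replaced by the actual text-embedding model, and the final step by the LLM-based Composer.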
Technical Details
- The Refiner extracts robust descriptions from model cards using a VLM, then creates embeddings with the text-embedding-3-large model.
- The Retriever employs cosine similarity metrics to sift through adapters and select the top candidates directly related to a user's prompt.
- The Composer, leveraging an LLM, effectively maps adapters to specific tasks derived from the prompt, thereby curating relevant adapters and applying a binary mask for diverse image generation.
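The Composer step described above can be illustrated with a short sketch. Note the assumptions: the `assign` callback stands in for the LLM's adapter-to-task mapping, and `keep_prob` plus all other names are hypothetical, chosen here only to demonstrate the task assignment and binary masking ideas.

```python
import random

def compose(tasks, candidates, assign, keep_prob=0.5, rng=None):
    """Assign each retrieved adapter to at most one prompt task, then
    apply a per-adapter binary mask so repeated generations vary."""
    rng = rng or random.Random(0)
    per_task = {t: [] for t in tasks}
    for a in candidates:
        t = assign(a, tasks)      # an LLM call in Stylus; any matcher here
        if t is not None:
            per_task[t].append(a)
    # Binary mask: randomly drop adapters to diversify generated images.
    return {t: [a for a in ads if rng.random() < keep_prob]
            for t, ads in per_task.items()}
```

Irrelevant adapters are pruned by returning `None` from the matcher, and the random mask trades a little per-image control for greater diversity across a batch.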
Quantitative Assessments
"Stylus" has been rigorously tested on StylusDocs, a curated dataset of 75K low-rank adapters (LoRAs). It demonstrates marked improvements in CLIP/FID Pareto efficiency and user preference ratings:
- Achieving up to two times higher preference in comparative human evaluations.
- Showing significant enhancements in both visual fidelity (FID scores) and textual alignment (CLIP scores).
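"CLIP/FID Pareto efficiency" means no competing configuration achieves both a higher CLIP score (better text alignment) and a lower FID (better visual fidelity) at the same time. A small sketch of that dominance check follows; the function name and the toy scores in the usage note are illustrative, not numbers from the paper.

```python
def pareto_front(points):
    """Keep only (clip, fid) points not dominated by another point.
    Higher CLIP is better; lower FID is better."""
    front = []
    for clip, fid in points:
        dominated = any(c >= clip and f <= fid and (c, f) != (clip, fid)
                        for c, f in points)
        if not dominated:
            front.append((clip, fid))
    return front
```

For example, with toy scores `[(0.30, 20.0), (0.32, 18.0), (0.28, 25.0)]`, only `(0.32, 18.0)` survives, since it beats the others on both axes.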
Method Comparisons
When compared to existing methods such as RAG (Retrieval-Augmented Generation) and simple random adapter sampling, "Stylus" markedly surpasses both in thematic coherence and image quality, while avoiding the irrelevant biases those methods occasionally introduce.
Practical Applications and Future Prospects
Stylus is not limited to generating images from text prompts; it extends to complex image-to-image tasks such as inpainting and style translation. This adaptability showcases the model's cross-domain utility, applicable to a range of generative tasks and potentially extending to video generation.
Conclusion
"Stylus" redefines the adapter selection process for diffusion models, offering a sophisticated yet efficient system to harness the vast repertoire of adapters for tailored image creation. It proves especially valuable in environments with many specialized adapters, matching adapter capabilities to user-specific requirements to produce high-quality, diverse imagery. "Stylus" thus points toward a finer-grained, adaptive approach to generative model enhancement, inspiring future work across generative AI.