Overview of Alchemist: Turning Public Text-to-Image Data into Generative Gold
The research paper titled "Alchemist: Turning Public Text-to-Image Data into Generative Gold" introduces an innovative approach to refining text-to-image (T2I) generative models through supervised fine-tuning (SFT). The paper addresses the limitations of existing public datasets, which often focus on narrow domains, and proposes a methodology for curating high-quality, general-purpose datasets. The authors present Alchemist, a compact yet highly effective SFT dataset comprising 3,350 samples, demonstrating substantial improvements in generative quality across multiple T2I models.
Core Contributions
- Dataset Curation Methodology: The paper's primary contribution is its dataset curation process. The authors leverage a pre-trained generative model as a quality estimator to identify high-impact samples, i.e., those most likely to improve generative quality after SFT without sacrificing diversity or style (see the score-and-select sketch after this list).
- Alchemist Dataset: Applying this curation methodology, the authors construct and release the Alchemist dataset, designed specifically to sharpen T2I models' generative capabilities. Its compact size (3,350 samples) contrasts with the large, typically proprietary datasets used for SFT, making it a valuable resource for reproducible research.
- Empirical Evaluation and Findings: Experiments show that Alchemist enhances the generative quality of five publicly available Stable Diffusion variants: SD1.5, SD2.1, SDXL, SD3.5 Medium, and SD3.5 Large. (The original summary listed DALL-E 3 and Imagen 3, but those are proprietary models whose weights cannot be fine-tuned.) The dataset's impact is validated through human evaluation and automated metrics, confirming improved aesthetic quality and complexity while maintaining alignment with prompts.
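To make the curation idea concrete, here is a minimal score-and-select sketch. It is not the paper's implementation: the quality estimator below is a stand-in based on CLIP image-text similarity, whereas the paper derives its scores from a pre-trained diffusion model; the `quality_score` and `curate` helpers and the sample schema are illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Stand-in quality estimator: CLIP image-text similarity. The paper's
# actual estimator is derived from a pre-trained diffusion model and is
# not reproduced here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def quality_score(image, caption: str) -> float:
    # Cosine similarity between the CLIP image and text embeddings.
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def curate(samples: list[dict], k: int = 3350) -> list[dict]:
    # Rank every candidate pair by estimated quality and keep the top k.
    # Each sample is assumed to be {"image": PIL.Image, "caption": str}.
    ranked = sorted(samples,
                    key=lambda s: quality_score(s["image"], s["caption"]),
                    reverse=True)
    return ranked[:k]
```

Ranking the full candidate pool and truncating at k reflects the paper's emphasis on a small, high-impact subset rather than sheer data volume.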
Experimental Setup and Results
The authors undertook a comprehensive experimental setup to evaluate Alchemist's effectiveness. They fully fine-tuned several pre-trained T2I models and compared the results against both the baseline weights and models fine-tuned on alternative datasets. Human evaluators assessed generated images on four criteria: image-text relevance, aesthetic quality, image complexity, and fidelity. Automated metrics, including Fréchet Distance computed over DINOv2 features (FD-DINOv2) and CLIP Score, complemented these assessments.
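As a reference for the automated evaluation, below is a minimal sketch of the Fréchet Distance computation over two precomputed feature matrices (n_samples × dim). The formula is the standard Fréchet distance between two Gaussians fitted to the features; extracting the features with a DINOv2 backbone (e.g., via `torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")`) is assumed to happen upstream and is not shown.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FD = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; drop the tiny
    # imaginary component that numerical error can introduce.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Lower FD values indicate that the distribution of generated images sits closer to the reference distribution in feature space.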
The findings indicate significant gains in aesthetic quality and image complexity, with Alchemist-tuned models outperforming both the baseline models and LAION-Aesthetics-tuned variants on most criteria. Improvements in image-text relevance were minimal, suggesting the gains come from how images look rather than from better prompt alignment; even so, fine-tuning on Alchemist helps narrow the performance gap between existing open models and state-of-the-art solutions.
Implications and Future Directions
The paper provides critical insights into the role of dataset quality in SFT for T2I models. By demonstrating a principled approach to dataset curation, the authors offer an open-source alternative to proprietary, closed datasets, facilitating further research and commercial applications in generative AI.
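For researchers who want to build on the release, a minimal loading sketch follows, assuming the dataset is distributed via the Hugging Face Hub; the identifier "yandex/alchemist", the split name, and the field names are assumptions for illustration, not details taken from the text above.

```python
# Minimal loading sketch. Assumptions (not stated in the text above):
# the dataset is hosted on the Hugging Face Hub under "yandex/alchemist"
# and exposes a "train" split with image/prompt-style fields.
from datasets import load_dataset

ds = load_dataset("yandex/alchemist", split="train")
print(len(ds))        # expected on the order of 3,350 curated samples
print(ds[0].keys())   # inspect the actual field names before use
```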
Looking forward, the research highlights the trade-off between image complexity and fidelity: pushing models toward more complex, detail-rich outputs can come at some cost to fidelity, leaving room for curation criteria that balance the two. As the community explores practical applications of generative models, the Alchemist dataset offers a foundation for systematic improvement in text-to-image synthesis.
In conclusion, "Alchemist: Turning Public Text-to-Image Data into Generative Gold" represents a noteworthy advancement in the field of generative AI, providing a compact, high-quality dataset that supports robust improvements in T2I model performance. This research empowers further exploration and innovation in aesthetic quality enhancement, contributing to broader progress in AI-powered visual content generation.