An Overview of "CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images"
The paper "CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images," authored by researchers from Cornell Tech and Databricks Mosaic, presents a text-to-image (T2I) model, CommonCanvas, trained exclusively on Creative-Commons (CC) licensed images. This unique approach poses significant challenges due to the lack of captions and the relative scarcity of high-resolution CC images compared to datasets like LAION-2B. The authors address these challenges leveraging transfer learning techniques and introducing several training optimizations to create a model competitive with Stable Diffusion 2 (SD2).
Motivations and Objectives
Training high-quality T2I models typically requires vast datasets of paired image-caption data. Datasets like LAION-2B, derived from web-scraped data, have historically served this purpose, but such practices raise legal and reproducibility concerns. The authors ask whether a competitive T2I model can be built using only openly licensed, Creative-Commons images, sidestepping potential copyright issues and making the training dataset reproducible.
Methodology
Addressing Data Incompleteness: Synthetic Captioning via Telephoning
A central obstacle to using CC images is that they lack the captions needed for T2I model training. To address this, the authors employ a transfer-learning technique they term "telephoning": the pre-trained BLIP-2 model generates high-quality synthetic captions for the raw CC images, and the resulting image-caption pairs form their new dataset, CommonCatalog. This method effectively taps the knowledge encoded in pre-trained models to supply labels for otherwise unlabeled data.
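As a rough illustration of this captioning step, the sketch below runs an off-the-shelf BLIP-2 checkpoint over a single uncaptioned image using Hugging Face transformers. The specific checkpoint, generation settings, and file name are assumptions for illustration; the paper's exact captioning configuration may differ.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Checkpoint choice is an assumption; the paper uses BLIP-2, but the exact
# variant and generation settings may differ from what is shown here.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

def caption_image(path: str) -> str:
    """Produce one synthetic caption for an uncaptioned CC image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    generated_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

# Hypothetical file name, purely for illustration.
print(caption_image("cc_photo_0001.jpg"))
```

Run over the full image collection, this kind of loop yields the synthetic image-caption pairs that make up CommonCatalog.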
Handling Data Scarcity: Efficient Training Techniques
The authors also examine how dataset size affects model performance, hypothesizing that far less data than LAION-2B provides may suffice. Experiments on progressively smaller random subsets of LAION-2B show that performance competitive with SD2 can be reached with only about 3% of the original training data. A set of optimizations, including Flash Attention, pre-computation of VAE latents, and reduced-precision training, yields a 2.71X training speedup, making this experimentation far more affordable (see the sketch below).
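To give a concrete sense of one of these optimizations, here is a minimal sketch of pre-computing VAE latents with the diffusers library, so the image encoder runs once per image rather than on every training step. The checkpoint id, dataloader, and shard naming are illustrative assumptions, not the authors' actual training code.

```python
import torch
from diffusers import AutoencoderKL

# Assumed checkpoint: the SD2 VAE is a natural choice given the paper's SD2
# baseline, but the repo id and shard naming here are purely illustrative.
device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-base", subfolder="vae"
).to(device)
vae.requires_grad_(False)

@torch.no_grad()
def precompute_latents(pixel_values: torch.Tensor) -> torch.Tensor:
    """Encode a batch of images (N, 3, H, W), scaled to [-1, 1], into VAE latents.

    Caching these latents once removes the VAE encoder from every training
    step, which is one ingredient of the reported training speedup.
    """
    latents = vae.encode(pixel_values.to(device)).latent_dist.sample()
    return (latents * vae.config.scaling_factor).cpu()

# Hypothetical usage with a dataloader yielding normalized image batches:
# for i, batch in enumerate(dataloader):
#     torch.save(precompute_latents(batch), f"latents_shard_{i:05d}.pt")
```

Flash Attention and reduced-precision arithmetic are orthogonal to this caching step and are typically enabled in the attention kernels and training loop respectively.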
Results and Evaluation
Comparison with Stable Diffusion 2
The CommonCanvas models are evaluated against SD2 using both automated metrics (FID, KID, CLIP-FID) and human evaluations. The paper reports that the largest model, CommonCanvas-LNC, matches SD2 in human evaluations on Parti Prompts, despite being trained on a dataset less than 3% the size of LAION-2B. Notably, the CommonCanvas models are far less prone to generating copyrighted characters and iconic figures, a potential advantage over models trained on broader, less curated datasets.
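For readers unfamiliar with these metrics, the sketch below computes FID and KID with torchmetrics over batches of real and generated images; CLIP-FID is the same Fréchet distance computed on CLIP image features instead of Inception features. This is an illustrative evaluation harness under assumed tensor shapes and sample counts, not the paper's exact pipeline.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

# uint8 image tensors of shape (N, 3, 299, 299); in practice the "real" set
# comes from a reference dataset and the "fake" set from model generations
# for the same prompts. Random tensors here are placeholders.
real_images = torch.randint(0, 255, (128, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (128, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)   # Inception-V3 features
kid = KernelInceptionDistance(subset_size=64)  # needs >= subset_size samples per set

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
kid.update(real_images, real=True)
kid.update(fake_images, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```

Lower FID and KID indicate generated images whose feature statistics are closer to the reference set; the human evaluations complement these scores by directly comparing prompt fidelity.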
Practical and Theoretical Implications
The practical implications of this research are significant. By demonstrating that competitive T2I models can be trained on openly licensed data, the work mitigates legal risks associated with copyrighted material. From a theoretical standpoint, the success of training on much smaller datasets suggests that current models might be under-parameterized, opening avenues for future research in model architecture and efficiency improvements.
Future Directions
The authors suggest that future work could involve augmenting CommonCatalog with additional CC images from various sources, allowing for even larger open datasets. Additionally, they hint at exploring more advanced and larger model architectures, further pushing the boundaries of what is achievable with openly licensed data.
Conclusion
The paper "CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images" provides an insightful examination of the feasibility of training high-quality T2I models on openly licensed datasets. Through innovative transfer learning techniques and efficient training optimizations, the authors successfully develop a model that stands toe-to-toe with state-of-the-art models, while steering clear of potential copyright infringements. This work not only contributes to the field by proposing a practical solution to a significant issue but also sets a precedent for future research in the field of ethically sourced AI training data.