- The paper introduces CreTok, a novel token (<CreTok>) designed to enhance semantic understanding for generating creative combinations in text-to-image diffusion models without requiring full model retraining.
- CreTok utilizes the CangJie dataset and a training process that optimizes the semantic alignment between prompts containing <CreTok> and literal concept combinations, using a loss function based on cosine similarity.
- Experimental results across quantitative metrics, GPT-4o evaluation, and user studies demonstrate CreTok's superior performance in generating creative, well-integrated, and aesthetically pleasing images compared to state-of-the-art models.
The paper "Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation" (2410.24160) introduces CreTok, an approach for improving the semantic understanding of combinatorial creativity in text-to-image diffusion models. CreTok redefines "creative" as a new token, <CreTok>, enabling the direct generation of creative combinations without requiring model retraining.
CreTok Methodology
CreTok addresses the abstract nature of "creative" by representing it with a specialized token, <CreTok>. This token acts as a universal "adjective" that can be applied across different creative concept generations. Unlike textual inversion and other token-based personalization methods, which assign a unique token to each specific novel concept, <CreTok> describes how to generate creativity.
CreTok leverages a specially constructed dataset called CangJie, designed for the Creative Text Pair to Object (TP2O) task. CangJie comprises diverse text pairs (e.g., "Lettuce," "Mantis") representing different concepts to be combined. The training process involves iteratively sampling a text pair (t1,t2) from CangJie in each step. These sampled text pairs are then used to formulate two types of prompts:
- Adaptive Prompt: A prompt incorporating <CreTok>, designed to elicit a creative combination (e.g., "A photo of a <CreTok> mixture").
- Restrictive Prompt: A prompt combining the text pair literally (e.g., "A t1 t2"). The order of t1 and t2 is alternated in different iterations to reduce bias toward one concept over the other.
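The prompt construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the `TEXT_PAIRS` list, the function names, and the rule of swapping the order on odd steps are assumptions made here; the source only says that the order is alternated across iterations.

```python
import random

# Hypothetical stand-in for the CangJie text pairs. Only the first pair
# comes from the example in the text; the second is made up.
TEXT_PAIRS = [("lettuce", "mantis"), ("owl", "teapot")]

ADAPTIVE_PROMPT = "A photo of a <CreTok> mixture"


def build_prompts(pair, step):
    """Return (adaptive, restrictive) prompts for one training step."""
    t1, t2 = pair
    # Assumed alternation rule: swap the concept order on odd steps so
    # that neither concept is systematically placed first.
    if step % 2 == 1:
        t1, t2 = t2, t1
    restrictive = f"A {t1} {t2}"
    return ADAPTIVE_PROMPT, restrictive


def sample_prompts(step, rng=random):
    """Sample one text pair and build both prompts for this step."""
    pair = rng.choice(TEXT_PAIRS)
    return build_prompts(pair, step)
```

Calling `build_prompts(("lettuce", "mantis"), 0)` gives the restrictive prompt "A lettuce mantis", while step 1 gives "A mantis lettuce"; the adaptive prompt is the same in both cases.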
The core of CreTok lies in optimizing the semantic alignment between the adaptive and restrictive prompts. Both prompts are passed through a text encoder (e.g., CLIP) to obtain text embeddings, and the cosine similarity between these embeddings is computed. The <CreTok> token is then iteratively refined to maximize this similarity, that is, to minimize the difference between the two embeddings. This encourages the model to associate <CreTok> with the semantic transformation needed to combine the two concepts creatively, rather than with either concept alone.
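To make the idea of iteratively refining a vector toward higher cosine similarity concrete, here is a toy sketch in pure Python. It is a simplification under explicit assumptions: the learnable vector is optimized directly against made-up target vectors, whereas in the setup described above the token would pass through a text encoder such as CLIP before the similarity is measured. The names `refine_token`, `targets`, and `start`, as well as the step count and learning rate, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_grad(u, v):
    """Gradient of cosine(u, v) with respect to u."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    c = cosine(u, v)
    return [b / (norm_u * norm_v) - c * a / (norm_u ** 2) for a, b in zip(u, v)]


def refine_token(token, targets, steps=200, lr=0.1):
    """Gradient ascent on cosine similarity, cycling through the targets."""
    token = list(token)
    for step in range(steps):
        target = targets[step % len(targets)]
        grad = cosine_grad(token, target)
        token = [t + lr * g for t, g in zip(token, grad)]
    return token


# Toy vectors standing in for restrictive-prompt embeddings (made up).
targets = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.3]]
start = [0.0, 1.0, 1.0]
refined = refine_token(start, targets)
```

Because the vector is pulled toward a different target on each step, it settles in a direction between the targets instead of matching any single one, which loosely mirrors the goal of not representing either concept alone.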
The loss function is designed to maximize the cosine similarity between the text embeddings of the restrictive prompt (e.g., "a lettuce mantis") and the adaptive prompt (e.g., "a photo of a <CreTok> mixture"). A loss threshold ($\theta$) is incorporated to prevent overfitting to individual concepts and excessive semantic carryover. The loss uses the maximum of the cosine similarity and the threshold: $L_{\text{mix}} = 1 - \max\left[\cos\left(E(P_r(t_1, t_2)), E(P_a)\right), \theta\right]$, where $E$ is the text encoder, $P_r(t_1, t_2)$ is the restrictive prompt, and $P_a$ is the adaptive prompt. Experiments found that a threshold of $\theta = 0.5$ balances capturing semantic representations against promoting combinatorial object generalization.
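The loss formula can be written out directly as a small function. The sketch below implements the expression exactly as stated in this summary, using plain Python lists in place of real encoder outputs; the function names and the example vectors are assumptions for illustration, not part of the paper's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mix_loss(restrictive_emb, adaptive_emb, theta=0.5):
    """L_mix = 1 - max(cos(E(P_r), E(P_a)), theta), as written in the text."""
    sim = cosine_similarity(restrictive_emb, adaptive_emb)
    return 1.0 - max(sim, theta)


# Made-up 2-D vectors standing in for text embeddings.
loss_aligned = mix_loss([1.0, 0.0], [1.0, 0.0])     # similarity 1.0
loss_partial = mix_loss([1.0, 0.0], [0.8, 0.6])     # similarity 0.8
loss_orthogonal = mix_loss([1.0, 0.0], [0.0, 1.0])  # similarity 0.0
```

With the formula taken literally, the loss never exceeds 1 − θ: any similarity at or below the threshold produces the same value, and only similarities above the threshold change the loss.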
Experimental Results
The experimental results demonstrate the effectiveness of CreTok in generating creative combinations, as evidenced by quantitative metrics, GPT-4o evaluations, and user studies.
Quantitative Metrics
- VQAScore: CreTok achieves a VQAScore of 0.835, outperforming Stable Diffusion 3 (0.793), Stable Diffusion 3.5 (0.805), Kandinsky 3 (0.771), and BASS (0.710), indicating better text-image alignment in combinatorial generation.
- PickScore: CreTok achieves a PickScore of 21.775, slightly higher than Stable Diffusion 3 (21.716) and Kandinsky 3 (21.637), roughly equal to Stable Diffusion 3.5 (21.766), and higher than BASS (20.799), showing slightly higher human preference ratings.
- ImageReward: CreTok achieves an ImageReward of 1.065, well above Stable Diffusion 3 (0.896), Stable Diffusion 3.5 (0.881), Kandinsky 3 (0.634), and BASS (0.481), indicating stronger alignment with aesthetic standards and human preferences.
GPT-4o Evaluation
GPT-4o was used to assess the creativity of generated images across four dimensions:
- Integration: CreTok scored 9.5, higher than Stable Diffusion 3 (8.1), Kandinsky 3 (8.9), Stable Diffusion 3.5 (9.1), and BASS (8.9), indicating superior conceptual integration.
- Alignment: CreTok scored 9.9, equal to Stable Diffusion 3.5 (9.9) and higher than Stable Diffusion 3 (8.7), Kandinsky 3 (9.7), and BASS (9.3), showing alignment with the provided prompts that matches or exceeds every baseline.
- Originality: CreTok scored 9.3, higher than Stable Diffusion 3 (8.2), Kandinsky 3 (9.0), Stable Diffusion 3.5 (9.1) and BASS (8.7), demonstrating greater innovativeness in the generated concepts.
- Aesthetics: CreTok scored 9.6, higher than Stable Diffusion 3 (9.0), Kandinsky 3 (9.2), Stable Diffusion 3.5 (9.4) and BASS (8.3), showing better visual appeal.
User Study
A user study with 50 participants ranked the creativity of images generated by CreTok against those from other methods. CreTok achieved an average rank of 1.9 (lower is better), clearly ahead of Stable Diffusion 3 (3.4), Stable Diffusion 3.5 (3.1), Kandinsky 3 (3.3), and BASS (3.1), highlighting its superior perceived creativity.