Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation (2410.24160v3)

Published 31 Oct 2024 in cs.CV and cs.CL

Abstract: "Creative" remains an inherently abstract concept for both humans and diffusion models. While text-to-image (T2I) diffusion models can easily generate out-of-distribution concepts like "a blue banana", they struggle with generating combinatorial objects such as "a creative mixture that resembles a lettuce and a mantis", due to difficulties in understanding the semantic depth of "creative". Current methods rely heavily on synthesizing reference prompts or images to achieve a creative effect, typically requiring retraining for each unique creative output, a process that is computationally intensive and limits practical applications. To address this, we introduce CreTok, which brings meta-creativity to diffusion models by redefining "creative" as a new token, <CreTok>, thus enhancing models' semantic understanding for combinatorial creativity. CreTok achieves such redefinition by iteratively sampling diverse text pairs from our proposed CangJie dataset to form adaptive prompts and restrictive prompts, and then optimizing the similarity between their respective text embeddings. Extensive experiments demonstrate that <CreTok> enables the universal and direct generation of combinatorial creativity across diverse concepts without additional training, achieving state-of-the-art performance with improved text-image alignment and higher human preference ratings. Code will be made available at https://github.com/fu-feng/CreTok.

Summary

The paper introduces CreTok, a novel token (<CreTok>) designed to enhance semantic understanding for generating creative combinations in text-to-image diffusion models without requiring full model retraining.
CreTok utilizes the CangJie dataset and a training process that optimizes the semantic alignment between prompts containing <CreTok> and literal concept combinations, using a loss function based on cosine similarity.
Experimental results across quantitative metrics, GPT-4o evaluation, and user studies demonstrate CreTok's superior performance in generating creative, well-integrated, and aesthetically pleasing images compared to state-of-the-art models.

The paper "Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation" (2410.24160) introduces CreTok, a novel approach to enhance the semantic understanding of combinatorial creativity in text-to-image diffusion models. CreTok redefines "creative" as a new token, <CreTok>, enabling the direct generation of creative combinations without requiring model retraining.

CreTok Methodology

CreTok addresses the abstract nature of "creative" by representing it with a specialized token, <CreTok>. This token acts as a universal "adjective" applicable across various creative concept generations, describing how to generate creativity, unlike textual inversion or other token-based personalization methods that assign unique tokens to specific novel concepts.

CreTok leverages a specially constructed dataset called CangJie, designed for the Creative Text Pair to Object (TP2O) task. CangJie comprises diverse text pairs (e.g., "Lettuce," "Mantis") representing different concepts to be combined. The training process involves iteratively sampling a text pair $(t_1, t_2)$ from CangJie in each step. These sampled text pairs are then used to formulate two types of prompts:

Adaptive Prompt: A prompt incorporating <CreTok>, designed to elicit a creative combination (e.g., "A photo of a <CreTok> mixture").
Restrictive Prompt: A prompt combining the text pair literally (e.g., "A $t_1$ $t_2$ "). The order of $t_1$ and $t_2$ is alternated in different iterations to reduce bias toward one concept over the other.
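The prompt construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TEXT_PAIRS` is a hypothetical stand-in for the CangJie dataset (only the lettuce/mantis pair comes from the text; the others are placeholders), and the function names are invented for this example rather than taken from the authors' code.

```python
import random

# Hypothetical stand-in for CangJie. Only ("lettuce", "mantis") appears in the
# text above; the remaining pairs are placeholders for illustration.
TEXT_PAIRS = [("lettuce", "mantis"), ("concept_a", "concept_b"), ("concept_c", "concept_d")]

# The adaptive prompt contains <CreTok> and does not name either concept.
ADAPTIVE_PROMPT = "A photo of a <CreTok> mixture"


def restrictive_prompt(t1, t2, step):
    """Combine the pair literally, swapping the order on odd steps to reduce bias."""
    first, second = (t1, t2) if step % 2 == 0 else (t2, t1)
    return f"A {first} {second}"


def sample_prompts(step, rng):
    """Sample one text pair and return (adaptive prompt, restrictive prompt)."""
    t1, t2 = rng.choice(TEXT_PAIRS)
    return ADAPTIVE_PROMPT, restrictive_prompt(t1, t2, step)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in range(4):
        print(step, sample_prompts(step, rng))
```

Swapping by step parity is one simple way to alternate the order of $t_1$ and $t_2$; the text does not say which alternation scheme the authors use.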

The core of CreTok lies in optimizing the semantic alignment between the adaptive and restrictive prompts. The text embeddings of both prompts are obtained using a text encoder (e.g., CLIP). The similarity between these embeddings is calculated (using cosine similarity). The <CreTok> token is then iteratively refined to minimize the difference (or maximize the cosine similarity) between the text embeddings of these two prompts. This encourages the model to associate <CreTok> with the semantic transformation needed to combine the two concepts creatively, rather than just representing either concept alone.

The loss function is designed to maximize the cosine similarity between the text embeddings of the restrictive prompt (e.g., "a lettuce mantis") and the adaptive prompt (e.g., "a photo of a <CreTok> mixture"). A loss threshold $\theta$ is incorporated to prevent overfitting to individual concepts and excessive semantic carryover. The loss uses the maximum of the cosine similarity and the threshold: $L_{\text{mix}} = 1 - \max\left[\cos\left(E(P_r(t_1, t_2)),\, E(P_a)\right),\, \theta\right]$, where $E$ is the text encoder, $P_r(t_1, t_2)$ is the restrictive prompt, and $P_a$ is the adaptive prompt. The authors report that $\theta = 0.5$ works best, balancing the capture of semantic representations against generalization to new combinatorial objects.
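As a rough sketch, the loss can be evaluated exactly as the formula above is written. The example uses plain Python lists as toy embeddings; in practice the vectors would come from the text encoder, and the function names `cosine_similarity` and `mix_loss` are assumptions made for this illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mix_loss(restrictive_emb, adaptive_emb, theta=0.5):
    """L_mix = 1 - max[cos(E(P_r), E(P_a)), theta], following the formula in the text."""
    cos = cosine_similarity(restrictive_emb, adaptive_emb)
    return 1.0 - max(cos, theta)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for text-encoder embeddings.
    print(mix_loss([1.0, 0.0], [1.0, 0.0]))  # identical directions
    print(mix_loss([1.0, 0.0], [0.0, 1.0]))  # orthogonal directions
```

With this form, any cosine similarity below $\theta$ is clamped to $\theta$, so the loss takes values only between $0$ and $1 - \theta$.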

Experimental Results

The experimental results demonstrate the effectiveness of CreTok in generating creative combinations, as evidenced by quantitative metrics, GPT-4o evaluations, and user studies.

Quantitative Metrics

VQAScore: CreTok achieves a VQAScore of 0.835, outperforming Stable Diffusion 3 (0.793), Stable Diffusion 3.5 (0.805), Kandinsky 3 (0.771), and BASS (0.710), indicating better text-image alignment in combinatorial generation.
PickScore: CreTok achieves a PickScore of 21.775, slightly above Stable Diffusion 3 (21.716) and Kandinsky 3 (21.637), roughly on par with Stable Diffusion 3.5 (21.766), and well above BASS (20.799), indicating comparable or slightly higher human preference.
ImageReward: CreTok achieves an ImageReward of 1.065, considerably higher than Stable Diffusion 3 (0.896), Stable Diffusion 3.5 (0.881), Kandinsky 3 (0.634), and BASS (0.481), indicating stronger alignment with aesthetic standards and human preferences.

GPT-4o Evaluation

GPT-4o was used to assess the creativity of generated images across four dimensions:

Integration: CreTok scored 9.5, higher than Stable Diffusion 3 (8.1), Kandinsky 3 (8.9), Stable Diffusion 3.5 (9.1), and BASS (8.9), indicating superior conceptual integration.
Alignment: CreTok scored 9.9, matching Stable Diffusion 3.5 and exceeding Stable Diffusion 3 (8.7), Kandinsky 3 (9.7), and BASS (9.3), so its alignment with the provided prompts is at least as strong as that of every baseline.
Originality: CreTok scored 9.3, higher than Stable Diffusion 3 (8.2), Kandinsky 3 (9.0), Stable Diffusion 3.5 (9.1) and BASS (8.7), demonstrating greater innovativeness in the generated concepts.
Aesthetics: CreTok scored 9.6, higher than Stable Diffusion 3 (9.0), Kandinsky 3 (9.2), Stable Diffusion 3.5 (9.4) and BASS (8.3), showing better visual appeal.
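To compare the methods at a glance, the four GPT-4o scores listed above can be averaged per method. The unweighted mean is an assumption made for this sketch; the summary does not say that the paper aggregates the dimensions this way.

```python
# GPT-4o scores as listed above: (integration, alignment, originality, aesthetics).
SCORES = {
    "CreTok": (9.5, 9.9, 9.3, 9.6),
    "Stable Diffusion 3": (8.1, 8.7, 8.2, 9.0),
    "Stable Diffusion 3.5": (9.1, 9.9, 9.1, 9.4),
    "Kandinsky 3": (8.9, 9.7, 9.0, 9.2),
    "BASS": (8.9, 9.3, 8.7, 8.3),
}


def mean_scores(scores):
    """Unweighted mean over the four dimensions (an assumption of this sketch)."""
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


def ranked(scores):
    """Method names sorted from highest to lowest mean score."""
    means = mean_scores(scores)
    return sorted(means, key=means.get, reverse=True)


if __name__ == "__main__":
    means = mean_scores(SCORES)
    for name in ranked(SCORES):
        print(f"{name}: {means[name]:.3f}")
```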

User Study

In a user study, 50 participants ranked the creativity of images generated by CreTok against those from other methods. CreTok achieved an average rank of 1.9 (lower is better), ahead of Stable Diffusion 3 (3.4), Stable Diffusion 3.5 (3.1), Kandinsky 3 (3.3), and BASS (3.1), indicating that participants perceived its outputs as the most creative.