
ColorConceptBench: Probabilistic Color Semantics

Updated 30 January 2026
  • ColorConceptBench is a large-scale benchmark that evaluates text-to-image models on their ability to map abstract color semantics to probabilistic color distributions.
  • It leverages a curated dataset of 1,281 implicit color concepts with rigorous annotation and quality controls, including pixel-level color quantization in CIELAB space.
  • Results reveal that current T2I architectures struggle with nuanced color shifts, highlighting the need for new training paradigms that mimic human-like color understanding.

ColorConceptBench is a large-scale, human-verified benchmark designed to evaluate how text-to-image (T2I) models ground implicit color semantics via probabilistic color-concept associations. Unlike previous benchmarks emphasizing explicit color names or codes, ColorConceptBench probes abstract modifiers and nuanced conceptual shifts by leveraging a richly annotated dataset and rigorously defined distributional metrics. The benchmark reveals fundamental shortcomings in current T2I architectures, underscoring the necessity for new training paradigms that incorporate human-like, probabilistic understanding of color-related abstract meaning (Ruan et al., 23 Jan 2026).

1. Dataset Construction and Annotation Principles

ColorConceptBench comprises 1,281 implicit color concepts—each formed by combining base nouns (e.g., “apple,” “lake,” “cabin,” taken from the THINGS dataset and filtered by COCA word frequency) with two classes of non-explicit modifiers:

  • Visual States (e.g., fresh, rotten, polluted)
  • Emotional Moods (e.g., cozy, lonely, oppressive)

Concept selection involved iterative expert curation and designer feedback, resulting in a lexicon discretized into eight object categories. For each concept, one clean sketch was selected (from five generated by @@@@3@@@@ and SD 3.5) for further annotation. Five professional designers independently colorized each sketch in accordance with semantic intuition, with instructions and controlled practice trials to maintain annotation consistency.

A rigorous three-stage quality-control protocol was enforced:

  1. Expert review on inter-round disagreement
  2. Pairwise agreement via averaged Earth Mover’s Distance (EMD):

\overline{\mathrm{EMD}}(c) = \frac{1}{10} \sum_{i < j} \mathrm{EMD}(p_i, p_j)

  3. Blind expert verification for high-variance samples
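
The pairwise-agreement statistic averages EMD over all annotator pairs (five annotators yield ten pairs). A minimal sketch, with the pairwise EMD passed in as a callable (a hypothetical placeholder, not the benchmark's implementation):

```python
from itertools import combinations

def mean_pairwise_emd(annotator_dists, emd):
    """Average EMD over all unordered annotator pairs; with five
    annotators this is the 10-pair mean used in quality control.
    `emd` is any pairwise distance callable (placeholder here)."""
    pairs = list(combinations(annotator_dists, 2))
    return sum(emd(p, q) for p, q in pairs) / len(pairs)
```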

Segmentation of concept regions utilized DINO and SAM. Pixel-level color quantization was performed in CIELAB (or UW71) space with adaptive clustering by CIE ΔE00 distances (merging colors with ΔE00 ≤ 7). The probabilistic human color distribution for each concept k is operationalized as:

P(c \mid k) = \frac{N(c, k)}{\sum_{c'} N(c', k)}

where N(c, k) is the total count of pixels assigned to bin c for concept k.
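
The quantize–merge–normalize pipeline above can be sketched as follows. This is a non-authoritative sketch: plain Euclidean CIELAB distance (CIE76) stands in for the ΔE00 formula, and the greedy merge is an assumed simplification of the paper's adaptive clustering:

```python
import numpy as np

def color_distribution(pixels_lab, bin_centers, merge_thr=7.0):
    """Build P(c|k) from the segmented concept region.

    pixels_lab  : (N, 3) CIELAB pixels of the concept region.
    bin_centers : (K, 3) CIELAB bin centers.
    Euclidean CIELAB distance approximates Delta-E 2000 here.
    """
    # 1. Assign every pixel to its nearest color bin.
    dists = np.linalg.norm(pixels_lab[:, None] - bin_centers[None, :], axis=-1)
    counts = np.bincount(dists.argmin(1), minlength=len(bin_centers)).astype(float)

    # 2. Merge perceptually close bins (Delta-E <= merge_thr): fold each
    #    bin into the most populous bin within the threshold.
    for i in np.argsort(counts):                 # least populous first
        d = np.linalg.norm(bin_centers - bin_centers[i], axis=1)
        cand = np.flatnonzero((d <= merge_thr) & (counts > counts[i]))
        if cand.size and counts[i] > 0:
            j = cand[counts[cand].argmax()]
            counts[j] += counts[i]
            counts[i] = 0.0

    # 3. Normalize: P(c|k) = N(c,k) / sum_c' N(c',k)
    return counts / counts.sum()
```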

2. Probabilistic Formulation of Color-Concept Semantic Distributions

Each concept k is encoded by a discrete probability vector over K color bins:

P(c \mid k), \quad c = 1, \ldots, K, \qquad \sum_{c=1}^{K} P(c \mid k) = 1

Generated model images for concept k are processed identically to extract matched model distributions:

Q(c \mid k) = \frac{M(c, k)}{\sum_{c'} M(c', k)}

This probabilistic framework enables nuanced comparison between the diversity and central tendency of human and model outputs, extending color-semantic evaluation beyond deterministic assignments.

3. Metrics for Distributional Alignment and Feature Fidelity

ColorConceptBench introduces a set of probabilistic and deterministic metrics for quantifying the alignment between human and model color-concept associations:

  • Pearson Correlation Coefficient (PCC):

\mathrm{PCC}(P, Q) = \frac{\sum_c (P_c - \bar{P})(Q_c - \bar{Q})}{\sqrt{\sum_c (P_c - \bar{P})^2} \sqrt{\sum_c (Q_c - \bar{Q})^2}}

  • Earth Mover’s Distance (EMD, in CIELAB space):

\mathrm{EMD}(P, Q) = \min_{f_{ij} \geq 0} \sum_{i, j} f_{ij} d_{ij}

subject to the flow constraints \sum_j f_{ij} = P_i and \sum_i f_{ij} = Q_j, with d_{ij} the binwise Euclidean distance.

  • Entropy Difference (ED):

\mathrm{ED}(P, Q) = \left| -\sum_c P_c \log P_c + \sum_c Q_c \log Q_c \right|

  • (Optional) Kullback-Leibler Divergence:

D_{\mathrm{KL}}(P \parallel Q) = \sum_c P_c \log \frac{P_c}{Q_c}

(with smoothing for zero values).

  • Dominant Color Accuracy (DCA):

\mathrm{DCA} = \frac{1}{|\mathcal{C}|} \sum_k \mathbb{1}\left( \arg\max_c P(c \mid k) = \arg\max_c Q(c \mid k) \right)

  • Hue Angular Difference (ΔHue):

\Delta\mathrm{Hue} = \frac{1}{|\mathcal{C}|} \sum_k \min\left( |\theta_k^H - \theta_k^M|, 360^\circ - |\theta_k^H - \theta_k^M| \right)

where \theta_k^H and \theta_k^M are the mean hue angles of the human and model dominant bins.
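
Several of these metrics can be sketched in NumPy/SciPy. This is a hedged sketch, not the benchmark's code: the EMD is solved directly as the transportation linear program defined above (a dedicated optimal-transport solver such as POT would be preferable at scale), and `centers` is an assumed array of CIELAB bin coordinates:

```python
import numpy as np
from scipy.optimize import linprog

def pcc(P, Q):
    """Pearson correlation between binwise distributions P and Q."""
    Pc, Qc = P - P.mean(), Q - Q.mean()
    return (Pc @ Qc) / np.sqrt((Pc @ Pc) * (Qc @ Qc))

def emd(P, Q, centers):
    """EMD as the transportation LP: minimize sum_ij f_ij d_ij subject
    to sum_j f_ij = P_i and sum_i f_ij = Q_j, where d_ij is the
    Euclidean distance between bin centers."""
    K = len(P)
    D = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    A_eq, b_eq = [], []
    for i in range(K):                 # row sums: sum_j f_ij = P_i
        row = np.zeros((K, K)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(P[i])
    for j in range(K):                 # column sums: sum_i f_ij = Q_j
        col = np.zeros((K, K)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(Q[j])
    res = linprog(D.ravel(), A_eq=np.asarray(A_eq), b_eq=np.asarray(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

def entropy_diff(P, Q, eps=1e-12):
    """ED(P, Q) = |H(P) - H(Q)|; eps guards log(0)."""
    H = lambda p: -np.sum(p * np.log(p + eps))
    return abs(H(P) - H(Q))

def dca(Ps, Qs):
    """Fraction of concepts whose dominant (argmax) bins coincide."""
    return float(np.mean([p.argmax() == q.argmax() for p, q in zip(Ps, Qs)]))

def hue_diff(theta_h, theta_m):
    """Circular hue-angle difference in degrees, wrapping at 360."""
    d = abs(theta_h - theta_m) % 360.0
    return min(d, 360.0 - d)
```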

4. Model Evaluation Protocols

Seven open-source T2I models were evaluated:

  • Stable Diffusion 3 (2B)
  • Stable Diffusion 3.5 (2.5B)
  • Stable Diffusion XL (2.6B)
  • SANA-1.5 (4.8B)
  • OmniGen2 (4.8B)
  • Flux.1-dev (12B)
  • Qwen-Image (20B)

For each concept, two visual styles (Natural photo, Clipart) and seven CFG scales (s ∈ {1.0, 1.5, …, 7.0}) were tested, with five samples per (concept, style, s) combination at 1024×1024 resolution. Classifier-free guided diffusion followed:

\epsilon_{\mathrm{guided}}(x_t) = \epsilon_\theta(x_t, c) + s \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)

All images were segmented and quantized to obtain Q(c \mid k) in direct correspondence with the human colorizations.
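
The guidance update can be written as a one-line function. A minimal sketch, where `eps_cond` and `eps_uncond` stand for ε_θ(x_t, c) and ε_θ(x_t, ∅):

```python
import numpy as np

def cfg_step(eps_cond, eps_uncond, s):
    """Classifier-free-guidance combination of noise predictions:
    eps_guided = eps_cond + s * (eps_cond - eps_uncond).
    Under this parameterization, s = 0 recovers the purely conditional
    prediction; larger s pushes further along the prompt direction."""
    return eps_cond + s * (eps_cond - eps_uncond)
```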

5. Benchmark Results and Critical Findings

Quantitative alignment between model and human distributions yielded the following ranges:

  • PCC: 0.62–0.74
  • EMD: 26–36
  • ED: 0.48–0.64

Clipart consistently outperformed Natural style; pure-noun prompts led to better model–human alignment than visual-state or emotional modifiers. The highest overall score was observed for SANA-1.5 (Clipart: PCC ≈ 0.743, EMD ≈ 26.55, ED ≈ 0.477).

Dominant color accuracy remains low (0.07–0.19), with ΔHue discrepancies of 27°–42°, underscoring model difficulty in pinpointing the peak color. A 2,700-vote human–model alignment study indicated that EMD correlates best with human judgments (Pearson ≈ 0.53, agreement 62%).

Qualitative analysis identified three typical model behaviors under adjective modification:

  1. Semantic inertia: Negligible color shift.
  2. Over-correction: Shift removes object identity.
  3. Precise adaptation: Appropriately balanced color modulation.

Model sensitivity to modifiers fell short of human levels: EMD shift intensity reached only ≈78% (Visual State) and ≈69% (Emotional Mood) of the human baseline. Increasing the CFG scale did not improve color alignment; higher CFG generally increased EMD. Larger models (Qwen-Image, 20B) did not show uniform superiority over smaller ones (SD XL, 2.6B).

6. Implications and Future Directions

ColorConceptBench shows that current T2I models hold robust concrete object–color priors (e.g., “apple → red”) yet lack sensitivity to abstract or implicit modifiers. Parameter scaling and intensified guidance offer no substantive benefit in closing this gap; implicit color semantics emerge as a fundamentally neglected axis in standard training regimes.

The authors advocate for a paradigm shift: model architectures and pretraining data must incorporate probabilistic, human-grounded color–concept distributions rather than point color labels. Possible extensions include embedding semantic-aware color–concept representation layers and collecting cross-cultural annotations to quantify symbolic variation.

A plausible implication is that integrating specialized models, such as ellipsoidal partitioning frameworks for color-term categorization (Akbarinia et al., 2017), or robust real-image editing benchmarks like COLORBENCH (Yin et al., 2024), could enrich future evaluation pipelines by providing explicit ground truth for complex or cross-modal color associations.

7. Comparative Benchmarks and Research Context

ColorConceptBench advances the probabilistic evaluation paradigm relative to benchmarks such as ColorBench (Liang et al., 10 Apr 2025), which targets color perception, reasoning, and robustness in vision-LLMs. While ColorBench demonstrates scaling effects with only minor performance gaps (best accuracy ≈57.8%, robustness ≤84.6%) and finds larger benefit from scaling LLMs than vision encoders, it centers on explicit color cues, extraction, and reasoning.

COLORBENCH (Yin et al., 2024) offers an object-focused real-image standard for evaluating color editing accuracy, relying on human-generated recolorings and six detailed metrics including SSIM, CLIP-Score, hue error, and LPIPS background consistency. Both ColorBench and COLORBENCH provide critical complementary perspectives on color-centric model capability, but neither addresses abstract probabilistic semantic shifts at the scale or semantic granularity of ColorConceptBench.

Collectively, these tools provide a comprehensive battery for T2I and multimodal AI model evaluation, but only ColorConceptBench directly interrogates the human-like mapping from nuanced concepts to color distributions, revealing a persistent and fundamental semantic deficiency in current generative systems.
