ColorConceptBench: Probabilistic Color Semantics
- ColorConceptBench is a large-scale benchmark that evaluates text-to-image models on their ability to map abstract color semantics to probabilistic color distributions.
- It leverages a curated dataset of 1,281 implicit color concepts with rigorous annotation and quality controls, including pixel-level color quantization in CIELAB space.
- Results reveal that current T2I architectures struggle with nuanced color shifts, highlighting the need for new training paradigms that mimic human-like color understanding.
ColorConceptBench is a large-scale, human-verified benchmark designed to evaluate how text-to-image (T2I) models ground implicit color semantics via probabilistic color-concept associations. Unlike previous benchmarks emphasizing explicit color names or codes, ColorConceptBench probes abstract modifiers and nuanced conceptual shifts by leveraging a richly annotated dataset and rigorously defined distributional metrics. The benchmark reveals fundamental shortcomings in current T2I architectures, underscoring the necessity for new training paradigms that incorporate human-like, probabilistic understanding of color-related abstract meaning (Ruan et al., 23 Jan 2026).
1. Dataset Construction and Annotation Principles
ColorConceptBench comprises 1,281 implicit color concepts—each formed by combining base nouns (e.g., “apple,” “lake,” “cabin,” taken from the THINGS dataset and filtered by COCA word frequency) with two classes of non-explicit modifiers:
- Visual States (e.g., fresh, rotten, polluted)
- Emotional Moods (e.g., cozy, lonely, oppressive)
Concept selection involved iterative expert curation and designer feedback, resulting in a lexicon discretized into eight object categories. For each concept, one clean sketch was selected (from five candidates generated with SD 3.5 and a second T2I model) for further annotation. Five professional designers independently colorized each sketch in accordance with semantic intuition, with written instructions and controlled practice trials to maintain annotation consistency.
A rigorous three-stage quality-control protocol was enforced:
- Expert review on inter-round disagreement
- Pairwise inter-annotator agreement via averaged Earth Mover’s Distance (EMD) over all annotator pairs, $A(c) = \frac{2}{K(K-1)} \sum_{i<j} \mathrm{EMD}\big(P_i^c, P_j^c\big)$, for the $K = 5$ colorizations of concept $c$
- Blind expert verification for high-variance samples
Segmentation of concept regions utilized DINO and SAM. Pixel-level color quantization was performed in CIELAB space with adaptive clustering by CIE $\Delta E$ distances (merging colors whose pairwise $\Delta E$ fell below a threshold). The probabilistic human color distribution for each concept $c$ is operationalized as:

$$P_h^c(k) = \frac{n_k^c}{\sum_j n_j^c},$$

where $n_k^c$ is the total count of assigned pixels in bin $k$ for concept $c$.
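The quantize-and-normalize step can be sketched as follows. This is a simplified illustration assuming a fixed set of bin centroids and plain Euclidean (CIE76-style) distance in CIELAB; the benchmark itself derives bins by adaptive clustering with a $\Delta E$ merge step.

```python
# Sketch of pixel-level quantization: assign each CIELAB pixel of a
# segmented concept region to its nearest color bin, then normalize
# counts into a probability distribution P(k) = n_k / sum_j n_j.
# The uniform fixed-centroid palette is an assumption for illustration.
import numpy as np

def color_distribution(lab_pixels: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """lab_pixels: (N, 3) CIELAB values; centroids: (K, 3) bin centers.
    Returns P of shape (K,) with P[k] = n_k / sum_j n_j."""
    # Euclidean distance in CIELAB approximates CIE Delta-E 1976.
    d = np.linalg.norm(lab_pixels[:, None, :] - centroids[None, :, :], axis=-1)
    assign = d.argmin(axis=1)  # nearest bin per pixel
    counts = np.bincount(assign, minlength=len(centroids)).astype(float)
    return counts / counts.sum()

# Toy example: 3 bins, 4 pixels.
centroids = np.array([[50.0, 60.0, 40.0],   # reddish
                      [60.0, -40.0, 50.0],  # greenish
                      [90.0, 0.0, 0.0]])    # near-white
pixels = np.array([[52.0, 58.0, 38.0], [49.0, 61.0, 42.0],
                   [61.0, -39.0, 49.0], [88.0, 1.0, -1.0]])
P = color_distribution(pixels, centroids)  # -> [0.5, 0.25, 0.25]
```

The same routine, applied to model-generated images after identical segmentation, yields the matched model distributions.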
2. Probabilistic Formulation of Color-Concept Semantic Distributions
Each concept $c$ is encoded by a discrete probability vector over $K$ color bins:

$$\mathbf{P}_h^c = \big(P_h^c(1), \ldots, P_h^c(K)\big), \qquad \sum_{k=1}^{K} P_h^c(k) = 1.$$

Generated model images for concept $c$ are processed identically (segmentation, quantization, binning) to extract matched model distributions:

$$P_m^c(k) = \frac{\hat{n}_k^c}{\sum_j \hat{n}_j^c}.$$
This probabilistic framework enables nuanced comparison between the diversity and central tendency of human and model outputs, extending color-semantic evaluation beyond deterministic assignments.
3. Metrics for Distributional Alignment and Feature Fidelity
ColorConceptBench introduces a set of probabilistic and deterministic metrics for quantifying the alignment between human and model color-concept associations:
- Pearson Correlation Coefficient (PCC):

$$\mathrm{PCC}(c) = \frac{\sum_k \big(P_h^c(k)-\bar{P}_h^c\big)\big(P_m^c(k)-\bar{P}_m^c\big)}{\sqrt{\sum_k \big(P_h^c(k)-\bar{P}_h^c\big)^2}\,\sqrt{\sum_k \big(P_m^c(k)-\bar{P}_m^c\big)^2}}$$

- Earth Mover’s Distance (EMD, in CIELAB space):

$$\mathrm{EMD}\big(P_h^c, P_m^c\big) = \min_{f_{ij} \ge 0} \sum_{i,j} f_{ij}\, d_{ij}$$

Subject to flow constraints $\sum_j f_{ij} = P_h^c(i)$ and $\sum_i f_{ij} = P_m^c(j)$, with $d_{ij}$ as the binwise Euclidean distance between bin centroids.
- Entropy Difference (ED):

$$\mathrm{ED}(c) = \big|H(P_h^c) - H(P_m^c)\big|, \qquad H(P) = -\sum_k P(k) \log P(k)$$

- (Optional) Kullback–Leibler Divergence:

$$D_{\mathrm{KL}}\big(P_h^c \,\|\, P_m^c\big) = \sum_k P_h^c(k) \log \frac{P_h^c(k)}{P_m^c(k) + \varepsilon}$$

(with smoothing $\varepsilon$ for zero values).
- Dominant Color Accuracy (DCA), the indicator that the model’s dominant bin matches the human dominant bin, averaged over concepts:

$$\mathrm{DCA}(c) = \mathbb{1}\big[\arg\max_k P_m^c(k) = \arg\max_k P_h^c(k)\big]$$

- Hue Angular Difference (Hue):

$$\Delta\theta(c) = \min\big(|\theta_h - \theta_m|,\; 360^\circ - |\theta_h - \theta_m|\big)$$

where $\theta_h$, $\theta_m$ are mean hue angles for human and model dominant bins.
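These metrics can be sketched in a few lines each. The EMD is solved exactly as a small linear program over binwise CIELAB distances; the bin centroids, epsilon values, and toy inputs below are illustrative assumptions, not the benchmark's exact configuration.

```python
# Sketch implementations of the distributional metrics, assuming human
# and model distributions are numpy vectors over K shared CIELAB bins
# with known centroids.
import numpy as np
from scipy.optimize import linprog

def pcc(p: np.ndarray, q: np.ndarray) -> float:
    """Pearson correlation between two binwise distributions."""
    return float(np.corrcoef(p, q)[0, 1])

def entropy_diff(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """ED = |H(p) - H(q)| with natural-log Shannon entropy."""
    H = lambda r: float(-np.sum(r * np.log(r + eps)))
    return abs(H(p) - H(q))

def kl_div(p: np.ndarray, q: np.ndarray, eps: float = 1e-8) -> float:
    """Smoothed KL(p || q); eps guards empty bins."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def emd(p: np.ndarray, q: np.ndarray, centroids: np.ndarray) -> float:
    """Exact EMD: min-cost flow over binwise CIELAB distances."""
    K = len(p)
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    A = np.zeros((2 * K, K * K))
    for i in range(K):
        A[i, i * K:(i + 1) * K] = 1.0  # sum_j f_ij = p_i (outflow from bin i)
        A[K + i, i::K] = 1.0           # sum_i f_ij = q_j (inflow to bin j)
    res = linprog(d.ravel(), A_eq=A, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return float(res.fun)

def hue_diff(theta_h: float, theta_m: float) -> float:
    """Angular distance between dominant-bin hue angles, in degrees."""
    d = abs(theta_h - theta_m) % 360.0
    return min(d, 360.0 - d)

# Toy check: moving all mass across 10 Delta-E units costs EMD = 10.
p = np.array([1.0, 0.0]); q = np.array([0.0, 1.0])
cents = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
print(emd(p, q, cents))       # -> 10.0
print(hue_diff(350.0, 10.0))  # -> 20.0
```

The wrap-around handling in `hue_diff` matters: hue angles 350° and 10° are only 20° apart on the color wheel, not 340°.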
4. Model Evaluation Protocols
Seven open-source T2I models were evaluated:
- Stable Diffusion 3 (2B)
- Stable Diffusion 3.5 (2.5B)
- Stable Diffusion XL (2.6B)
- SANA-1.5 (4.8B)
- OmniGen2 (4.8B)
- Flux.1-dev (12B)
- Qwen-Image (20B)
For each concept, two visual styles (Natural photo, Clipart) and seven classifier-free guidance (CFG) scales were tested, with five samples per (concept, style, CFG scale) combination at a fixed resolution. Sampling followed standard classifier-free guidance, combining conditional and unconditional noise predictions:

$$\tilde{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \varnothing) + w\,\big(\epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing)\big)$$
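The guidance step can be sketched numerically: the unconditional noise prediction is extrapolated toward the conditional one by the guidance scale $w$. The toy arrays below stand in for a real denoiser's outputs.

```python
# Classifier-free guidance (CFG) combination of denoiser outputs.
# eps_uncond / eps_cond are placeholders for a real U-Net/DiT prediction.
import numpy as np

def cfg_noise(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """epsilon_tilde = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.zeros(4)  # unconditional prediction (placeholder)
eps_c = np.ones(4)   # text-conditional prediction (placeholder)
guided = cfg_noise(eps_u, eps_c, 7.5)  # w > 1 amplifies the conditional direction
```

With $w = 1$ the output reduces to the pure conditional prediction; larger $w$ pushes samples further toward the prompt, which is why the benchmark sweeps a range of scales.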
All images were segmented and quantized to obtain $P_m^c$ in direct correspondence with the human colorizations.
5. Benchmark Results and Critical Findings
Quantitative alignment between model and human distributions yielded the following ranges:
- PCC: 0.62–0.74
- EMD: 26–36
- ED: 0.48–0.64
Clipart consistently outperformed Natural style; pure-noun prompts led to better model–human alignment than visual-state or emotional modifiers. The highest overall score was observed for SANA-1.5 (Clipart: PCC ≈ 0.743, EMD ≈ 26.55, ED ≈ 0.477).
Dominant color accuracy remains low (0.07–0.19), with Hue discrepancies of 27°–42°, underscoring model difficulty in pinpointing peak color. A 2,700-vote human-model alignment study indicated EMD is best correlated with human judgments (Pearson ≈ 0.53, agreement 62%).
Qualitative analysis identified three typical model behaviors under adjective modification:
- Semantic inertia: Negligible color shift.
- Over-correction: Shift removes object identity.
- Precise adaptation: Appropriately balanced color modulation.
Model sensitivity to modifiers fell below the human baseline: EMD shift intensity reached only ≈78% (Visual State) and ≈69% (Emotional Mood) of the human level. Increasing the CFG scale did not improve color alignment; higher CFG generally increased EMD. Larger models (Qwen-Image, 20B) did not show uniform superiority over smaller models (SD XL, 2.6B).
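One plausible reading of the shift-intensity comparison is to measure how far a modifier moves the color distribution away from the pure-noun baseline, for model and humans separately, and take the ratio. The total-variation distance below is a cheap stand-in for the paper's CIELAB EMD, and the numbers are illustrative, not taken from the benchmark.

```python
# Illustrative modifier-sensitivity ratio: distance between base-noun
# and modified distributions, model shift divided by human shift.
# Total variation is an assumed proxy for the paper's EMD metric.
import numpy as np

def shift(p_base: np.ndarray, p_mod: np.ndarray) -> float:
    """Total variation distance between two binwise distributions."""
    return 0.5 * float(np.abs(p_base - p_mod).sum())

# Hypothetical distributions: humans react strongly to the modifier,
# the model reacts in the same direction but less decisively.
human_base, human_mod = np.array([0.7, 0.2, 0.1]), np.array([0.2, 0.3, 0.5])
model_base, model_mod = np.array([0.7, 0.2, 0.1]), np.array([0.35, 0.25, 0.4])
ratio = shift(model_base, model_mod) / shift(human_base, human_mod)
# A ratio below 1 means the model under-reacts to the modifier
# relative to humans, mirroring the "semantic inertia" failure mode.
```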
6. Implications, Limitations, and Recommended Directions
ColorConceptBench elucidates that current T2I models exhibit robust concrete object–color priors (e.g., “apple→red”) yet lack sensitivity for abstract or implicit modifiers. Parameter scaling and intensified guidance have no substantive benefit for closing this gap; implicit color semantics emerge as a fundamentally neglected axis in standard training regimes.
The authors advocate for a paradigm shift: model architectures and pretraining data must incorporate probabilistic, human-grounded color–concept distributions rather than point color labels. Possible extensions include embedding semantic-aware color–concept representation layers and collecting cross-cultural annotations to quantify symbolic variation.
A plausible implication is that integration of specialized models—such as ellipsoidal partitioning frameworks for color term categorization (Akbarinia et al., 2017) or integration of robust real-image editing benchmarks like COLORBENCH (Yin et al., 2024)—could enrich future evaluation pipelines by providing explicit ground truth for complex or cross-modal color associations.
7. Comparative Benchmarks and Research Context
ColorConceptBench advances the probabilistic evaluation paradigm relative to benchmarks such as ColorBench (Liang et al., 10 Apr 2025), which targets color perception, reasoning, and robustness in vision-LLMs. While ColorBench reports only modest gains from scaling (best accuracy ≈57.8%, robustness ≤84.6%) and finds that larger LLMs help more than larger vision encoders, it centers on explicit color cues, extraction, and reasoning.
COLORBENCH (Yin et al., 2024) offers an object-focused real-image standard for evaluating color editing accuracy, relying on human-generated recolorings and six detailed metrics including SSIM, CLIP-Score, hue error, and LPIPS background consistency. Both ColorBench and COLORBENCH provide critical complementary perspectives on color-centric model capability, but neither addresses abstract probabilistic semantic shifts at the scale or semantic granularity of ColorConceptBench.
Collectively, these tools provide a comprehensive battery for T2I and multimodal AI model evaluation, but only ColorConceptBench directly interrogates the human-like mapping from nuanced concepts to color distributions, revealing a persistent and fundamental semantic deficiency in current generative systems.