ColorConceptBench: Probabilistic Color Semantics
- ColorConceptBench is a large-scale benchmark that evaluates text-to-image models on their ability to map abstract color semantics to probabilistic color distributions.
- It leverages a curated dataset of 1,281 implicit color concepts with rigorous annotation and quality controls, including pixel-level color quantization in CIELAB space.
- Results reveal that current T2I architectures struggle with nuanced color shifts, highlighting the need for new training paradigms that mimic human-like color understanding.
ColorConceptBench is a large-scale, human-verified benchmark designed to evaluate how text-to-image (T2I) models ground implicit color semantics via probabilistic color-concept associations. Unlike previous benchmarks emphasizing explicit color names or codes, ColorConceptBench probes abstract modifiers and nuanced conceptual shifts by leveraging a richly annotated dataset and rigorously defined distributional metrics. The benchmark reveals fundamental shortcomings in current T2I architectures, underscoring the necessity for new training paradigms that incorporate human-like, probabilistic understanding of color-related abstract meaning (Ruan et al., 23 Jan 2026).
1. Dataset Construction and Annotation Principles
ColorConceptBench comprises 1,281 implicit color concepts—each formed by combining base nouns (e.g., “apple,” “lake,” “cabin,” taken from the THINGS dataset and filtered by COCA word frequency) with two classes of non-explicit modifiers:
- Visual States (e.g., fresh, rotten, polluted)
- Emotional Moods (e.g., cozy, lonely, oppressive)
Concept selection involved iterative expert curation and designer feedback, resulting in a lexicon discretized into eight object categories. For each concept, one clean sketch was selected (from five candidates generated with SD 3.5 and a second T2I model) for further annotation. Five professional designers independently colorized each sketch in accordance with semantic intuition, with written instructions and controlled practice trials to maintain annotation consistency.
A rigorous three-stage quality-control protocol was enforced:
- Expert review on inter-round disagreement
- Pairwise inter-annotator agreement via averaged Earth Mover’s Distance (EMD) over all annotator pairs, $A(c) = \frac{2}{K(K-1)} \sum_{i<j} \mathrm{EMD}\big(P_i^c, P_j^c\big)$, for the $K = 5$ colorizations of concept $c$
- Blind expert verification for high-variance samples
Segmentation of concept regions utilized DINO and SAM. Pixel-level color quantization was performed in CIELAB space with adaptive clustering by CIE $\Delta E$ distances (merging colors whose pairwise $\Delta E$ fell below a threshold). The probabilistic human color distribution for each concept $c$ is operationalized as:

$$P_h^c(k) = \frac{n_k^c}{\sum_j n_j^c},$$

where $n_k^c$ is the total count of assigned pixels in bin $k$ for concept $c$.
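The quantize-and-normalize step can be sketched as follows. This is a simplified illustration assuming a fixed set of bin centroids and plain Euclidean (CIE76-style) distance in CIELAB; the benchmark itself derives bins by adaptive clustering with a $\Delta E$ merge step.

```python
# Sketch of pixel-level quantization: assign each CIELAB pixel of a
# segmented concept region to its nearest color bin, then normalize
# counts into a probability distribution P(k) = n_k / sum_j n_j.
# The uniform fixed-centroid palette is an assumption for illustration.
import numpy as np

def color_distribution(lab_pixels: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """lab_pixels: (N, 3) CIELAB values; centroids: (K, 3) bin centers.
    Returns P of shape (K,) with P[k] = n_k / sum_j n_j."""
    # Euclidean distance in CIELAB approximates CIE Delta-E 1976.
    d = np.linalg.norm(lab_pixels[:, None, :] - centroids[None, :, :], axis=-1)
    assign = d.argmin(axis=1)  # nearest bin per pixel
    counts = np.bincount(assign, minlength=len(centroids)).astype(float)
    return counts / counts.sum()

# Toy example: 3 bins, 4 pixels.
centroids = np.array([[50.0, 60.0, 40.0],   # reddish
                      [60.0, -40.0, 50.0],  # greenish
                      [90.0, 0.0, 0.0]])    # near-white
pixels = np.array([[52.0, 58.0, 38.0], [49.0, 61.0, 42.0],
                   [61.0, -39.0, 49.0], [88.0, 1.0, -1.0]])
P = color_distribution(pixels, centroids)  # -> [0.5, 0.25, 0.25]
```

The same routine, applied to model-generated images after identical segmentation, yields the matched model distributions.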
2. Probabilistic Formulation of Color-Concept Semantic Distributions
Each concept $c$ is encoded by a discrete probability vector over $K$ color bins:

$$\mathbf{P}_h^c = \big(P_h^c(1), \ldots, P_h^c(K)\big), \qquad \sum_{k=1}^{K} P_h^c(k) = 1.$$

Generated model images for concept $c$ are processed identically (segmentation, quantization, binning) to extract matched model distributions:

$$P_m^c(k) = \frac{\hat{n}_k^c}{\sum_j \hat{n}_j^c}.$$
This probabilistic framework enables nuanced comparison between the diversity and central tendency of human and model outputs, extending color-semantic evaluation beyond deterministic assignments.
3. Metrics for Distributional Alignment and Feature Fidelity
ColorConceptBench introduces a set of probabilistic and deterministic metrics for quantifying the alignment between human and model color-concept associations:
- Pearson Correlation Coefficient (PCC):

$$\mathrm{PCC}(c) = \frac{\sum_k \big(P_h^c(k)-\bar{P}_h^c\big)\big(P_m^c(k)-\bar{P}_m^c\big)}{\sqrt{\sum_k \big(P_h^c(k)-\bar{P}_h^c\big)^2}\,\sqrt{\sum_k \big(P_m^c(k)-\bar{P}_m^c\big)^2}}$$

- Earth Mover’s Distance (EMD, in CIELAB space):

$$\mathrm{EMD}\big(P_h^c, P_m^c\big) = \min_{f_{ij} \ge 0} \sum_{i,j} f_{ij}\, d_{ij}$$

Subject to flow constraints $\sum_j f_{ij} = P_h^c(i)$ and $\sum_i f_{ij} = P_m^c(j)$, with $d_{ij}$ as the binwise Euclidean distance between bin centroids.
- Entropy Difference (ED):

$$\mathrm{ED}(c) = \big|H(P_h^c) - H(P_m^c)\big|, \qquad H(P) = -\sum_k P(k) \log P(k)$$

- (Optional) Kullback–Leibler Divergence:

$$D_{\mathrm{KL}}\big(P_h^c \,\|\, P_m^c\big) = \sum_k P_h^c(k) \log \frac{P_h^c(k)}{P_m^c(k) + \varepsilon}$$

(with smoothing $\varepsilon$ for zero values).
- Dominant Color Accuracy (DCA), the indicator that the model’s dominant bin matches the human dominant bin, averaged over concepts:

$$\mathrm{DCA}(c) = \mathbb{1}\big[\arg\max_k P_m^c(k) = \arg\max_k P_h^c(k)\big]$$

- Hue Angular Difference (Hue):

$$\Delta\theta(c) = \min\big(|\theta_h - \theta_m|,\; 360^\circ - |\theta_h - \theta_m|\big)$$

where $\theta_h$, $\theta_m$ are mean hue angles for human and model dominant bins.
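These metrics can be sketched in a few lines each. The EMD is solved exactly as a small linear program over binwise CIELAB distances; the bin centroids, epsilon values, and toy inputs below are illustrative assumptions, not the benchmark's exact configuration.

```python
# Sketch implementations of the distributional metrics, assuming human
# and model distributions are numpy vectors over K shared CIELAB bins
# with known centroids.
import numpy as np
from scipy.optimize import linprog

def pcc(p: np.ndarray, q: np.ndarray) -> float:
    """Pearson correlation between two binwise distributions."""
    return float(np.corrcoef(p, q)[0, 1])

def entropy_diff(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """ED = |H(p) - H(q)| with natural-log Shannon entropy."""
    H = lambda r: float(-np.sum(r * np.log(r + eps)))
    return abs(H(p) - H(q))

def kl_div(p: np.ndarray, q: np.ndarray, eps: float = 1e-8) -> float:
    """Smoothed KL(p || q); eps guards empty bins."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def emd(p: np.ndarray, q: np.ndarray, centroids: np.ndarray) -> float:
    """Exact EMD: min-cost flow over binwise CIELAB distances."""
    K = len(p)
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    A = np.zeros((2 * K, K * K))
    for i in range(K):
        A[i, i * K:(i + 1) * K] = 1.0  # sum_j f_ij = p_i (outflow from bin i)
        A[K + i, i::K] = 1.0           # sum_i f_ij = q_j (inflow to bin j)
    res = linprog(d.ravel(), A_eq=A, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return float(res.fun)

def hue_diff(theta_h: float, theta_m: float) -> float:
    """Angular distance between dominant-bin hue angles, in degrees."""
    d = abs(theta_h - theta_m) % 360.0
    return min(d, 360.0 - d)

# Toy check: moving all mass across 10 Delta-E units costs EMD = 10.
p = np.array([1.0, 0.0]); q = np.array([0.0, 1.0])
cents = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
print(emd(p, q, cents))       # -> 10.0
print(hue_diff(350.0, 10.0))  # -> 20.0
```

The wrap-around handling in `hue_diff` matters: hue angles 350° and 10° are only 20° apart on the color wheel, not 340°.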
4. Model Evaluation Protocols
Seven open-source T2I models were evaluated:
- Stable Diffusion 3 (2B)
- Stable Diffusion 3.5 (2.5B)
- Stable Diffusion XL (2.6B)
- SANA-1.5 (4.8B)
- OmniGen2 (4.8B)
- Flux.1-dev (12B)
- Qwen-Image (20B)
For each concept, two visual styles (Natural photo, Clipart) and seven classifier-free guidance (CFG) scales were tested, with five samples per (concept, style, CFG scale) combination at a fixed resolution. Sampling followed standard classifier-free guidance, combining conditional and unconditional noise predictions:

$$\tilde{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \varnothing) + w\,\big(\epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing)\big)$$
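The guidance step can be sketched numerically: the unconditional noise prediction is extrapolated toward the conditional one by the guidance scale $w$. The toy arrays below stand in for a real denoiser's outputs.

```python
# Classifier-free guidance (CFG) combination of denoiser outputs.
# eps_uncond / eps_cond are placeholders for a real U-Net/DiT prediction.
import numpy as np

def cfg_noise(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """epsilon_tilde = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.zeros(4)  # unconditional prediction (placeholder)
eps_c = np.ones(4)   # text-conditional prediction (placeholder)
guided = cfg_noise(eps_u, eps_c, 7.5)  # w > 1 amplifies the conditional direction
```

With $w = 1$ the output reduces to the pure conditional prediction; larger $w$ pushes samples further toward the prompt, which is why the benchmark sweeps a range of scales.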
All images were segmented and quantized to obtain $P_m^c$ in direct correspondence with the human colorizations.
5. Benchmark Results and Critical Findings
Quantitative alignment between model and human distributions yielded the following ranges:
- PCC: 0.62–0.74
- EMD: 26–36
- ED: 0.48–0.64
Clipart consistently outperformed Natural style; pure-noun prompts led to better model–human alignment than visual-state or emotional modifiers. The highest overall score was observed for SANA-1.5 (Clipart: PCC ≈ 0.743, EMD ≈ 26.55, ED ≈ 0.477).
Dominant color accuracy remains low (0.07–0.19), with Hue discrepancies of 27°–42°, underscoring model difficulty in pinpointing peak color. A 2,700-vote human-model alignment study indicated EMD is best correlated with human judgments (Pearson ≈ 0.53, agreement 62%).
Qualitative analysis identified three typical model behaviors under adjective modification:
- Semantic inertia: Negligible color shift.
- Over-correction: Shift removes object identity.
- Precise adaptation: Appropriately balanced color modulation.
Model sensitivity to modifiers fell below the human baseline: EMD shift intensity reached only ≈78% (Visual State) and ≈69% (Emotional Mood) of the human level. Increasing the CFG scale did not improve color alignment; higher CFG generally increased EMD. Larger models (Qwen-Image, 20B) did not show uniform superiority over smaller models (SD XL, 2.6B).
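One plausible reading of the shift-intensity comparison is to measure how far a modifier moves the color distribution away from the pure-noun baseline, for model and humans separately, and take the ratio. The total-variation distance below is a cheap stand-in for the paper's CIELAB EMD, and the numbers are illustrative, not taken from the benchmark.

```python
# Illustrative modifier-sensitivity ratio: distance between base-noun
# and modified distributions, model shift divided by human shift.
# Total variation is an assumed proxy for the paper's EMD metric.
import numpy as np

def shift(p_base: np.ndarray, p_mod: np.ndarray) -> float:
    """Total variation distance between two binwise distributions."""
    return 0.5 * float(np.abs(p_base - p_mod).sum())

# Hypothetical distributions: humans react strongly to the modifier,
# the model reacts in the same direction but less decisively.
human_base, human_mod = np.array([0.7, 0.2, 0.1]), np.array([0.2, 0.3, 0.5])
model_base, model_mod = np.array([0.7, 0.2, 0.1]), np.array([0.35, 0.25, 0.4])
ratio = shift(model_base, model_mod) / shift(human_base, human_mod)
# A ratio below 1 means the model under-reacts to the modifier
# relative to humans, mirroring the "semantic inertia" failure mode.
```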
6. Implications, Limitations, and Recommended Directions
ColorConceptBench elucidates that current T2I models exhibit robust concrete object–color priors (e.g., “apple→red”) yet lack sensitivity for abstract or implicit modifiers. Parameter scaling and intensified guidance have no substantive benefit for closing this gap; implicit color semantics emerge as a fundamentally neglected axis in standard training regimes.
The authors advocate for a paradigm shift: model architectures and pretraining data must incorporate probabilistic, human-grounded color–concept distributions rather than point color labels. Possible extensions include embedding semantic-aware color–concept representation layers and collecting cross-cultural annotations to quantify symbolic variation.
A plausible implication is that integration of specialized models—such as ellipsoidal partitioning frameworks for color term categorization (Akbarinia et al., 2017) or integration of robust real-image editing benchmarks like COLORBENCH (Yin et al., 2024)—could enrich future evaluation pipelines by providing explicit ground truth for complex or cross-modal color associations.
7. Comparative Benchmarks and Research Context
ColorConceptBench advances the probabilistic evaluation paradigm relative to benchmarks such as ColorBench (Liang et al., 10 Apr 2025), which targets color perception, reasoning, and robustness in vision-LLMs. While ColorBench reports only modest gains from scaling (best accuracy ≈57.8%, robustness ≤84.6%) and finds that larger LLMs help more than larger vision encoders, it centers on explicit color cues, extraction, and reasoning.
COLORBENCH (Yin et al., 2024) offers an object-focused real-image standard for evaluating color editing accuracy, relying on human-generated recolorings and six detailed metrics including SSIM, CLIP-Score, hue error, and LPIPS background consistency. Both ColorBench and COLORBENCH provide critical complementary perspectives on color-centric model capability, but neither addresses abstract probabilistic semantic shifts at the scale or semantic granularity of ColorConceptBench.
Collectively, these tools provide a comprehensive battery for T2I and multimodal AI model evaluation, but only ColorConceptBench directly interrogates the human-like mapping from nuanced concepts to color distributions, revealing a persistent and fundamental semantic deficiency in current generative systems.