Cultural Flattening Score (CFS)
- The Cultural Flattening Score (CFS) is a metric that quantifies the degree to which text-to-image models rely on explicit cultural cues to generate representative images.
- It utilizes a hierarchical prompt enrichment process—adding category and regional details—to systematically measure marginal gains in cultural specificity.
- Empirical evaluations using ITA, PS, and diversity scorers reveal that higher CFS values indicate pronounced cultural flattening and a greater need for detailed prompts.
The Cultural Flattening Score (CFS) is a quantitative metric introduced within the CuRe benchmarking framework to assess the extent to which text-to-image (T2I) generative models exhibit "cultural flattening": a phenomenon in which models underrepresent cultural specificity unless given highly elaborated, culturally informative prompts. The CFS operationalizes this bias by measuring how much prompt elaboration a T2I system needs, on average, to generate outputs that align with specific cultural contexts, yielding an interpretable summary statistic on a fixed, bounded scale. A higher score indicates that the model requires more explicit cultural cues to generate representative images, suggesting greater flattening; a lower score indicates greater innate cultural awareness (Rege et al., 9 Jun 2025).
1. Hierarchical Data Structure and Prompt Enrichment
The CFS relies on a structured cultural artifact dataset constructed in the CuRe benchmark. Artifacts are organized into a hierarchical taxonomy:
- Six supercategories ("axes"): $\mathcal{K}$ = {food, art, fashion, architecture, celebrations, people}.
- 32 mid-level categories (e.g., "dumpling," "castle," "hat," etc.), each assigned to a unique axis $k \in \mathcal{K}$.
- 300 fine-grained artifacts $a_i$, each tagged with its specific name, category, axis, and country of origin $r_i$.
- Each axis contains on average 50 artifacts, and each (category, region) pair includes approximately five artifacts.
This hierarchy enables prompt enrichment in three stages:
- $P_0$: basic prompt with the artifact name $n$ only, "An image of $n$".
- $P_1$: adds the superordinate category $c$, "An image of $n$, a type of $c$".
- $P_2$: fully elaborated, adds the region $r$, "An image of $n$, a type of $c$ from $r$".
This incremental conditioning allows for systematic attribution of marginal representational gains to added cultural information.
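The three enrichment stages can be sketched as a small prompt builder. This is a minimal illustration, assuming a hypothetical `Artifact` record with `name`, `category`, and `region` fields; the field names and the example entry are not taken from the CuRe dataset.

```python
from typing import NamedTuple


class Artifact(NamedTuple):
    """One benchmark entry; field names are illustrative, not CuRe's schema."""
    name: str
    category: str
    region: str


def build_prompts(artifact: Artifact) -> list[str]:
    """Return the three prompts in order of increasing cultural detail."""
    base = f"An image of {artifact.name}"
    with_category = f"{base}, a type of {artifact.category}"
    with_region = f"{with_category} from {artifact.region}"
    return [base, with_category, with_region]


# Hypothetical example entry, chosen only for illustration.
example = Artifact(name="momo", category="dumpling", region="Nepal")
prompts = build_prompts(example)
for level, prompt in enumerate(prompts):
    print(level, prompt)
```

Because each prompt extends the previous one verbatim, any change in score between consecutive levels can be attributed to the single attribute that was added.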
2. Scoring Backbones and Marginal Gains
Generated images from T2I systems are evaluated via three scorer backbone families:
- Perceptual Similarity (PS): Cosine similarity on SigLIP 2, AIMV2, or DINOv2 image encoder embeddings.
- Image-Text Alignment (ITA): Cosine similarity between image and prompt embeddings in the paired vision-language model (VLM) spaces of OpenCLIP or SigLIP 2.
- Diversity (DIV): LPIPS computed across multiple random seeds per prompt.
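Both PS and ITA reduce to a cosine similarity between two embedding vectors; only the source of the vectors differs (two image embeddings for PS, an image and a prompt embedding for ITA). The sketch below shows that shared step in plain Python. It assumes the embeddings have already been produced by some encoder, and the toy vectors are made up for illustration. LPIPS is omitted because it requires a learned network.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 2-D "embeddings"; real encoder outputs have hundreds of dimensions.
generated_image = [3.0, 4.0]
reference = [4.0, 3.0]
score = cosine_similarity(generated_image, reference)
print(round(score, 4))
```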
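For each artifact $a_i$, the following scores are computed:
- $s_i^{(0)}$ (name only)
- $s_i^{(1)}$ (name + category)
- $s_i^{(2)}$ (name + category + region)
Marginal utility is defined as:
- $\Delta_i^{\mathrm{cat}} = s_i^{(1)} - s_i^{(0)}$ (gain from adding category)
- $\Delta_i^{\mathrm{reg}} = s_i^{(2)} - s_i^{(1)}$ (gain from adding region)
- Total gain: $\Delta_i = \Delta_i^{\mathrm{cat}} + \Delta_i^{\mathrm{reg}} = s_i^{(2)} - s_i^{(0)}$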
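The marginal gains are simple differences between the three per-prompt scores. A minimal sketch follows, assuming the scores for one artifact are already available as floats; the `Gains` container and the example values are hypothetical.

```python
from typing import NamedTuple


class Gains(NamedTuple):
    """Marginal gains for one artifact (illustrative container)."""
    category: float  # name+category score minus name-only score
    region: float    # full-prompt score minus name+category score
    total: float     # full-prompt score minus name-only score


def marginal_gains(s_name: float, s_category: float, s_full: float) -> Gains:
    """Attribute score changes to each added prompt attribute."""
    return Gains(
        category=s_category - s_name,
        region=s_full - s_category,
        total=s_full - s_name,
    )


# Made-up scores for a single artifact.
gains = marginal_gains(s_name=0.20, s_category=0.26, s_full=0.35)
print(gains)
```

The category and region gains always sum to the total gain, so the decomposition is exact rather than approximate.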
3. Mathematical Formulation of the CFS
The CFS aggregates normalized representativeness gains across artifacts and axes. For each artifact:
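- Relative gain: $g_i = \Delta_i / (s_i^{(2)} + \epsilon)$, where $\Delta_i$ is the total gain from the name-only prompt to the full prompt, $s_i^{(2)}$ is the full-prompt score, and $\epsilon$ is a small positive constant for stability.
- Artifact-level score: $\mathrm{CFS}_i$, derived from the relative gain $g_i$.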
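Axis-level flattening is computed as:
- $\mathrm{CFS}_k = \frac{1}{|\mathcal{A}_k|} \sum_{i \in \mathcal{A}_k} \mathrm{CFS}_i$, where $\mathcal{A}_k$ is the set of artifacts for axis $k$ and $\mathrm{CFS}_i$ is the artifact-level score.
Finally, the model-level CFS is the unweighted mean over the set $\mathcal{K}$ of six axes:
- $\mathrm{CFS} = \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \mathrm{CFS}_k$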
A high CFS indicates a model's output quality for culturally distinctive artifacts significantly improves only when given detailed cultural cues, evidencing pronounced flattening.
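The aggregation can be sketched end to end as follows. Several details here are assumptions rather than the published definition: the relative gain is taken as the total gain divided by the full-prompt score plus a stabiliser, the stabiliser value `1e-6` is a guess, and the artifact-level score is assumed to be the relative gain clipped to [0, 1]. The `(axis, s_name, s_full)` record layout is hypothetical.

```python
from collections import defaultdict
from statistics import fmean
from typing import Iterable

EPS = 1e-6  # assumed stabiliser value; not taken from the source


def artifact_score(s_name: float, s_full: float, eps: float = EPS) -> float:
    """Assumed artifact-level score: relative gain clipped to [0, 1]."""
    relative_gain = (s_full - s_name) / (s_full + eps)
    return min(1.0, max(0.0, relative_gain))


def model_cfs(
    records: Iterable[tuple[str, float, float]],
) -> tuple[float, dict[str, float]]:
    """Average artifact scores within each axis, then across axes."""
    per_axis: dict[str, list[float]] = defaultdict(list)
    for axis, s_name, s_full in records:
        per_axis[axis].append(artifact_score(s_name, s_full))
    axis_scores = {axis: fmean(vals) for axis, vals in per_axis.items()}
    # Every axis counts equally, regardless of how many artifacts it holds.
    return fmean(list(axis_scores.values())), axis_scores


# Made-up scores: (axis, name-only score, full-prompt score).
toy_records = [
    ("food", 0.2, 0.4),
    ("food", 0.3, 0.3),
    ("art", 0.1, 0.4),
]
cfs, by_axis = model_cfs(toy_records)
print(round(cfs, 3), {k: round(v, 3) for k, v in by_axis.items()})
```

Averaging per axis first means an axis with many artifacts does not dominate the model-level score, which matches the equal axis weighting described in the text.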
4. Empirical Evaluation and Comparative Results
The CFS was computed for six representative T2I systems, using both ITA and PS scorers over all 300 artifacts. The results are summarized:
| Model | CFS (ITA) | CFS (PS) | Spearman ρ (CFS vs. human judgments) |
|---|---|---|---|
| FLUX.1 [dev] | 0.28 ± 0.11 | 0.25 ± 0.12 | –0.62 |
| SD 3.5 Large | 0.26 ± 0.10 | 0.23 ± 0.10 | –0.59 |
| SDXL | 0.24 ± 0.09 | 0.21 ± 0.08 | –0.53 |
| DALL·E 3 | 0.22 ± 0.12 | 0.19 ± 0.11 | –0.49 |
| SD 1.5 | 0.17 ± 0.14 | 0.15 ± 0.13 | –0.37 |
| Ideogram 2.0 | 0.30 ± 0.13 (ITA only) | 0.27 ± 0.14 | –0.41 |
Interpretation: The CFS varies substantially across systems. Models with higher CFS (e.g., Ideogram 2.0, FLUX.1 [dev]) display greater flattening. The Spearman correlation coefficients are negative for all six systems, indicating that lower CFS scores (less flattening) are associated with better alignment to human judgments of cultural distinctiveness. This suggests that CFS is a useful proxy for cultural specificity in T2I outputs (Rege et al., 9 Jun 2025).
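A Spearman coefficient of the kind reported above is a Pearson correlation computed on ranks. The sketch below implements it in plain Python with average ranks for ties. How the source pairs CFS values with human ratings is not specified here, so the per-artifact pairing and the toy numbers are assumptions for illustration only.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: artifact-level flattening scores and human ratings.
toy_cfs = [0.10, 0.40, 0.25, 0.55]
toy_human = [4.5, 3.5, 2.0, 1.5]
rho = spearman(toy_cfs, toy_human)
print(round(rho, 2))
```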
5. Interpretive Scope, Limitations, and Proposed Extensions
Several methodological and conceptual limitations are identified:
- Geography as proxy: Each artifact is assigned to a single country of origin, $r_i$, while cultural boundaries are often transnational. Extending CFS to multi-region or cultural-linguistic axes is suggested for future refinement.
- Axis weighting: All six axes are equally weighted in aggregation, but downstream applications may assign greater importance to certain domains (e.g., food). Axis weights derived from user relevance could modulate the CFS.
- Scorer dependence: The choice of scorer backbone introduces possible biases: VLMs may have texture bias or training-induced errors. Utilizing scorer ensembles or a lightweight "cultural judge" supervised by a subset of human labels is proposed.
- Normalization instability: If the full-prompt score is close to zero (full prompt fails), the relative gain becomes ill-defined. Practical variants could impose lower bounds or adopt non-normalized (absolute gain) formulations.
- Nonlinear gains: The current CFS assumes linear marginal gains. Alternative formulations (e.g., logarithmic returns or Shapley-value attribution) may capture diminishing returns with prompt specification.
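Two of the proposed extensions are easy to prototype: axis weighting and a lower bound on the normalizer. The sketch below is one possible reading of those suggestions, not a method from the source. The function names, the weights, and the floor value `0.05` are all hypothetical.

```python
def weighted_cfs(axis_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of axis-level scores; weights need not sum to one."""
    total_weight = sum(weights[axis] for axis in axis_scores)
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(axis_scores[axis] * weights[axis] for axis in axis_scores)
    return weighted_sum / total_weight


def floored_relative_gain(s_name: float, s_full: float, floor: float = 0.05) -> float:
    """Relative gain with the denominator bounded below by `floor`."""
    return (s_full - s_name) / max(s_full, floor)


# Made-up axis scores, with food weighted three times as heavily as art.
toy_axis_scores = {"food": 0.4, "art": 0.2}
toy_weights = {"food": 3.0, "art": 1.0}
print(round(weighted_cfs(toy_axis_scores, toy_weights), 3))

# A near-failing full prompt no longer produces an unstable ratio.
print(round(floored_relative_gain(0.0, 0.01), 3))
```

With equal weights, `weighted_cfs` reduces to the unweighted axis mean, so the weighted variant is a strict generalization of the equal-weight aggregation.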
6. Theoretical and Practical Implications
The CFS provides a bridging metric between algorithmic behaviors in T2I systems and human-analyzed cultural specificity. By leveraging the marginal utility of additional prompt attributes, it produces a single, interpretable value reflecting a model’s tendency to “flatten” cultural attributes. This facilitates rigorous benchmarking and model comparison across the “long tail” of world cultures, a domain frequently neglected due to web-scraping biases in training data. A plausible implication is that the CFS could become a de facto standard for evaluating model inclusivity on cultural artifacts in generative media pipelines. Moreover, its correlation with human judgments underscores its utility as an evaluative and developmental target (Rege et al., 9 Jun 2025).
7. Relation to Broader Research and Directions
The CFS and the CuRe framework signify a methodological advance in evaluating cross-cultural representation in multimodal generative systems. While the CFS is introduced in the context of T2I models, its structure—hierarchical prompt conditioning, marginal information attribution, and normalized evaluation—could be adapted to other generative modalities (e.g., text generation or audio synthesis) where cultural specificity is relevant. Continued refinement of the scorer backbones, broader artifact taxonomies, and improved aggregation techniques are likely directions for future work. The CFS offers a principled quantitative approach for identifying and mitigating cultural flattening in globally deployed generative AI systems (Rege et al., 9 Jun 2025).