Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations

Published 3 Apr 2026 in cs.CL | (2604.03493v1)

Abstract: Cultural representation in LLM outputs has primarily been evaluated through the proxies of cultural diversity and factual accuracy. However, a crucial gap remains in assessing cultural alignment: the degree to which generated content mirrors how native populations perceive and prioritize their own cultural facets. In this paper, we introduce a human-centered framework to evaluate the alignment of LLM generations with local expectations. First, we establish a human-derived ground-truth baseline of importance vectors, called Cultural Importance Vectors, based on an induced set of culturally significant facets from open-ended survey responses collected across nine countries. Next, we introduce a method to compute model-derived Cultural Representation Vectors of an LLM based on a syntactically diversified prompt-set and apply it to three frontier LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku). Our investigation of the alignment between the human-derived Cultural Importance and model-derived Cultural Representations reveals a Western-centric calibration for some of the models where alignment decreases as a country's cultural distance from the US increases. Furthermore, we identify highly correlated, systemic error signatures ($\rho > 0.97$) across all models, which over-index on some cultural markers while neglecting the deep-seated social and value-based priorities of users. Our approach moves beyond simple diversity metrics toward evaluating the fidelity of AI-generated content in authentically capturing the nuanced hierarchies of global cultures.


Authors (5)

Summary

The paper introduces a human-centered evaluation framework comparing LLM Cultural Representation Vectors with native Cultural Importance Vectors using Pearson correlation, cosine similarity, and MSE.
The evaluated LLMs show systemic misrepresentation, and some show Western-centric bias, with facet-level errors diluting locally salient cultural priorities across diverse regions.
Consistent error patterns across models point to shared training biases, highlighting the need for culturally adaptive pretraining and integration of real-time native feedback.

Cultural Authenticity Evaluation of LLMs: Alignment with Native Human Expectations

Introduction

This work addresses the critical problem of cultural representation in frontier LLMs, specifically probing their ability to authentically mirror the priorities of native populations across diverse national contexts. Moving beyond prevailing proxies such as cultural diversity or factual knowledge, the paper establishes a human-centered evaluation framework for assessing cultural authenticity. It introduces two key constructs: human-derived Cultural Importance Vectors and model-derived Cultural Representation Vectors. Using open-ended survey data from nine countries, the authors elucidate the extent to which Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku reproduce the relative importance assigned by natives to eleven salient cultural facets.

Methodology

Establishing Human Ground-Truth: Cultural Importance Vectors

The first phase leverages open-ended survey responses collected across nine countries, operationalizing "cultural importance" as the distribution over eleven discretely defined cultural facets (e.g., Architecture, Cuisine, Social Practices/Customs, Values/Norms/Beliefs/Morality [VNBM]). A high-precision Gemini 2.5-based classification pipeline maps free-text responses to this taxonomy, yielding per-country normalized importance vectors.
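As a rough illustration of the normalization step, the sketch below turns a list of facet labels into a per-country importance vector. It assumes each classified mention carries exactly one facet label and that importance is the simple share of mentions; the paper's actual pipeline may weight or aggregate differently. The four-facet `FACETS` list, the `importance_vector` helper, and the sample labels are hypothetical stand-ins (the real taxonomy has eleven facets).

```python
from collections import Counter

# Hypothetical subset of the taxonomy; the paper defines eleven facets.
FACETS = ["Architecture", "Cuisine", "Social Practices/Customs", "VNBM"]


def importance_vector(labels, facets=FACETS):
    """Turn facet labels into a normalized distribution over `facets`.

    Assumption: one label per classified mention, importance = share of mentions.
    Labels outside the taxonomy are ignored.
    """
    counts = Counter(label for label in labels if label in facets)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels fall inside the facet taxonomy")
    return [counts[f] / total for f in facets]


# Made-up labels for one country, as a classifier might emit them.
labels = [
    "Cuisine", "Cuisine", "VNBM", "Architecture",
    "Cuisine", "VNBM", "Social Practices/Customs", "VNBM",
]
civ = importance_vector(labels)
print(dict(zip(FACETS, civ)))
```

Because the vector is normalized, countries with different numbers of survey responses remain comparable.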

Sampling and Reduction of LLM Outputs

In the second phase, the three target LLMs are prompted with a set of 85 syntactically diverse templates per country, each sampled at multiple decoding temperatures. An automatic rater (autorater) detects which cultural facets each response mentions, producing empirical distributions, the Cultural Representation Vectors, analogous to the human-derived baselines. This setup enables direct, quantitative comparison between LLM representations and native user expectations.
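One plausible way to reduce rated responses to a single distribution is sketched below. It assumes the rater returns, for each response, the set of facets mentioned, and that a facet counts at most once per response; the exact counting rule used in the paper is not stated here. The `representation_vector` helper and the sample data are hypothetical.

```python
def representation_vector(rated_responses, facets):
    """Aggregate per-response facet sets into a normalized distribution.

    Assumption: each response contributes at most one count per facet,
    and the vector is the share of all facet mentions.
    """
    counts = {f: 0 for f in facets}
    for mentioned in rated_responses:
        for facet in set(mentioned):
            if facet in counts:
                counts[facet] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no facet mentions detected")
    return [counts[f] / total for f in facets]


facets = ["Architecture", "Cuisine", "VNBM"]
# Made-up rater output for four responses about one country.
rated = [
    {"Cuisine", "Architecture"},
    {"Cuisine"},
    {"Cuisine", "VNBM"},
    {"Architecture", "VNBM", "Cuisine"},
]
crv = representation_vector(rated, facets)
print(dict(zip(facets, crv)))
```

In practice the responses from all templates and temperatures for a given model and country would be pooled before this reduction.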

Alignment Evaluation Metrics

Alignment is assessed via:

Pearson correlation ( $\rho$ ): Linear correspondence in facet prioritization.
Cosine similarity ( $S_C$ ): Similarity in the overall shape (direction) of the two facet vectors, independent of their magnitude.
Mean Squared Error (MSE): Penalizes magnitude of misalignment.
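The three metrics can be computed for a single country with a few lines of plain Python. This is a minimal sketch using the textbook definitions; the vectors `human` and `model` are made-up three-facet examples, not values from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def mse(x, y):
    """Mean squared difference across facets."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


human = [0.5, 0.3, 0.2]  # toy Cultural Importance Vector
model = [0.4, 0.3, 0.3]  # toy Cultural Representation Vector
print(pearson(human, model), cosine(human, model), mse(human, model))
```

Since both vectors are non-negative distributions, cosine similarity tends to sit close to 1 even for visibly different profiles, which is one reason to read it alongside correlation and MSE.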

Key Findings

Western-Centric Calibration and Geographic Alignment Disparity

Strong geographic disparities are evident in GPT-4o and Claude 3.5 Haiku, whose alignment with native priorities decreases as cultural distance from the US, measured via the Cultural Fixation Index ($CF_{ST}$), increases. Gemini 2.5 Pro displays a more stable, less US-centric calibration.

Figure 1: Correlation and cosine similarity between LLM cultural representation and ground-truth as a function of cultural distance from the US; GPT and Claude show clear negative alignment trends as distance increases, unlike Gemini.
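A negative trend of this kind can be quantified with a rank correlation between cultural distance and alignment score. The sketch below uses Spearman's coefficient as one reasonable choice; the summary does not say which trend statistic the authors used, and the `distance` and `alignment` values are invented for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values: cultural distance from the US and one model's alignment.
distance = [0.0, 0.05, 0.10, 0.15, 0.20]
alignment = [0.90, 0.85, 0.70, 0.72, 0.50]
trend = spearman(distance, alignment)
print(trend)
```

A value near -1 would indicate that alignment falls almost consistently as distance grows, while a value near 0 would match the more stable behavior described for Gemini.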

Facet-specific MSE reveals that all models tend to misallocate prominence to certain cultural axes (notably Performance and Art, Cuisine, VNBM). Qualitative assessment shows a tendency toward over-saturated, encyclopedic coverage, diluting authentic priority hierarchies observed in human data.

Figure 2: Facet-level MSE quantifying discrepancy between each LLM's output and native importance, aggregated by facet and model.

LLMs routinely underrepresent or flatten the relative magnitude of locally salient facets, producing responses that are broad but insufficiently sensitive to native-defined priorities.

Figure 3: Radar chart comparison of native (GSC) and LLM cultural profiles for select countries, highlighting the tendency of LLMs to output balanced facet distributions rather than mirroring local prioritization.

Highly Correlated Systemic Errors Across Models

Error matrices ( $E^M$ ) constructed for each model (country $\times$ facet) demonstrate near-perfect cross-model correlation ( $\rho > 0.97$ ), indicating that the models misrepresent cultures in nearly the same way. This shared error pattern suggests that misalignment stems primarily from shared training biases and global web-scale data rather than from architecture-specific deficiencies.
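The cross-model comparison can be sketched as follows, assuming the error matrix is the signed difference between a model's representation and the human importance for each country and facet, and that two models are compared by correlating their flattened error matrices. Both choices are assumptions for illustration; the paper may define $E^M$ or the comparison differently. All numbers are toy values for two countries and three facets.

```python
import math


def error_matrix(representation, importance):
    """Signed error per (country, facet): model share minus human share."""
    return [
        [r - h for r, h in zip(rep_row, imp_row)]
        for rep_row, imp_row in zip(representation, importance)
    ]


def flatten(matrix):
    return [value for row in matrix for value in row]


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data: rows are countries, columns are facets.
importance = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
model_a = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]]
model_b = [[0.35, 0.35, 0.3], [0.3, 0.35, 0.35]]

err_a = error_matrix(model_a, importance)
err_b = error_matrix(model_b, importance)
rho = pearson(flatten(err_a), flatten(err_b))
print(rho)
```

A high value of `rho` means the two models over- and under-represent the same facets in the same countries, which is the pattern the section describes.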

Figure 4: Inter-model error consistency; left shows correlation by country, right by facet. Consistency across models is uniformly high, especially at the country level.

Implications and Theoretical Context

The findings substantiate a touristic gaze effect in LLMs—models saturate responses with externally legible, "front stage" cultural markers while neglecting the nuanced, internally prioritized elements expressed by natives. This overgeneralization aligns with prior sociological and anthropological constructs (e.g., Urry’s "Tourist Gaze", Appadurai’s "Mediascapes"), underscoring the risk of algorithmic reinforcement of homogenized, outsider-centric cultural narratives.

Practically, this represents a significant limitation for LLM deployment in globalized, culturally sensitive contexts—LLMs may provide informationally rich yet inauthentic responses, perpetuating Western-centric priorities and potentially occluding local representational agency. The systemic and model-agnostic nature of these errors highlights the need for interventions that transcend architecture: diversification of pretraining corpora, explicit coupling with human importance signals, and dynamic adaptation to local user feedback.

Future Directions

Future model evaluation frameworks must shift toward user-priority-aware alignment, departing from surface-level coverage metrics. Key development angles include:

Incorporation of interactive or reinforcement learning grounded in native user feedback.
Stratified pretraining or fine-tuning on corpus slices curated for authentic, local salience.
Real-time recalibration or re-ranking of outputs based on user context and evolving human baselines.

The approach outlined in this work offers a scalable blueprint for evaluating and ultimately enhancing the authenticity of LLM cultural representations, moving the field closer to truly user-aligned generative AI.

Conclusion

This paper delivers a rigorous, dual-study framework for quantifying LLM cultural authenticity at the intersection of anthropology, NLP, and evaluation science (2604.03493). By grounding comparative analysis in the authentic priorities of native populations, the work exposes systemic, Western-centric misalignment in current frontier models, with error patterns consistent across model families and countries. The results emphasize the need for human-centered, priority-sensitive evaluative paradigms as a prerequisite for equitable and authentic representation in global language technologies.
