CultureExplorer: Interactive Culture Cartography

Updated 3 July 2026

CultureExplorer is an interactive system for eliciting and organizing culture-specific knowledge through mixed-initiative human-AI collaboration.
The system uses a tree-structured interface where LLMs generate low-confidence cultural questions and human edits guide branching exploration.
Empirical findings show that its Culture Cartography method produces ‘Google-Proof’ data that enhance fine-tuning and benchmark performance.

CultureExplorer is an interactive system for eliciting and organizing culture-specific knowledge that LLMs do not reliably possess, introduced as the tool implementation of the mixed-initiative methodology called Culture Cartography. In its narrow sense, it is a tree-structured web interface built on Farsight that lets an LLM generate low-confidence cultural questions and lets human respondents edit, branch, and correct them so that the resulting knowledge is simultaneously salient to in-group users and difficult for current models. In a broader research sense, the term also appears as a comparative label for systems that support exploratory access to cultural data across text, images, social traces, digital collections, and recommendation interfaces. Taken together, these lines of work define CultureExplorer as a shift from static cultural lookup or survey-style evaluation toward interactive, structured, and often multimodal exploration of culture (Ziems et al., 31 Oct 2025).

1. Definition and conceptual scope

CultureExplorer was proposed to address a specific failure mode in contemporary LLM development: models may be fluent in broad global patterns while still missing local norms and etiquette, culturally specific rituals, regional variants of common practices, and community-specific meanings and symbols. The underlying difficulty is not merely the presence of “hard” questions, but the identification of questions that are both hard for the model and meaningful for the culture. The Culture Cartography paper formalizes this as a mixed-initiative alternative to two single-initiative paradigms: traditional annotation, in which researchers fix the topic and humans answer passively, and knowledge extraction, in which humans supply knowledge that researchers later structure into a dataset (Ziems et al., 31 Oct 2025).

The conceptual background of CultureExplorer aligns with a wider change in cultural-AI research. XCR-Bench argues that an LLM’s “cultural bias” toward a given culture does not imply genuine competence in that culture, and that evaluation should focus on identifying, interpreting, and adapting Culture-Specific Items (CSIs) in realistic contexts rather than on survey-style multiple-choice questions or isolated substitutions. CultureScope makes the related point that “Language ≠ Culture,” arguing that multilingual data alone does not necessarily yield cultural understanding. These positions jointly reject the misconception that surface fluency or broad factual recall is sufficient for cultural competence (Kabir et al., 20 Jan 2026, Zhang et al., 19 Sep 2025).

A second misconception addressed by the literature is that culture can be reduced to nation-level trivia or visible artifacts alone. XCR-Bench explicitly emphasizes semi-visible and invisible layers such as social etiquette, norms, beliefs, values, and appropriacy, while CultureScope organizes cultural knowledge into Institutional Norms, Behavioral Patterns, and Core Values and Social Structures. This suggests that CultureExplorer is best understood not as a fact-collection tool, but as an interface for surfacing long-tail cultural knowledge that is practical, normative, situated, and often difficult to recover from generic web text (Kabir et al., 20 Jan 2026, Zhang et al., 19 Sep 2025).

2. Mixed-initiative methodology and interaction design

CultureExplorer operationalizes Culture Cartography as a tree-structured web interface built on Farsight. Annotation begins from a seed topic such as gifts, weddings, or funerals. The paper states that these seeds derive indirectly from Brown’s human universals, then are expanded and semantically clustered across eight national cultures to identify user-facing seeds. From such a seed, the LLM generates up to 5 question nodes, and the system explicitly seeks low-confidence outputs rather than confident completions (Ziems et al., 31 Oct 2025).

Confidence is estimated by prompting the same model with the question “Does this answer the question correctly?” The logits are constrained to True/False, and the probability of True is treated as confidence. Answers with confidence below 0.4 are marked uncertain. This design is central: the system does not treat the model as a source of authoritative questions, but as a generator of candidate questions whose uncertainty exposes likely knowledge gaps (Ziems et al., 31 Oct 2025).
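
The confidence step can be illustrated with a minimal Python sketch. It assumes the two constrained logits (for True and False) have already been read from the model; the function names and the way logits are obtained are hypothetical and not taken from the paper. Only the 0.4 threshold comes from the description above.

```python
import math

# Threshold reported in the text: confidence below 0.4 is marked uncertain.
UNCERTAIN_BELOW = 0.4


def confidence_from_logits(logit_true: float, logit_false: float) -> float:
    """Softmax restricted to the two allowed tokens; returns P(True)."""
    m = max(logit_true, logit_false)  # subtract the max for numerical stability
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)


def is_uncertain(confidence: float, threshold: float = UNCERTAIN_BELOW) -> bool:
    """True when the output should be highlighted as a likely knowledge gap."""
    return confidence < threshold
```

Under this sketch, equal logits give a confidence of 0.5, which would not be flagged, while a strongly negative margin for True falls below the threshold and would be highlighted.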

Human respondents can edit a question, regenerate it, delete it, or write a new one from scratch. The paper emphasizes that these edits are not cosmetic. They constrain and redirect future generations, enabling the respondent to move the interaction away from generic questions and toward culturally specific, socially meaningful ones. Once a question is accepted, the LLM generates up to 5 answer nodes. These answers are also confidence-scored, with uncertain answers visually highlighted so that users can focus attention on likely model errors (Ziems et al., 31 Oct 2025).

The interface is tree-structured rather than linear. Each answer can become the seed for more question generation, allowing the interaction to branch into multiple thematic directions in parallel. Novel contributions are incentivized through character-level edit distance over user contributions, and users additionally rate AI outputs on a 0–3 Likert scale, where 3 denotes best / cannot be improved and 0 denotes bad / incorrect. In this configuration, the respondent is not only an annotator but a co-designer of the explored knowledge space (Ziems et al., 31 Oct 2025).
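
One way to picture the branching structure and the edit-distance signal is the Python sketch below. The `Node` class, its field names, and the `novelty` method are illustrative assumptions, not the paper's data model; the paper's exact incentive formula is not given here, so plain Levenshtein distance between the LLM text and the final text stands in for it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    """Hypothetical tree node: a question or an answer in the exploration tree."""
    kind: str                             # "question" or "answer"
    text: str                             # final text after any human edit
    original_text: Optional[str] = None   # LLM text before editing; None if human-written
    rating: Optional[int] = None          # 0-3 rating of the AI output, if given
    children: List["Node"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.rating is not None and self.rating not in range(4):
            raise ValueError("rating must be an integer from 0 to 3")

    def novelty(self) -> int:
        """Assumed novelty signal: characters changed relative to the LLM text."""
        return edit_distance(self.original_text or "", self.text)
```

In this sketch an unedited LLM node (where `original_text` equals `text`) has novelty 0, and a node written from scratch has novelty equal to its length, which matches the intuition that larger human contributions are rewarded more.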

3. Data collection, evaluation protocol, and empirical findings

The Culture Cartography study collected data for Nigeria and Indonesia using annotators recruited on Upwork. The reported annotator pools were 9 annotators for Nigeria, spanning 7 ethnolinguistic groups and 5 states, and 19 annotators for Indonesia, spanning 13 ethnolinguistic groups and 12 provinces. Annotators were paid $20/hour, underwent onboarding, and worked in the national language. The study created three non-overlapping subsets: Synthetic Data, in which humans validate top LLM answers to fixed questions; Traditional Annotation, in which humans answer fixed questions and add new answers; and Culture Cartography, in which humans work freely in CultureExplorer, editing questions and adding their own (Ziems et al., 31 Oct 2025).

The primary evaluation criterion is Recall@100, defined as

$$R@K = \frac{|\{\text{gold answers}\} \cap \{\text{model answers @}K\}|}{|\{\text{gold answers}\}|}.$$

The model is iteratively prompted to produce more examples, up to $K=100$, and GPT-4o-as-judge determines whether each gold answer is covered by the model’s answers. Human validation of the judge setup is reported at 85% agreement with Cohen’s $\kappa = 0.66$ (Ziems et al., 31 Oct 2025).
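
The metric can be sketched in Python as follows. The `covers` callable stands in for the judge; the paper uses GPT-4o for this role, whereas the `exact_match` function here is a simplifying assumption so the example runs without a model. All names are hypothetical.

```python
from typing import Callable, Iterable, List


def recall_at_k(
    gold: List[str],
    predicted: Iterable[str],
    k: int,
    covers: Callable[[str, List[str]], bool],
) -> float:
    """Fraction of gold answers covered by the first k model answers."""
    if not gold:
        raise ValueError("gold answers must be non-empty")
    top_k = list(predicted)[:k]
    hits = sum(1 for g in gold if covers(g, top_k))
    return hits / len(gold)


def exact_match(gold_answer: str, answers: List[str]) -> bool:
    """Stand-in judge (assumption): case-insensitive exact string match."""
    g = gold_answer.strip().lower()
    return any(g == a.strip().lower() for a in answers)
```

With placeholder gold answers `["a", "b", "c"]` and model answers `["A", "x", "C"]`, this sketch gives 1/3 at `k=2` and 2/3 at `k=100`, since the third model answer only counts once the cutoff includes it.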

Seven flagship models were evaluated: GPT-4o, o3-Mini, Claude 3.5 Sonnet, DeepSeek R1, Llama-4-Maverick, Qwen 2-72B, and Mixtral-8x22B. The main findings are that CultureExplorer produces harder data than traditional annotation, that DeepSeek R1 is the strongest model overall yet still misses socially important knowledge, and that the resulting data are not easily recoverable through web search. Compared with traditional annotation, Culture Cartography data are reported as 6% less likely to be known by DeepSeek R1 on Indonesia and 10% less likely on Nigeria. DeepSeek R1 still misses 15–18% of Culture Cartography data. GPT-4o with web search performs worse on Culture Cartography than without search: the reported numbers are 65.9% without search versus 61.9% with search for Indonesia, and 69.7% without search versus 54.8% with search for Nigeria. The paper characterizes this property as “Google-Proof” (Ziems et al., 31 Oct 2025).

The downstream consequence is that CultureExplorer-generated data support fine-tuning. The paper reports that fine-tuning on Culture Cartography data improves performance on BLEnD and CulturalBench, with Llama-3.1-8B gaining up to 19.2% accuracy on CulturalBench-Indonesia and 18.2% on CulturalBench-Nigeria. It further argues that such fine-tuning helps close the gap between smaller open models and larger proprietary systems like GPT-4o with search (Ziems et al., 31 Oct 2025).

4. Relation to cultural reasoning benchmarks and evaluators

CultureExplorer emerged within a broader benchmarking movement that seeks to make cultural competence measurable in more structured ways. XCR-Bench introduces a benchmark for cross-cultural reasoning in LLMs built from 4,136 parallel sentences and 1,098 unique CSIs across a Western source setting and Chinese, Arabic, and Bengali target settings, with Bengali split into West Bengal and Bangladesh. It supports three tasks—CSI Identification, CSI Prediction, and CSI Adaptation—and integrates Newmark’s CSI framework with Hall’s Triad of Culture. Its results show that models are better at prediction than identification, that performance declines from Visible to Semi-visible to Invisible cultural levels, and that models struggle particularly with social etiquette and cultural reference. The same study also reports regional and ethno-religious asymmetries within Bengali adaptation, including qualitative preferences such as pujo over Eid and dada over bhai (Kabir et al., 20 Jan 2026).

CultureScope addresses a different layer of the problem: scalable construction of culture-specific knowledge bases and evaluation sets. It proposes a four-level schema with 3 Cultural Layers, 5 Categories, 18 Topic Aspects, and 140 Fine-grained Dimensions, inspired by the cultural iceberg theory. The framework reports 11,962 cultural knowledge instances and 44.1k questions total, and its experiments show that performance is language-dependent, that misleading questions are notably harder than factual or conceptual ones, and that multilingual data alone does not guarantee cultural understanding. The paper’s central diagnosis is that cultural understanding requires explicit cultural knowledge rather than language coverage alone (Zhang et al., 19 Sep 2025).

ExCAM extends the ecosystem from benchmarking to evaluation metrics. It is presented as an Explainable Cultural Awareness Metric for instruction-output pairs and is trained on ExCAM40k, a roughly 40k-example dataset constructed from nine existing cultural benchmarks and augmented with hard and synthetic soft errors. ExCAM produces an MQM-style error report with error count, severity, type, spans, and explanation, while its scalar scoring heuristic uses $s^* = \sum_{i=1}^{n} \mathrm{sev}_i$, with minor $= -1$ and major $= -5$. The best variant, ExCAM (Gemma3), is reported to reach 0.798 error-detection accuracy (about 80%) on the balanced test set and to outperform prompting baselines including GPT-5 (Leiter et al., 28 May 2026).
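
The scalar heuristic is simple enough to show directly. The Python sketch below assumes each detected error carries a severity label of either "minor" or "major"; the function and constant names are hypothetical, and only the two penalty values come from the description above.

```python
from typing import Iterable

# Penalties stated in the text: minor = -1, major = -5.
SEVERITY_PENALTY = {"minor": -1, "major": -5}


def scalar_score(severities: Iterable[str]) -> int:
    """Sum of per-error penalties; 0 means no errors were reported."""
    return sum(SEVERITY_PENALTY[s] for s in severities)
```

For example, a report with two minor errors and one major error would score $-1 - 1 - 5 = -7$, and an error-free output would score 0.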

Taken together, these systems frame the larger environment in which CultureExplorer operates. XCR-Bench emphasizes cross-cultural reasoning and adaptation, CultureScope emphasizes theory-grounded coverage and automatic dataset generation, and ExCAM emphasizes reference-free error diagnosis. CultureExplorer differs by centering the data-creation process itself, using model uncertainty and human steering to discover culturally salient knowledge gaps before they are formalized into benchmarks or evaluators (Kabir et al., 20 Jan 2026, Zhang et al., 19 Sep 2025, Leiter et al., 28 May 2026).

5. Multimodal and collection-oriented extensions

Several papers explicitly describe their methods as supporting a “CultureExplorer-style” application or discovery process, extending the idea from LLM knowledge elicitation into visual reasoning, retrieval, and interactive collection access. CuRe is a benchmark and scoring suite for cultural representativeness in text-to-image systems, with 300 cultural artifacts across 32 cultural subcategories and six broad cultural axes (food, art, fashion, architecture, celebrations, and people) spanning 64 countries. Its Marginal Information Attribution scorers quantify how generations change as prompts become more informative, using attribute tuples such as $n$, $c$, $r$, and $s$, and the paper argues that this reveals cultural gaps in the long tail of T2I systems (Rege et al., 9 Jun 2025).

The Seeing Culture Benchmark extends the exploratory logic to vision-LLMs. It contains 1,065 images, 138 cultural artifacts, 3,178 questions, and 1,093 unique questions across seven Southeast Asian countries and five categories: music, game, dance, celebration, and wedding. Its two-stage evaluation protocol first requires multiple-choice visual question answering and then, conditional on correctness, segmentation of the relevant cultural artifact as evidence. The main result is a visual-grounding gap: models may answer correctly in Stage 1 yet fail to localize the relevant evidence in Stage 2 (Satar et al., 20 Sep 2025).

“Crossroads of Continents” contributes a three-phase multimodal framework: DalleStreet, a synthetic benchmark of 9,935 images from 67 countries and 10 concept classes; automated extraction of more than 18,000 unique cultural artifacts; and a modular adaptation pipeline, CultureAdapt. The artifact extraction stage aggregates country-conditioned cues using tf-idf, and the adaptation stage evaluates source-to-target editing with CLIPScore deltas. The paper explicitly treats many extracted associations as implicit and potentially stereotypical, showing that multimodal cultural exploration can expose both cultural knowledge and biased co-occurrence structure (Mukherjee et al., 2024).
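
The tf-idf aggregation idea can be sketched in Python by treating each country as one document of extracted artifact mentions. The exact weighting used in the paper is not stated here, so the plain $\text{tf} \times \log(N/\text{df})$ form, the function name, and the input layout are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_by_country(mentions: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score each artifact term per country with tf * log(N / df).

    `mentions` maps a country label to the list of artifact terms
    extracted for it (hypothetical input format).
    """
    n_docs = len(mentions)
    doc_freq: Counter = Counter()
    for terms in mentions.values():
        doc_freq.update(set(terms))  # count each term once per country

    scores: Dict[str, Dict[str, float]] = {}
    for country, terms in mentions.items():
        counts = Counter(terms)
        total = sum(counts.values())
        scores[country] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

Under this weighting, a term that appears for every country scores 0, while a term concentrated in one country scores highest there. That is also why such scores can surface stereotypical associations: they reward co-occurrence with a country label, not cultural accuracy.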

A parallel family of systems applies the same exploratory principle to cultural and scientific collections. CollEX is a multimodal agentic RAG system with a proof-of-concept installation containing 64,469 unique records across 32 collections and supporting lexical search, vector similarity search, image analysis, and LVLM function calling through a chat interface. Digital Collections Explorer is a web-based, open-source platform built around CLIP-based natural-language and reverse-image search, demonstrated on maps, photographs, and PDFs, including a map collection of 562,842 images from the Library of Congress and local deployment on a MacBook Pro with an M4 chip. Both systems are presented as ways to turn collections from static search targets into explorable multimodal spaces (Schneider et al., 10 Apr 2025, Huang et al., 1 Jul 2025).

These multimodal works suggest a broader interpretation of CultureExplorer: not only a tool for eliciting missing cultural knowledge from human experts, but also a design pattern in which culture is navigated through structured prompts, hierarchical taxonomies, visual evidence, and interactive retrieval substrates (Rege et al., 9 Jun 2025, Satar et al., 20 Sep 2025, Mukherjee et al., 2024, Schneider et al., 10 Apr 2025, Huang et al., 1 Jul 2025).

6. Social traces, urban mobility, and heritage applications

Beyond benchmarks and collection interfaces, CultureExplorer-style analysis has been applied to social media, urban mobility, tourism, and cultural heritage. “Instagram Post Data Analysis” uses data from the Instagram API across 50 cities (the most populous city in each U.S. state) to compare filter usage, hashtags, and likes by location. The study treats location as a cultural partition and argues that location-conditioned Instagram data can reveal visual culture differences via filters and event or topical differences via hashtags. It also explicitly notes that this structure supports a CultureExplorer-style application with geographic browsing, comparative analytics, and recommendation support (Chang, 2016).

“Semantic Trails of City Explorations” constructs semantic trails from Foursquare-based check-ins, defining a trail as a temporally ordered list of distinct-venue check-ins by the same user, with an eight-hour threshold for continuity. After filtering and enrichment, STD 2013 contains 18,587,049 check-ins and 6,103,727 trails, while STD 2018 contains 11,910,007 check-ins and 4,038,150 trails. The paper uses these datasets to build a Tourist Sequence Recommender that generates semantically coherent activity sequences rather than isolated POIs, positioning the resulting infrastructure as useful for tourism recommendation and cultural exploration (Monti et al., 2018).
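
The trail definition can be illustrated with a short Python sketch for a single user's check-ins. The eight-hour threshold comes from the description above; the tuple layout, the function name, and the choice to collapse immediate repeats at the same venue (one reading of "distinct-venue") are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

CheckIn = Tuple[datetime, str]  # (timestamp, venue_id) for one user


def build_trails(
    checkins: List[CheckIn],
    max_gap: timedelta = timedelta(hours=8),
) -> List[List[CheckIn]]:
    """Split one user's check-ins into temporally ordered trails.

    A new trail starts when the gap since the last kept check-in
    exceeds `max_gap`. An immediate repeat at the same venue is
    dropped (assumed interpretation of "distinct-venue").
    """
    trails: List[List[CheckIn]] = []
    for ts, venue in sorted(checkins):
        if trails and ts - trails[-1][-1][0] <= max_gap:
            if trails[-1][-1][1] != venue:
                trails[-1].append((ts, venue))
        else:
            trails.append([(ts, venue)])
    return trails
```

A recommender built on such trails can then model sequences of venue categories rather than isolated points of interest, which is the use the paper describes.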

EGO-CH shifts the focus to first-person museum and heritage-site behavior. The dataset contains more than 27 hours of video acquired by 70 subjects, with labels for 26 environments and over 200 Points of Interest, collected at Galleria Regionale di Palazzo Bellomo and Monastero dei Benedettini. It supports four tasks—room-based localization, point of interest/object recognition, object retrieval, and survey prediction—and links egocentric perception to post-visit liking and memory. The paper frames this as a basis for real-time visitor assistance and offline cultural-site analytics (Ragusa et al., 2020).

At city scale, “Discovering Latent Patterns of Urban Cultural Interactions in WeChat for Modern City Planning” uses 56,239,429 check-ins from 9,517,175 users and 2,428,182 venues in Beijing to infer latent cultural typologies via Temporal LDA, select the number of patterns with the Temporal Coherence Value metric, and derive high-resolution demand-supply maps. The result is a behavior-aware planning framework that identifies urban regions lacking cultural resources (Zhou et al., 2018).

Recommendation research adds a user-control dimension. “Exploration on Demand” proposes adaptive clustering with user-controlled exploration and reports that exploration reduces intra-list similarity from 0.34 to 0.26 while increasing unexpectedness to 0.73. In LLM-based A/B testing with 300 simulated users, 72.7% of long-term users prefer exploratory recommendations over purely exploitative ones. Although this work studies movies rather than culture-specific knowledge, it directly supports the broader CultureExplorer principle that exploration should widen horizons without abandoning relevance (Bianchi, 29 Jul 2025).
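
Intra-list similarity is commonly computed as the mean pairwise similarity among the items in one recommendation list. The Python sketch below uses cosine similarity over item feature vectors; whether the cited paper uses this exact formulation is not stated here, so the similarity function and vector representation are assumptions.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def intra_list_similarity(items: List[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity over all unordered item pairs."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Lower values mean a more diverse list, so a drop such as the reported 0.34 to 0.26 indicates that exploratory recommendations are less homogeneous than exploitative ones.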

7. Limitations, controversies, and future directions

The CultureExplorer paper is explicit that its data are not fully representative of the cultures it studies. Annotators were recruited via Upwork, which may overrepresent people with stable internet access, English ability, and platform visibility, and the number of annotators per group is small. It also notes that culture is broader than question-answer knowledge and includes stories, histories, and artifacts, and warns that LLM-generated suggestions could flatten or misrepresent culture if users over-trust them (Ziems et al., 31 Oct 2025).

These limitations recur across adjacent work. XCR-Bench shows that cultural adaptation can encode regional and ethnoreligious asymmetries even within a single linguistic setting, while “Crossroads of Continents” explicitly treats many extracted country-artifact associations as potentially stereotypical. CuRe notes that geography is used as a proxy for culture and that its benchmark excludes ambiguous artifact names. SCB shows that good multiple-choice accuracy does not imply evidence-grounded understanding. ExCAM warns that synthetic errors may not match real human mistakes and that explanation quality is about plausibility rather than faithfulness. Together, these results show that apparent cultural competence can arise from shortcuts, stereotypes, or benchmark-specific artifacts rather than robust understanding (Kabir et al., 20 Jan 2026, Mukherjee et al., 2024, Rege et al., 9 Jun 2025, Satar et al., 20 Sep 2025, Leiter et al., 28 May 2026).

Data-source bias is another persistent concern. StreetStyle explicitly observes that Instagram users are not a random sample of humanity and that face and person detectors may contain unmeasured bias across age, gender, and race. The Instagram filter-analysis paper notes unpredictable API behavior, sparse hashtag data, and the limited scope of U.S.-only analysis. Digital Collections Explorer emphasizes impoverished metadata, while CollEX reports strong dependence on the underlying LVLM and the absence of formal evaluation. These caveats indicate that exploratory systems often derive power from scale and flexibility, but also inherit the biases of platforms, sensors, retrieval substrates, and foundation models (Matzen et al., 2017, Chang, 2016, Huang et al., 1 Jul 2025, Schneider et al., 10 Apr 2025).

A plausible implication of this literature is that future CultureExplorer systems will combine mixed-initiative knowledge elicitation, theory-grounded schemas, explainable metrics, multimodal grounding, and user-controlled exploration in a single loop. The present record already points in that direction: CultureExplorer supplies participatory long-tail knowledge creation; CultureScope supplies extensible dimensional organization; XCR-Bench and SCB supply reasoning-oriented benchmarks; ExCAM supplies free-text error diagnosis; and multimodal retrieval systems such as CollEX and Digital Collections Explorer supply interactive access layers. In that sense, CultureExplorer names both a specific annotation tool and an emerging research paradigm for making culture computationally explorable without collapsing it into surface artifacts or static questionnaires (Ziems et al., 31 Oct 2025, Zhang et al., 19 Sep 2025, Kabir et al., 20 Jan 2026, Leiter et al., 28 May 2026, Schneider et al., 10 Apr 2025, Huang et al., 1 Jul 2025).