The Curious Case of Curiosity across Human Cultures and LLMs (2510.12943v1)
Abstract: Recent advances in LLMs have expanded their role in human interaction, yet curiosity -- a central driver of inquiry -- remains underexplored in these systems, particularly across cultural contexts. In this work, we investigate cultural variation in curiosity using Yahoo! Answers, a real-world multi-country dataset spanning diverse topics. We introduce CUEST (CUriosity Evaluation across SocieTies), an evaluation framework that measures human-model alignment in curiosity through linguistic (style), topic preference (content) analysis and grounding insights in social science constructs. Across open- and closed-source models, we find that LLMs flatten cross-cultural diversity, aligning more closely with how curiosity is expressed in Western countries. We then explore fine-tuning strategies to induce curiosity in LLMs, narrowing the human-model alignment gap by up to 50%. Finally, we demonstrate the practical value of curiosity for LLM adaptability across cultures, showing its importance for future NLP research.
Explain it Like I'm 14
What is this paper about?
This paper studies curiosity — the urge to ask questions and learn — and how it looks different in people from different countries. It then checks whether today’s large language models (LLMs) show those same differences when they “act” like people from different cultures. The authors introduce a tool called CUEST to measure how closely LLMs match human curiosity across cultures, and they try ways to make LLMs more curious and culturally aware.
What questions did the researchers ask?
They focused on three simple questions:
- Do people from different countries show different styles and interests when they ask questions online, and do LLMs copy these patterns?
- Can we train LLMs to be more curious in a way that respects cultural differences?
- Does adding curiosity actually help LLMs do better on cultural understanding tasks?
How did they study it?
Think of their approach like comparing music tastes and styles:
- They looked at “songs” (questions people ask) from many countries.
- They compared those to “songs” created by LLMs pretending to be from those countries.
- They checked both the “style” of the questions (how they’re written) and the “playlist” choices (which topics each country is most curious about).
- They also checked their results against “music theory” — well-known social science theories about culture — to see if patterns made sense.
Here’s what they did in everyday terms:
- Human questions: They used real questions from Yahoo! Answers posted by people in 18 countries, across 16 topics (like family, health, arts). Questions in local languages were translated to English to compare fairly.
- LLM questions: They asked several AI models to role-play (“Assume you are from <country>…”) and generate:
- Questions on the same topics (to capture style)
- Rankings of which topics people in that country would be most curious about (to capture content preferences)
- CUEST evaluation (their measurement framework) had three parts:
- 1) Linguistic Alignment (style): How are the questions written? They scored four features:
- Ambiguity: Could the question be understood in more than one way? (Ambiguity can signal deeper exploration.)
- Rhetorical devices: Does it use creative language (like metaphors) to explore ideas?
- Open-endedness: Is there more than one possible answer?
- Cohesion: Do the ideas in the question connect smoothly?
- 2) Topic Preference Alignment (content): Do the model’s favorite topics for each country match the human patterns?
- 3) Grounding in Social Science: They compared patterns with well-known cultural theories (like Hofstede’s dimensions, Schwartz’s values, and Hall’s high/low context communication) to see if the curiosity patterns made cultural sense.
- Making LLMs more curious (fine-tuning): They tried two training strategies on the best-performing model:
- Full fine-tuning (changing many model weights)
- Adapter fine-tuning (adding a small “attachment” that teaches new behavior without changing the whole model)
- They trained on questions from Yahoo! Answers and from country-specific Reddit communities. They tested two training styles:
- obj1: “Write one question from a person in <country>” (direct)
- obj2: Add a short “setup” sentence before the question, like a mini conversation, to encourage natural curiosity.
- Inherent curiosity test: They gave models short statements (some general, some country-specific) that naturally invite follow-up questions, then checked:
- Curiosity rate: Does the model ask a question at all?
- Relevance: Is the follow-up question on-topic?
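To make the “curiosity rate” and “relevance” checks concrete, here is a minimal Python sketch of how such scoring could look. This is not the authors’ code: the question-detection heuristic and the word-overlap relevance proxy are illustrative assumptions (the paper relied on human judgments for relevance).

```python
import re

def is_question(response: str) -> bool:
    """Heuristic: does the model's reply contain at least one question?"""
    return "?" in response

def relevance_proxy(statement: str, response: str) -> float:
    """Crude stand-in for human relevance judgments: fraction of the
    statement's content words that the follow-up question reuses."""
    stopwords = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "that"}
    stmt_words = {w for w in re.findall(r"[a-z']+", statement.lower()) if w not in stopwords}
    resp_words = set(re.findall(r"[a-z']+", response.lower()))
    return len(stmt_words & resp_words) / max(len(stmt_words), 1)

def curiosity_report(pairs):
    """pairs: list of (statement, model_response) tuples."""
    questions = [(s, r) for s, r in pairs if is_question(r)]
    curiosity_rate = len(questions) / max(len(pairs), 1)
    mean_relevance = sum(relevance_proxy(s, r) for s, r in questions) / max(len(questions), 1)
    return {"curiosity_rate": curiosity_rate, "mean_relevance": mean_relevance}

# Example: one general and one country-specific statement that invite follow-ups.
pairs = [
    ("I just moved to a new city for work.",
     "That sounds exciting! What drew you to that city?"),
    ("In my country, weddings usually last three days.",
     "Interesting, weddings can be long events."),
]
print(curiosity_report(pairs))  # only the first reply counts as a follow-up question
```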
What did they find, and why does it matter?
Main findings:
- LLMs flatten cultural differences. Human questions vary a lot by country, but LLM-generated questions look more similar across countries and lean toward Western styles. In style:
- Humans: more ambiguity and stronger cohesion (ideas connect closely).
- LLMs: more open-ended questions (broad, general).
- Topic preferences diverge. LLMs often favored topics like “Family and relationships” or “Music and entertainment,” while humans prioritized things like “Arts and humanities,” “Beauty and style,” and “Business and finance.” Only one model (LLaMA-3-8b) showed modest positive agreement with human topic rankings; others didn’t match well.
- Social science grounding shows gaps. Models aligned better with Western countries than with Eastern or Latin American ones. Some cultural theories matched the data (e.g., “Openness vs. Conservation” from Schwartz), while others didn’t (e.g., Hall’s high/low context), even for humans, likely because online platforms reduce social pressure and cultures change over time.
- Training can make models more curious and more culturally aligned. Adapter fine-tuning, especially with Reddit data and the conversational training style (obj2), worked best. It reduced the gap between human and model curiosity by up to about 50% (depending on the measure).
- Inherent curiosity improved. Adapter-tuned models asked follow-up questions far more often (about 75% of the time) and kept them relevant, while fully fine-tuned models asked fewer (about 20%).
- Curiosity helps on real tasks. When models first asked themselves one or two sub-questions before answering, they did better on three cultural benchmarks (judging social norms, answering culture knowledge questions, and cultural commonsense). The biggest gains came when combining curiosity prompting with the curiosity-trained model.
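The “ask yourself one or two sub-questions first” step in that last finding can be sketched as a thin prompting wrapper inspired by Self-Ask. In the sketch below, `generate` is a placeholder for whatever LLM call you use, and the prompt wording (including the country persona line) is an illustrative assumption, not the paper’s exact template.

```python
from typing import Callable

def curious_answer(task: str, country: str, generate: Callable[[str], str]) -> str:
    """Self-Ask-style wrapper: elicit 1-2 clarifying sub-questions, answer them,
    then produce the final answer conditioned on that curiosity step."""
    # Step 1: ask the model what it would want to know first (persona-conditioned).
    probe = (
        f"Assume you are from {country}. Before answering the task below, "
        f"write one or two short sub-questions you are curious about.\n\nTask: {task}"
    )
    sub_questions = generate(probe)

    # Step 2: let the model answer its own sub-questions.
    sub_answers = generate(
        f"Briefly answer these sub-questions from the perspective of {country}:\n{sub_questions}"
    )

    # Step 3: final answer, grounded in the curiosity step.
    final_prompt = (
        f"Task: {task}\n\nSub-questions you raised:\n{sub_questions}\n\n"
        f"Your notes:\n{sub_answers}\n\nNow give the final answer."
    )
    return generate(final_prompt)

# Usage with any text-generation backend:
# answer = curious_answer("Is it acceptable to arrive 30 minutes late to a dinner?",
#                         "Mexico", generate=my_llm_call)
```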
Why this matters:
- Friendlier, fairer AI: If LLMs understand that curiosity looks different across cultures, they can ask better follow-up questions, avoid stereotypes, and fit local norms more respectfully.
- Better performance: Curiosity isn’t just “cute” — it helps models reason more carefully and make better decisions in culturally sensitive tasks.
What could this change in the future?
- Build-in curiosity as a skill: Training LLMs to ask thoughtful, culturally aware questions can make them more helpful and adaptable everywhere.
- Respect cultural variety: Instead of using one-size-fits-all behavior, models can learn to match different communication styles and interests.
- Safer cultural alignment: Adapter-based methods can add culture-aware curiosity without overwriting a model’s general knowledge, reducing the risk of losing other abilities.
- Better benchmarks and training: The CUEST framework can guide future research on how LLMs ask questions, not just how they answer them.
Overall, this paper shows that curiosity is a powerful tool for making AI more inclusive, more accurate, and more human-aware across different cultures.
Knowledge Gaps
Unresolved Knowledge Gaps, Limitations, and Open Questions
Below is a single, consolidated list of issues that remain missing, uncertain, or unexplored, framed to be concrete and actionable for future research.
- Dataset representativeness and temporal validity: Yahoo! Answers is an older, English-heavy, anonymous platform with unknown user demographics and time ranges; verify whether it meaningfully reflects current, national curiosity patterns and complement it with contemporary, native-language Q&A sources, search logs, and region-specific forums.
- Translation-induced distortion: Quantify how machine translation alters rhetorical devices, ambiguity, and cohesion across cultures; run bilingual human audits and use multilingual models to generate/analyze questions in native languages to minimize translation bias.
- Construct validity of curiosity metrics: Validate whether ambiguity, open-endedness, cohesion, and rhetorical devices reliably measure curiosity cross-culturally; develop and test psychometrically grounded measures (e.g., information gap magnitude, epistemic vs. social curiosity typologies) with diverse human annotations.
- Theory grounding via LIWC: The mapping from question text to Hofstede/Schwartz/Hall constructs using LIWC is not validated cross-culturally; build and evaluate culturally sensitive coding schemes and consider contemporary frameworks (e.g., narrative identity, intersectionality, microcultures) to avoid over-reliance on legacy models.
- Persona prompting validity: Assess whether LLMs truly adopt cultural personas versus merely echo stereotypes; perform manipulation checks, vary prompt formulations systematically, and measure prompt sensitivity and robustness across models.
- Statistical rigor and reporting: Provide per-country/topic sample sizes, confidence intervals, effect sizes, multiple-comparisons corrections, and power analyses; expand beyond three random seeds to characterize inter-run variability and ensure replicability.
- Attribution of “Western alignment” and flattening: Test hypothesized causes (pretraining corpora composition, RLHF/safety tuning) with controlled ablations across model variants and training pipelines to move beyond speculation.
- Topic preference operationalization: Question frequency as a proxy for national curiosity is confounded by platform dynamics; incorporate ground-truth preference signals (surveys, search trends, media consumption) and explain negative correlations observed for certain countries (e.g., UK, Mexico).
- Fine-tuning data representativeness: Country subreddits may reflect diaspora, English-speaking minorities, or moderation artifacts; audit the cultural authenticity of Reddit data and source country-local platforms to reduce sampling bias.
- Lexical augmentation limitations: Simple synonym/word-swap augmentation may distort style and rhetorical features; evaluate human-perceived fidelity and use paraphrase/back-translation or style-controlled augmentation instead.
- Model generalization: Fine-tuning and adapter results are shown only for LLaMA-3-8b; test portability of adapter-based curiosity induction across diverse model families, sizes, and guardrailed deployments.
- Inherent curiosity evaluation scope: Current evaluation is single-turn with two human annotators; expand to multi-turn, cross-cultural raters, measure appropriateness and user satisfaction, and detect over-questioning or intrusive behavior.
- Safety, bias, and ethics: Curiosity induction can amplify stereotypes, privacy violations, or culturally inappropriate probing; design guardrails, consent-aware questioning policies, and harm audits across sensitive domains and cultures.
- Multilingual/dialectal coverage: Extend CUEST beyond English to native languages, dialects, and code-switching contexts; develop multilingual rhetorical device detectors and language-specific feature sets.
- Rhetorical devices measurement: Translation likely obscures figurative language; build annotated corpora and classifiers for rhetorical devices that are robust across languages and cultural styles.
- Downstream evaluation breadth: Improvements are shown on three English-centric benchmarks; test impact on real-world applications (e.g., tutoring, health advice, customer support) across cultures and languages, including longitudinal outcomes.
- Curiosity vs. deliberation confound: Disentangle gains due to curiosity-driven questioning from generic increases in deliberation tokens (e.g., chain-of-thought); run ablations (self-ask vs. CoT vs. structured questioning) with controlled token budgets.
- Curiosity calibration: Determine context-sensitive question-asking rates and triggers (uncertainty, cultural mismatch) to avoid user burden; develop metrics and policies for adaptive, situation-appropriate curiosity.
- Stereotype reinforcement dynamics: LLM alignment improves when human curiosity matches stereotypes; design detection/mitigation methods for stereotype reliance in question generation and test with counter-stereotypical scenarios.
- Hall’s high/low-context non-alignment: Both human and model results diverge from expected patterns; revisit measurement instruments by analyzing contextualized discourse (conversations, narratives) instead of standalone questions.
- Country grouping granularity: The Western/Eastern/Latin classification is coarse; investigate subregional, urban/rural, socio-economic, age, and gender segments to capture within-country cultural variability.
- Reproducibility and openness: Closed-source models limit replication; release complete prompts, seeds, sampling/decoding settings, generated datasets, and evaluation scripts to standardize CUEST replication.
- Temporal dynamics: Analyze how curiosity styles and topic preferences shift over time (e.g., pre/post major cultural events) and whether LLMs capture these temporal trends.
- Human annotation of curiosity quality: Move beyond limited subjective evaluation to build cross-cultural annotation campaigns labeling curiosity type, intensity, and appropriateness; report inter-annotator agreement and adjudication procedures.
- Metric robustness under translation: Conduct controlled experiments with parallel corpora to estimate how translation affects ambiguity/cohesion scores and calibrate metrics accordingly.
- Benchmark cultural validity: NormAd, CulturalBench, and Cultural Commonsense are English-centric; develop multilingual versions and country-specific variants and test cross-lingual transfer and fairness.
- Integration into training pipelines: Specify how curiosity can be systematically integrated into RLHF or post-training (e.g., curiosity-sensitive reward models, adaptive questioning policies) and evaluate trade-offs with safety and efficiency.
Practical Applications
Below is an overview of practical, real-world applications that stem from the paper’s findings, methods (CUEST), and demonstrated innovations (adapter-based curiosity induction and “prompt-ask” workflows). Items are grouped by deployability horizon and, where relevant, mapped to sectors and potential tools or workflows. Each item includes assumptions or dependencies that may affect feasibility.
Immediate Applications
- Culture-adaptive customer support and sales assistants (software; industry)
- Use adapter-based fine-tuning (LoRA) on a performant base model (e.g., LLaMA‑3‑8B) to induce curiosity and add a “prompt-ask” step so assistants ask one or two clarifying questions before resolving tickets or making recommendations.
- Workflow: CUEST-scored evaluation → adapter fine-tuning with obj2 conversational scaffolding → runtime “curious-ask” guardrail (Self‑Ask style) → deploy per market persona.
- Assumptions/dependencies: Availability of representative regional conversational data (e.g., Reddit, local forums), safe RLHF integration, translation fidelity for non-English queries, and monitoring to avoid stereotyping.
- Culturally aware triage and intake in healthcare (healthcare; industry/public sector)
- Incorporate curiosity-driven follow-up questions during symptom collection, social determinants screening, or mental health check-ins to reduce misinterpretation across cultural contexts.
- Tools: Adapter-fine-tuned “curious” module, CUEST linguistic metrics (ambiguity, cohesion) for QA of prompts, and audit on Cultural Commonsense benchmarks.
- Assumptions/dependencies: Clinical validation and safety review, HIPAA/GDPR compliance, interpreter/translation support, and bias audits to ensure curiosity doesn’t reinforce stereotypes.
- Cross-cultural tutoring and coaching (education; industry/academia)
- EdTech tutors that dynamically adjust question style (direct vs. indirect, open-endedness, rhetorical devices) by learner locale and background, increasing engagement and comprehension.
- Workflow: CUEST alignment checks per target cohort → adapter fine-tuning with obj2 → in-session “curious-ask” routine for formative assessment.
- Assumptions/dependencies: Curriculum alignment, teacher oversight, multilingual capabilities, and data to calibrate Hall’s high/low-context adaptations.
- Content localization and UX research (media/marketing/product; industry)
- Use CUEST to compare human vs. model topic preferences and question styles across countries, guiding localized content strategy (e.g., campaign Q&A, help center, onboarding).
- Tools: Topic Preference Alignment (Spearman/Kendall), LIWC-based theory grounding for tone/style audits, A/B testing with “curious-ask” interactions.
- Assumptions/dependencies: Reliable market data per locale; guardrails to avoid over-reliance on Western stereotypes seen in baseline LLMs.
- Safer automation via clarifying questions (platform safety/compliance; industry/policy)
- Before executing sensitive actions (account changes, financial transfers, content moderation), assistants ask culture-aware clarifying questions to reduce false positives/negatives.
- Workflow: “Prompt-ask” interstitial + risk-aware policies → evaluate gains (e.g., NormAd accuracy improvements) → phased rollout.
- Assumptions/dependencies: Latency budgets, user tolerance for extra steps, and robust privacy/consent handling.
- Survey and questionnaire design assistants (academia/policy; social science)
- Authoring tools that recommend culturally appropriate question framing (ambiguity, open-endedness, rhetorical devices) to improve response quality in international studies.
- Tools: CUEST’s linguistic metrics; LIWC grounding with Hofstede/Schwartz for tone calibration; generator that produces variants per country persona.
- Assumptions/dependencies: Domain expert review, multilingual nuance preservation, and avoidance of outdated stereotype templates.
- Cross-cultural onboarding for global teams (HR/L&D; industry)
- Microlearning bots that model curiosity through context-aware questions about workplace norms, meetings, feedback styles, and social rituals by region.
- Tools: Adapter-based “curiosity” induction; NormAd-derived scenarios; “curious-ask” interactions for reflective practice.
- Assumptions/dependencies: Organizational buy-in, content validation by local employees, and privacy-preserving analytics.
- Personal assistants for travel and daily life (consumer software; daily life)
- Travel planning and etiquette helpers that ask clarifying questions (e.g., dietary norms, social rituals) to reduce misunderstandings and improve recommendations.
- Workflow: Country persona → curiosity prompts → follow-up question generation → refined plan.
- Assumptions/dependencies: Up-to-date local data, language support, and careful handling of local sensitivities.
- Model evaluation pipelines for R&D (software; academia/industry)
- Integrate CUEST into model QA to measure cultural curiosity alignment (linguistic facets, topic preferences, theory grounding) before deployment to new markets.
- Tools: Open-source CUEST repo; benchmark suite (NormAd, CulturalBench, Cultural Commonsense); adapter vs. full FT comparators.
- Assumptions/dependencies: Internal evaluation resources, dataset licenses, and versioning discipline.
Long-Term Applications
- Curiosity-aware RLHF and training objectives (software; academia/industry)
- Build training pipelines that optimize for culture-aware curiosity (asking relevant clarifying questions) alongside helpfulness/safety, reducing Western-centric flattening.
- Potential product: “Curiosity-RLHF” module that co-optimizes curiosity rate and relevance with fairness constraints.
- Assumptions/dependencies: Scalable multilingual datasets, new reward models, and careful safety alignment to avoid intrusive or biased questioning.
- Multilingual, culturally grounded LLMs beyond translation (software; industry/academia)
- Native-language training that preserves rhetorical devices, indirectness, and cohesion—improving alignment in East Asian and Latin American contexts where baseline gaps were largest.
- Tools: Curiosity corpora in local languages; regional Reddit/online forums; language-specific LIWC-like lexicons.
- Assumptions/dependencies: Data acquisition, community partnerships, and region-specific evaluation standards.
- Regulatory and standards development for cultural adaptability (policy; public sector/industry)
- Create audit standards that include curiosity alignment (CUEST) as part of fairness and global deployability certifications (ISO/IEEE-like).
- Potential workflow: Pre-market audit → periodic re-certification → incident reporting tied to cultural misinterpretation metrics.
- Assumptions/dependencies: Multi-stakeholder consensus, measurement validity, and non-discrimination enforcement.
- Personalized cultural profiles and adaptive dialogue (software; industry)
- Assistants that infer and continuously update a user’s cultural interaction preferences (directness, topic interests, curiosity tolerance) to personalize question-asking behavior.
- Tools: On-device preference learning; privacy-preserving signals; adaptive “curious-ask” controllers.
- Assumptions/dependencies: Strong privacy/consent models, transparency, and opt-in controls.
- Cross-cultural decision support in finance and public services (finance/government; industry/public sector)
- Loan officers’ assistants or benefits eligibility systems that use curiosity-driven questioning to surface context (e.g., informal employment norms, family obligations).
- Workflow: Policy-compliant “prompt-ask” layer → adjudication rationale logging → fairness audits.
- Assumptions/dependencies: Regulatory approval, explainability requirements, and rigorous bias testing.
- Social media and community design to reduce misinterpretation (platform design; industry/academia)
- “Curiosity by default” interaction features (gentle clarifying questions before posting/replying) to reduce cross-cultural misunderstandings and polarization.
- Tools: Lightweight “curious-ask” nudges; localized phrasing; A/B testing of conflict reduction.
- Assumptions/dependencies: User acceptance, measurable harm reduction, and abuse prevention.
- Embodied AI and service robots (robotics/hospitality/healthcare; industry)
- Robots that adapt questioning style to local norms (indirectness in high-context cultures, direct clarification in low-context), improving guest experiences and patient interactions.
- Tools: On-device adapters; sensor-informed contextual question generation; CUEST-based evaluations.
- Assumptions/dependencies: Real-time compute, safety certification, and multimodal integration.
- Crisis response and humanitarian assistance (public sector/NGO)
- Field assistants that ask culturally sensitive clarifying questions in disaster relief, refugee services, or public health outreach to avoid harmful misunderstandings.
- Workflow: Offline-capable “curious-ask” packs; region-specific adapters; post-incident review metrics (curiosity rate/relevance).
- Assumptions/dependencies: Low-resource deployment, local language support, and ethical guidelines.
- Cultural analytics for policy and foresight (academia/policy)
- Track shifts in societal openness, uncertainty avoidance, and topic curiosity over time using CUEST-like metrics, informing cultural diplomacy or education reforms.
- Tools: Time-series pipelines; LIWC/construct grounding; dashboards with region/country segmentation.
- Assumptions/dependencies: Ethical data sourcing, longitudinal coverage, and interpretability safeguards.
- Enterprise localization “OS” (software; industry)
- Platform layer orchestrating question-style adaptation across products (support, marketing, onboarding) with centralized CUEST scoring and adapter deployment per locale.
- Assumptions/dependencies: Integration cost, governance for stereotype avoidance, and cross-team alignment.
- Cross-cultural clinical decision support (healthcare; industry/academia)
- Diagnostic systems that ask culturally aware clarifying questions about symptoms, diet, and home remedies to improve accuracy in diverse populations.
- Assumptions/dependencies: Clinical trials, medical liability frameworks, and guardrails against bias.
In all cases, the paper’s evidence supports two design pillars that underpin these applications:
- Evaluation-first deployment: Use CUEST to measure linguistic alignment, topic preferences, and theory-grounded scores per target market before shipping.
- Curiosity-as-a-mechanism: Add a “curious-ask” step (prompt-ask) and, when feasible, adapter-based curiosity induction (obj2 + conversational scaffolding) to improve cultural adaptability and downstream task performance.
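As a concrete illustration of the second pillar, here is a minimal LoRA sketch (using Hugging Face peft) of adapter-based curiosity induction with an obj2-style conversational training example. The checkpoint name, hyperparameters, and data format are assumptions for illustration, not the paper’s exact recipe.

```python
# Minimal sketch of adapter-based curiosity induction with LoRA.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the small adapter weights are trainable
model.print_trainable_parameters()

# obj2-style training example: a short conversational setup followed by the question,
# so the model learns curiosity in context rather than isolated question writing.
example = (
    "A friend from India mentions they are planning a festival at home.\n"
    "Question: Which sweets do you usually prepare first, and who helps you make them?"
)
batch = tokenizer(example, return_tensors="pt")
# ...from here, train with your preferred loop or Trainer on Yahoo! Answers / subreddit data.
```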
Glossary
- Adapter-based fine-tuning: A parameter-efficient method that adds small adapter modules to a model’s layers to specialize behavior without updating all weights. "adapter-based fine-tuning achieves the best performance and generates contextually relevant questions in conversations."
- Ambiguity: The degree to which a question or statement can be interpreted in multiple ways, inviting exploration. "ambiguity: captures the degree to which a question invites multiple interpretations, which is central to curiosity as it leads to exploration"
- Cohesion score: A quantitative measure of how well a text connects ideas using cohesive devices and linking concepts. "cohesion score: measures how well a question links concepts together -- connecting ideas can show deeper engagement"
- Cultural personas: Prompted identities that instruct models to generate content from the perspective of a specific country or culture. "LLM-generated questions and topic preferences (using cultural personas)."
- CulturalBench: A benchmark of human-verified questions across regions to probe cultural understanding. "CulturalBench is a set of 1,227 human-verified questions from 45 regions across different topics to probe cultural understanding (we use the Hard subset)"
- Cultural Commonsense: An evaluation that tests models’ ability to reason about culturally specific everyday inferences. "Cultural Commonsense tests models’ ability to reason about culturally specific everyday inferences (we use the English fill-in-the-blanks subset on country customs)"
- CUEST: An evaluation framework for measuring human–LLM alignment in curiosity across cultures via linguistic, topic, and theory-based analyses. "We introduce CUEST: a framework combining linguistic and content analysis with cultural grounding to analyze human and LLM curiosity questions across cultures."
- F-stat: A statistic from ANOVA used to compare variances across groups; here used to quantify cross-cultural variance differences. "(F-stat: $7.33$) -- LLMs tend to flatten cross-cultural differences."
- Grounding in Social Science Constructs: Situating analyses within established cultural theories to interpret patterns meaningfully. "Grounding in Social Science Constructs to contextualize curiosity within established cultural frameworks."
- Hall's high-context vs low-context: A cultural communication dimension contrasting implicit, indirect styles with explicit, direct styles. "Hall's high context cultures rely on implicit cues, and tentativeness, meaning curiosity often manifests in indirect forms."
- Hofstede's Cultural Dimensions: A framework of national culture dimensions used to analyze cross-cultural differences. "Hofstede's Cultural Dimensions: We operationalize 4 dimensions related to curiosity:"
- Indulgence-Restraint (IR): A Hofstede dimension contrasting enjoyment and hedonism with discipline and control. "Indulgence-Restraint (IR): indulgent culture frame curiosity as playful and hedonic, while restrained cultures emphasize discipline and self-control"
- Individualism-Collectivism (IC): A cultural dimension contrasting self-oriented with group-oriented values and inquiry. "Individualism-Collectivism (IC): expressing as a self-related inquiry or a social exploration"
- Kendall correlation: A rank correlation measure (Kendall’s tau) assessing ordinal association between rankings. "We now compare culture-specific LLM topic preferences with the human-authored ones using Spearman and Kendall correlations ($1=$ very similar to humans, $0=$ unrelated, negative $=$ inverse)."
- L2 distance: The Euclidean distance metric used to quantify differences between model and human scores. "x-axis show model-human distances (L2) (averaged across metrics and countries)."
- LIWC: Linguistic Inquiry and Word Count, a psycholinguistic lexicon used to compute category-based text scores. "We compute LIWC scores corresponding to the above dimensions, showed in Fig~\ref{fig:liwc_ss}."
- Long-/Short-term orientation (LSO): A Hofstede dimension contrasting future-oriented with present/past pragmatic focus. "Long-/Short-term orientation (LSO): if curiosity is future oriented or present/past-oriented and pragmatic"
- MAE (Mean Absolute Error): The average absolute difference between values; used to summarize alignment gaps. "Social Science Theory Alignment: Region-wise MAE for models shows West vs East/Lat Am. Differences are prominent across models."
- Mean absolute differences: The average of absolute differences between model and human scores across countries. "Fig.~\ref{fig:regionwise_ling} presents the mean absolute differences between models and humans for each country."
- NormAd: A framework for measuring LLMs’ cultural adaptability via judgments of social acceptability across countries. "NormAd contains culturally-grounded social situations across 75 countries and tests models' judgments of social acceptability (we restrict to binary (acceptable vs. not acceptable) and exclude neutral)"
- Open-endedness: A property of questions that allow broad, unconstrained answers and invite exploration. "open-endedness: reflects whether a question allows broad answers, similar to curiosity's drive for knowledge-seeking."
- Rhetorical devices: Figurative or creative language constructions that shape expression and can enhance curiosity. "rhetorical devices: refers to creative or exploratory phrasing that can enhance curiosity expression"
- RLHF (Reinforcement Learning from Human Feedback): A training paradigm that aligns models using human preference signals. "less aggressive RLHF, whereas larger models undergo heavier RLHF and safety tuning, which can 'sanitize' outputs toward generic, Western-normative patterns"
- Safety tuning: Additional alignment and constraint procedures applied to models to regulate outputs. "less aggressive RLHF, whereas larger models undergo heavier RLHF and safety tuning, which can 'sanitize' outputs toward generic, Western-normative patterns"
- Self-Ask: A prompting strategy where a model generates and answers sub-questions to improve reasoning. "inspired by Self-Ask"
- Spearman correlation: A rank correlation measure (Spearman’s rho) assessing monotonic relationships between rankings; a short worked example of these alignment metrics appears after this glossary. "We now compare culture-specific LLM topic preferences with the human-authored ones using Spearman and Kendall correlations ($1=$ very similar to humans, $0=$ unrelated, negative $=$ inverse)."
- Topic Preference Alignment: Evaluating alignment by comparing model-vs-human topic rankings to capture content-level curiosity. "Topic Preference Alignment to examine content preferences,"
- WEIRD: An acronym for Western, Educated, Industrialized, Rich, and Democratic populations, highlighting sampling bias. "may bias our study towards WEIRD (Western, Educated, Industrialized, Rich and Democratic) notions"
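To make the alignment metrics above concrete, here is a small worked sketch that computes Spearman/Kendall rank agreement on topic rankings plus L2 distance and MAE on linguistic scores. The topic names and numbers are made up for illustration; they are not the paper’s data.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

# Toy topic rankings for one country (1 = most curious about): human vs. model.
topics = ["Arts", "Beauty", "Business", "Family", "Music"]
human_rank = np.array([1, 2, 3, 4, 5])
model_rank = np.array([4, 5, 3, 1, 2])  # model over-weights Family and Music

rho, _ = spearmanr(human_rank, model_rank)
tau, _ = kendalltau(human_rank, model_rank)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")  # negative here: inverse preferences

# Toy per-metric scores (ambiguity, rhetorical devices, open-endedness, cohesion).
human_scores = np.array([0.62, 0.41, 0.35, 0.71])
model_scores = np.array([0.38, 0.44, 0.66, 0.52])

l2 = np.linalg.norm(human_scores - model_scores)    # Euclidean (L2) gap
mae = np.mean(np.abs(human_scores - model_scores))  # mean absolute error
print(f"L2 distance = {l2:.3f}, MAE = {mae:.3f}")
```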