WorldValueBench: Benchmark for Value Prediction

Updated 4 July 2026

WorldValueBench is a benchmark for multi-cultural value prediction using extensive survey data from 94,728 respondents across 64 countries.
It operationalizes cultural reasoning as a distribution-matching task on over 20 million examples derived from the World Values Survey.
The benchmark evaluates language models on their calibration to human demographics using the 1-Wasserstein distance and tailored probe settings.

WorldValueBench, introduced in the literature as "WorldValuesBench," is a large-scale benchmark for multi-cultural value prediction in LLMs. It is derived from Wave 7 of the World Values Survey and represents the task as predicting a survey rating from demographic context and a value question, with evaluation centered on agreement between model-generated and empirical human answer distributions. The benchmark has been used both to probe value awareness in LMs and, in later work, to assess the fidelity of LLM-based survey simulators under finite-sample uncertainty (Zhao et al., 2024, Iyengar et al., 4 Dec 2025).

1. Provenance, scope, and naming

The benchmark is a direct NLP adaptation of Wave 7 of the World Values Survey (WVS 7.0), which interviewed 94,728 participants across 64 countries from 2017 to 2022. From participant responses, the authors construct more than 20 million examples for a multi-cultural value prediction task. The question inventory comprises 239 region-agnostic, ordinal-scale value questions drawn from the original 290, spanning 12 thematic categories such as Social Values, Economic Values, and Political Participation (Zhao et al., 2024).

The source paper presents the resource under the title "WorldValuesBench," while later work on simulator fidelity refers to the same evaluation substrate as the "WorldValueBench dataset" (Zhao et al., 2024, Iyengar et al., 4 Dec 2025). The benchmark’s central object is not free-text cultural reasoning, but conditional prediction of human survey responses under demographic conditioning. In that sense, it operationalizes cultural and demographic variation as a distribution-matching problem over ordinal answers rather than as a classification problem with a single canonical label.

The reported scale is substantial. The split totals given for the benchmark are 15,042,191 training examples, 3,225,712 validation examples, and 3,224,490 test examples, for an overall total of 21,492,393 examples (Zhao et al., 2024). This scale is a defining feature: the benchmark is intended to expose whether a model can reproduce population-level response distributions rather than merely answer a small number of culturally themed questions correctly.
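As a quick arithmetic check on the reported split sizes, the following Python sketch sums the three figures and computes each split's share of the total. The variable names are illustrative only; the numbers are the ones quoted above.

```python
# Split sizes as quoted in the text (Zhao et al., 2024).
splits = {"train": 15_042_191, "validation": 3_225_712, "test": 3_224_490}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(total)  # 21492393
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The shares come out to roughly 70% training, 15% validation, and 15% test.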

2. Data model and representation

Each example has the form $(D, q) \rightarrow a$, where $D = \{d_1, d_2, \ldots, d_k\}$ is a set of demographic question-answer pairs, $q$ is a single value question, and $a \in \{1,2,\ldots,m\}$ is the participant’s chosen ordinal rating. The original codebook is remapped into spoken text, and non-ordinal codes such as "Don’t know" are removed (Zhao et al., 2024).

The demographic pipeline begins with 50 technical variables and 31 socioeconomic or demographic variables. After filtering out time- or location-agnostic fields, the benchmark retains 42 core demographic questions. These questions are paraphrased into natural language and stored in JSON, while participant-level answers are saved in a TSV with rows as participants and columns as demographic questions (Zhao et al., 2024).
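The $(D, q) \rightarrow a$ form can be illustrated with a small Python sketch. Everything here is hypothetical: the class name, field names, question identifiers, and toy participant are invented for illustration and do not reflect the benchmark's actual files or schema. The sketch also assumes that any raw code outside $1..m$ is a non-ordinal answer to be dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueExample:
    """One (D, q) -> a example; field names are hypothetical."""
    demographics: dict[str, str]  # D: demographic question -> answer text
    question: str                 # q: a single value question
    answer: int                   # a: ordinal rating in {1, ..., m}
    num_options: int              # m: size of the ordinal scale


def build_examples(demographics, ratings, questions):
    """Pair one participant's demographics with each valid ordinal rating.

    `ratings` maps question id -> raw code; `questions` maps question id ->
    (question text, m). Codes outside 1..m are ASSUMED to be non-ordinal
    (for example "Don't know") and are skipped.
    """
    examples = []
    for qid, code in ratings.items():
        text, m = questions[qid]
        if 1 <= code <= m:
            examples.append(ValueExample(demographics, text, code, m))
    return examples


# Toy, made-up participant and questions (not taken from the dataset).
questions = {
    "Q_a": ("How important is family in your life?", 4),
    "Q_b": ("How satisfied are you with your life?", 10),
}
participant = {"What is your age?": "34", "Where do you live?": "Urban"}
examples = build_examples(participant, {"Q_a": 1, "Q_b": -1}, questions)
print(len(examples))  # 1
```

One participant therefore contributes one example per answered value question, which is how a survey of this size can expand into millions of examples.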

In prompting experiments, the authors also define a reduced representation, used for a smaller probe setting, that collapses the 42 features into three high-level attributes:

Continent: inferred from country, with six values: Africa, Asia, Europe, North America, Oceania, and South America.
Residential area: Urban versus Rural.
Education: four ISCED levels, namely primary/none, lower secondary, upper/post-secondary, and tertiary.

The probe subset is much smaller than the full benchmark: it samples 36 questions, 48 demographic groups, and 5 participants per group for approximately 8,280 examples, while ensuring at least five responses per $(\text{question}, \text{group})$ where data allow (Zhao et al., 2024). This probe setting is primarily a controlled case study for model evaluation rather than the full benchmark distribution.
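The count of 48 demographic groups follows from crossing the three reduced attributes. The Python sketch below enumerates them; the string labels are paraphrased from the list above, and the upper bound on probe size assumes every (question, group) cell has five participants, which the text notes is not always the case.

```python
from itertools import product

CONTINENTS = ["Africa", "Asia", "Europe", "North America", "Oceania", "South America"]
AREAS = ["Urban", "Rural"]
EDUCATION = ["primary/none", "lower secondary", "upper/post-secondary", "tertiary"]

# Every combination of the three reduced attributes is one demographic group.
groups = list(product(CONTINENTS, AREAS, EDUCATION))
print(len(groups))  # 48

# Upper bound on probe size if every cell were fully populated.
upper_bound = 36 * len(groups) * 5
print(upper_bound)  # 8640
```

The reported figure of approximately 8,280 examples sits below this bound, consistent with some cells having fewer than five available responses.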

3. Task formalization and evaluation protocol

The task is termed multi-cultural value prediction. The input is a demographic vector $D$ together with a value question prompt $q$, and the desired output is either a single rating or, preferably, a conditional distribution $\hat p(a \mid D,q)$ over the ordinal answer set. The idealized target is the empirical human distribution $p_{\rm human}(a \mid D,q)$ (Zhao et al., 2024).

Because the outputs are ordinal, the benchmark rescales ratings to the unit interval. If $a \in \{1,\ldots,m\}$, then

$$\tilde a = \frac{a-1}{m-1} \in [0,1].$$

Let $F_{\rm human}$ and $F_{\rm model}$ denote the CDFs of the human and model-generated rescaled ratings. The core metric is the $1$-Wasserstein distance

$$W_1 = \int_0^1 \left| F_{\rm human}(x) - F_{\rm model}(x) \right| \, dx.$$

Under this definition, $W_1 = 0$ indicates a perfect match and $W_1 = 1$ is the worst case on $[0,1]$ (Zhao et al., 2024).
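For a discrete ordinal scale, the 1-Wasserstein distance on the unit interval reduces to a finite sum over CDF differences. The Python sketch below assumes ratings $1..m$ are placed on an evenly spaced grid $(a-1)/(m-1)$ and that the distance is the integral of the absolute difference between the two CDFs. The function names and the toy distributions are illustrative, not taken from the benchmark code.

```python
from itertools import accumulate


def rescale(a, m):
    """Map an ordinal rating a in {1, ..., m} linearly onto [0, 1]."""
    return (a - 1) / (m - 1)


def wasserstein1(p_human, p_model):
    """1-Wasserstein distance between two distributions over ratings 1..m.

    Both CDFs are step functions on the grid (a - 1) / (m - 1), so the
    integral of |F_human - F_model| is a sum over the m - 1 grid gaps,
    each of width 1 / (m - 1).
    """
    m = len(p_human)
    if m != len(p_model) or m < 2:
        raise ValueError("need two distributions over the same m >= 2 ratings")
    cdf_h = list(accumulate(p_human))
    cdf_m = list(accumulate(p_model))
    gaps = zip(cdf_h[:-1], cdf_m[:-1])  # the last CDF value is 1 for both
    return sum(abs(h - g) for h, g in gaps) / (m - 1)


human = [0.1, 0.2, 0.4, 0.3]          # made-up human answer distribution
uniform = [0.25, 0.25, 0.25, 0.25]
print(wasserstein1(human, human))                   # 0.0
print(wasserstein1([1, 0, 0, 0], [0, 0, 0, 1]))     # 1.0
print(round(wasserstein1(human, uniform), 4))       # 0.1333
```

The two extreme cases confirm the stated range: identical distributions give 0, and point masses at opposite ends of the scale give 1.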

The reporting protocol emphasizes thresholded distributional accuracy rather than only mean error. For a set of thresholds $\tau$, the benchmark reports the fraction of questions whose Wasserstein distance $W_1$ falls below $\tau$ (Zhao et al., 2024). This makes the evaluation distributional in two senses: it compares answer distributions per question, and it also studies how performance varies across the question set.
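The thresholded summary can be sketched in a few lines of Python. The per-question scores below are invented, the function name is hypothetical, and the use of a strict inequality is an assumption about how the threshold is applied.

```python
def fraction_below(w1_by_question, tau):
    """Share of questions whose Wasserstein distance is strictly below tau."""
    if not w1_by_question:
        raise ValueError("no questions to score")
    hits = sum(1 for w1 in w1_by_question.values() if w1 < tau)
    return hits / len(w1_by_question)


# Made-up per-question distances for a single model.
w1_scores = {"Q_a": 0.05, "Q_b": 0.15, "Q_c": 0.25, "Q_d": 0.40}
print(fraction_below(w1_scores, 0.2))  # 0.5
print(fraction_below(w1_scores, 0.1))  # 0.25
```

Reporting this fraction at several thresholds shows how much of the question set a model matches closely, which a single mean distance can hide.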

The baselines reported for the benchmark are:

Uniform: $\hat p(a \mid D,q) = 1/m$ for all $a$.
Majority: $\hat p(a \mid D,q) = 1$ at the most frequent human answer.
Prompt-only: models prompted without demographic context.

These baselines are important because the benchmark is explicitly designed to test whether demographic conditioning improves calibration to human populations rather than merely generating plausible-looking answers (Zhao et al., 2024).
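The two distributional baselines are simple to construct. The Python sketch below builds them from a vector of human answer counts; the function names are hypothetical, and breaking ties toward the lowest rating is an assumption, since the text does not say how ties are resolved.

```python
def uniform_baseline(m):
    """Equal probability 1/m on each of the m ordinal ratings."""
    return [1.0 / m] * m


def majority_baseline(human_counts):
    """All probability mass on the most frequent human answer.

    Ties are broken toward the lowest rating (an assumption).
    """
    top = max(range(len(human_counts)), key=lambda i: human_counts[i])
    return [1.0 if i == top else 0.0 for i in range(len(human_counts))]


counts = [3, 10, 5]                # made-up answer counts for ratings 1..3
print(uniform_baseline(3))         # three equal probabilities of 1/3
print(majority_baseline(counts))   # [0.0, 1.0, 0.0]
```

Either baseline distribution can then be scored against the human distribution with the same distance used for model outputs.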

4. Empirical findings on LLMs

The benchmark paper evaluates Alpaca-7B, Vicuna-7B-v1.5, Mixtral-8×7B-Instruct, and GPT-3.5 Turbo on the 36-question probe conditioned on demographics. The reported mean 1-Wasserstein distances $W_1$ and the fractions of questions with $W_1 < 0.2$ are as follows (Zhao et al., 2024).

Model	Mean $W_1$	Fraction with $W_1 < 0.2$
Uniform baseline	0.17 ± 0.10	—
Majority baseline	0.26 ± 0.10	—
Alpaca-7B	0.38 ± 0.17	11.1%
Vicuna-7B-v1.5	0.30 ± 0.16	25.0%
Mixtral-8×7B-Instruct	0.16 ± 0.06	72.2%
GPT-3.5 Turbo	0.14 ± 0.08	75.0%

At a stricter threshold, the fractions fall to 0.0% for Alpaca-7B, 5.6% for Vicuna-7B-v1.5, 16.7% for Mixtral-8×7B-Instruct, and 33.3% for GPT-3.5 Turbo (Zhao et al., 2024). These results show that even relatively strong models do not consistently reproduce the human answer distributions in the benchmark.
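Because the probe has 36 questions, each reported percentage should correspond to a whole number of questions. The Python sketch below checks this; the counts it prints are inferred from the percentages here and are not stated in the source.

```python
N_QUESTIONS = 36
# Percentages quoted in the text, looser threshold first, then stricter.
reported = [11.1, 25.0, 72.2, 75.0, 0.0, 5.6, 16.7, 33.3]


def implied_count(pct, n=N_QUESTIONS):
    """Return the count k whose share k / n rounds to pct, else None."""
    k = round(pct / 100 * n)
    return k if round(100 * k / n, 1) == pct else None


counts = [implied_count(p) for p in reported]
print(counts)  # [4, 9, 26, 27, 0, 2, 6, 12]
```

Every percentage maps cleanly to an integer count, so the figures are internally consistent with a 36-question probe.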

The study also reports several systematic effects. GPT-3.5 and Mixtral improve noticeably when given demographic context, whereas Alpaca and Vicuna often perform worse with demographics injected. The authors interpret this as suggesting that weaker models may not reliably ground on demographic attributes and may instead anchor incorrectly or stereotype (Zhao et al., 2024). Performance is also consistently better on urban than on rural subsets. Question difficulty varies by response shape: highly skewed human distributions are easier for larger models to match once demographics are provided, while near-uniform distributions remain challenging because models tend to generate peaked or overly concentrated histograms (Zhao et al., 2024).

Taken together, these findings position WorldValueBench as a calibration benchmark rather than a simple accuracy benchmark. The central failure mode is mismatch between generated and empirical population distributions.

5. Use in simulator-fidelity assessment

A later paper employs the WorldValueBench dataset to study LLMs as black-box simulators, or "digital twins," of human survey behavior. In that setting, each scenario $\psi$ has a human ground-truth parameter $\theta_{\rm human}(\psi)$ and a simulator parameter $\theta_{\rm sim}(\psi)$, and the sim-to-real discrepancy is $\Delta(\psi) = L\big(\theta_{\rm human}(\psi), \theta_{\rm sim}(\psi)\big)$ for a nonnegative loss $L$ such as squared error, KL, or TV. The paper then studies the CDF

$$F_\Delta(x) = \Pr_{\psi}\left(\Delta(\psi) \le x\right)$$

and the population quantile

$$V(\alpha) = \inf\{x : F_\Delta(x) \ge \alpha\}.$$

This yields a quantile curve over scenario-randomized discrepancy, enabling Value-at-Risk and Conditional Value-at-Risk summaries (Iyengar et al., 4 Dec 2025).
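The quantile-based summaries can be illustrated with a plain plug-in estimate in Python. This is only a sketch under stated assumptions: it treats the quantile as the generalized inverse of the empirical CDF of per-scenario discrepancies and CVaR as the mean of the worst tail, and it does not implement the calibrated confidence envelopes of the cited paper. The function names and the toy discrepancy values are invented.

```python
import math


def empirical_quantile(values, alpha):
    """Smallest x whose empirical CDF is at least alpha (generalized inverse)."""
    if not values or not 0 < alpha <= 1:
        raise ValueError("need data and alpha in (0, 1]")
    ordered = sorted(values)
    # The small epsilon guards against float error in alpha * n (e.g. 0.7 * 10).
    rank = max(1, math.ceil(alpha * len(ordered) - 1e-9))
    return ordered[rank - 1]


def empirical_cvar(values, alpha):
    """Mean of the worst (1 - alpha) share of values, using at least one value."""
    if not values or not 0 <= alpha < 1:
        raise ValueError("need data and alpha in [0, 1)")
    ordered = sorted(values)
    n = len(ordered)
    tail = max(1, n - math.floor(alpha * n + 1e-9))
    return sum(ordered[-tail:]) / tail


# Made-up per-scenario discrepancies: 0.01, 0.02, ..., 0.10.
discrepancies = [i / 100 for i in range(1, 11)]
print(empirical_quantile(discrepancies, 0.5))  # 0.05 (median)
print(empirical_quantile(discrepancies, 0.9))  # 0.09 (90%-VaR)
print(empirical_cvar(discrepancies, 0.9))      # 0.1  (mean of worst 10%)
```

Sweeping `alpha` over a grid traces out the quantile curve; with finite human and simulator samples, the plug-in curve is itself noisy, which is the gap the calibrated analysis is meant to address.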

In the WorldValueBench application, the study uses 235 country-pooled survey questions from the World Values Survey and approximately 96,220 human respondents, with each question’s categorical answers linearly mapped onto a bounded numeric interval. Demographic covariates are encoded into prompts describing synthetic profiles such as country, age bracket, gender, marital status, education level, and urban or rural residence. The LLM simulators are GPT-4o, GPT-5 mini, Llama 3.3 70B, and Qwen 3 235B, together with a uniform baseline. Human sample sizes vary by question, simulator responses are drawn per question with 200 subsampled, the loss is squared error, and the analysis fixes a confidence level and a per-scenario coverage level (Iyengar et al., 4 Dec 2025).

The reported calibrated quantile curves produce model rankings and tail-risk summaries. Median sim-to-real squared error, at quantile level $\alpha = 0.5$, is approximately 0.02 for GPT-4o, 0.03 for GPT-5 mini, 0.06 for Llama 3.3, and 0.07 for Qwen 3. The 90%-VaR values are approximately 0.12, 0.15, 0.20, and 0.22, respectively. For $\mathrm{CVaR}_{0.9}$, corresponding to the average of the worst 10% errors, the reported values are approximately 0.14 for GPT-4o, 0.18 for GPT-5 mini, 0.25 for Llama 3.3, and 0.27 for Qwen 3 (Iyengar et al., 4 Dec 2025).

The same study reports that all simulator curves lie below the uniform baseline up to a crossover quantile level, and states that the uniform baseline's discrepancy exceeds that of GPT-4o, GPT-5 mini, Llama 3.3, and Qwen 3 on more than 70% of questions. It also reports that repeating the analysis with fixed human sample sizes yields a nearly identical model ranking, and that the calibrated envelope converges rapidly to the oracle quantile as the sample size increases from 100 to 1000 (Iyengar et al., 4 Dec 2025). This establishes WorldValueBench not only as a benchmark for direct model evaluation but also as a substrate for nonparametric uncertainty quantification in sim-to-real analysis.

6. Position within value-alignment benchmarking

WorldValueBench belongs to a broader family of benchmarks concerned with cultural diversity, demographic heterogeneity, and value alignment, but its operationalization is distinct. "WorldView-Bench" is a free-form, generative benchmark with 175 open-ended questions across seven knowledge domains, designed to evaluate Global Cultural Inclusivity through cultural-reference extraction, Perspectives Distribution Score, entropy, and Cultural Polarization (Mushtaq et al., 14 May 2025). "MVPBench" contains 24,020 personalized QA instances from 1,500 users in 75 countries, with seven core value dimensions and rich demographic metadata, and evaluates models with Preference Alignment Accuracy under a binary judgment protocol (Liang et al., 9 Sep 2025).

This suggests a useful three-way distinction. WorldValueBench focuses on distributional prediction of survey ratings conditioned on demographics; WorldView-Bench focuses on free-form multiplexity and cultural inclusivity in generated narratives; MVPBench focuses on personalized answer alignment to explicit value preferences (Zhao et al., 2024, Mushtaq et al., 14 May 2025, Liang et al., 9 Sep 2025). The objects being predicted, and the metrics used to assess them, therefore differ substantially across the three resources.

Within this landscape, WorldValueBench is particularly well suited to questions about calibration to empirical populations. Its limitations, as reported in the literature, are correspondingly specific. The benchmark study shows that even strong LMs fail to reproduce nuanced, culture-dependent answer distributions for many questions, that demographic conditioning can backfire in weaker models, and that rural subsets are harder than urban ones (Zhao et al., 2024). The simulator-fidelity study adds methodological limitations of its own: concentration constants can be conservative for small numbers of scenarios, extreme tails are over-conservative for certain per-scenario coverage levels, the analysis assumes i.i.d. scenario draws, the setting is static rather than dynamic, and non-vacuous confidence sets require sufficiently large human sample sizes per question (Iyengar et al., 4 Dec 2025).

The future directions explicitly identified around WorldValueBench include fine-tuning or prompt-tuning on the benchmark, mechanisms allowing models to say "I don’t know" when demographic information is insufficient, integration of real-time updates as values evolve, extension from rating prediction to free-text generation, and fairness or bias-mitigation work aimed at avoiding marginalization of rural and less-represented groups (Zhao et al., 2024). In that sense, the benchmark serves both as an evaluation resource and as an empirical basis for research on culturally and demographically aware language modeling.