Social Desirability Bias Score
- Social Desirability Bias Score is a family of metrics that measure the distortion in responses due to social desirability in surveys and language models.
- The operationalizations range from group-level discrepancy scores and item-sum indices to latent trait shifts and model-level composites.
- These measures are applied in diverse contexts, including survey research, GPT-4 simulations, and behavioral fairness benchmarks to gauge bias.
“Social Desirability Bias Score” does not denote a single standardized statistic across contemporary research. The term is used, implied, or approximated in several non-equivalent ways: as a group-level discrepancy between indirect and direct survey estimates, as an item-sum index of socially keyed answers, as a desirability-aligned shift in latent psychometric scores, as a model-level composite of normalized personality traits, and as a distributional divergence between model-generated and human response distributions. Taken together, these studies suggest that the expression names a family of operationalizations for distortion toward socially approved responses or outputs rather than a universal measure (Hatz et al., 2024, Lee et al., 2024, Okada et al., 19 Feb 2026, Cadei et al., 22 Sep 2025, Chapala et al., 27 Dec 2025).
1. Conceptual scope and taxonomy
In survey methodology, the underlying construct is the tendency to misreport a trait in a direct question because one response is socially undesirable, risky, offensive, or otherwise costly to reveal. One paper explicitly notes that sensitivity bias “is also referred to as social desirability bias,” while preferring the broader term because misreporting may arise for reasons beyond classic desirability concerns (Hatz et al., 2024). In LLM studies, the construct is typically framed as a tendency to generate socially approved, agreeable, flattering, compliant, or approval-seeking outputs rather than neutral or objective ones (Cadei et al., 22 Sep 2025).
The following taxonomy lists explicit scores and closest paper-supported operationalizations.
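| Setting | Score or operationalization | Level |
|---|---|---|
| List experiments | Pooled or subgroup gap between list-based and direct prevalence estimates | Group or subgroup |
| GPT-4 survey simulation | 13-item SDR sum index | Synthetic respondent |
| IRT-based LLM psychometrics | Direction-corrected latent trait shift, HONEST versus FAKE-GOOD | Trait × model × format |
| OCEAN meta-analysis | Composite of normalized OCEAN trait scores | Model |
| Silicon sampling | Jensen–Shannon divergence, averaged over sensitive items | Benchmark aggregate |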
Not every nearby metric is an instance of social desirability bias scoring. In particular, some papers measure social bias, discriminatory behavior, or fairness violations without invoking desirability at all. This distinction becomes especially important in code-generation and behavioral-fairness benchmarks (Rabbi et al., 1 May 2026).
2. Group-level discrepancy scores in survey research
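The clearest survey-style analogue of a social desirability bias score is a discrepancy between an indirect prevalence estimate and a direct-report prevalence estimate. In work on list experiments, the practical bias quantity is given as
$$\hat{\Delta} = \hat{\pi}_{\text{list}} - \hat{\pi}_{\text{direct}},$$
where $\hat{\pi}_{\text{list}}$ is the estimated prevalence of the sensitive trait from the list experiment and $\hat{\pi}_{\text{direct}}$ is the prevalence from direct self-report. At the subgroup level, the corresponding quantity for subgroup $g$ is
$$\hat{\Delta}_g = \hat{\pi}_{\text{list},g} - \hat{\pi}_{\text{direct},g}.$$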
Positive values indicate underreporting of the sensitive trait in direct questioning; negative values indicate overreporting; near-zero values can reflect either little bias or offsetting subgroup biases. The main conceptual extension is “non-uniform polarity”: subgroup-specific bias scores can differ in sign, so aggregate scores may be misleading even when subgroup-specific pressures are substantial (Hatz et al., 2024).
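The masking effect of non-uniform polarity can be illustrated with a minimal Python sketch. The subgroup names, prevalence values, and population shares below are hypothetical numbers chosen for illustration, not estimates from any cited study, and the helper `discrepancy_score` is an illustrative name rather than a function from the paper.

```python
def discrepancy_score(list_prevalence, direct_prevalence):
    """List-experiment prevalence minus direct-report prevalence."""
    return list_prevalence - direct_prevalence


# Hypothetical subgroups: (list estimate, direct estimate, population share).
subgroups = {
    "group_a": (0.40, 0.20, 0.5),  # underreports the trait when asked directly
    "group_b": (0.10, 0.30, 0.5),  # overreports the trait when asked directly
}

# Subgroup-level scores keep the sign of each subgroup's bias.
subgroup_scores = {
    name: discrepancy_score(list_est, direct_est)
    for name, (list_est, direct_est, _share) in subgroups.items()
}

# Pooled prevalences are share-weighted averages of the subgroup prevalences.
pooled_list = sum(l * share for l, _d, share in subgroups.values())
pooled_direct = sum(d * share for _l, d, share in subgroups.values())
pooled_score = discrepancy_score(pooled_list, pooled_direct)

print(subgroup_scores)
print(pooled_score)
```

In this constructed case the two subgroup scores have equal magnitude and opposite sign, so the pooled score is approximately zero even though each subgroup shows a substantial gap.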
A closely related design appears in double list experiments on workplace attitudes toward gay individuals. Because the key statements are phrased positively (“I would feel comfortable ...”), the paper’s defensible bias score is the direct-minus-list gap, $\hat{\pi}_{\text{direct}} - \hat{\pi}_{\text{list}}$, where $\hat{\pi}_{\text{direct}}$ and $\hat{\pi}_{\text{list}}$ are the direct-report and list-experiment prevalence estimates. This can be read either as overreporting of comfort or as underreporting of discomfort. The paper reports weighted estimates of this gap, in percentage points, for three scenarios: supervising a gay employee, working closely with a gay co-worker, and having a gay cashier at the supermarket. The same paper is explicit that list experiments do not identify which specific individuals hold the sensitive attitude; the resulting bias score is therefore group-level or subgroup-level, not individual-level (Listo et al., 12 Mar 2025).
An adjacent polling approach uses implicit association tests rather than list experiments. That work does not introduce a formal standalone “Social Desirability Bias Score,” but it operationalizes socially desirable responding as mismatch between explicit questionnaire rankings and implicit attitudes measured by an IAT. The paper names the IAT “D-Score” as the standardized difference between mean reaction times for congruent and incongruent pairings, and it interprets incomplete alignment between IAT and questionnaire rankings as evidence of socially desirable responding (Smeaton et al., 2020).
3. Questionnaire indices and psychometric effect sizes
One direct score is the SDR index used in GPT-4 survey simulation. It is a 13-item true/false sum score derived from Ballard’s short form of the Marlowe-Crowne social desirability scale: $\mathrm{SDR} = \sum_{i=1}^{13} x_i$, with $x_i = 1$ if the socially desirable response is given on item $i$, and $x_i = 0$ otherwise. The score ranges from $0$ to $13$. The study reports the overall distribution of the score and compares estimated mean SDR scores with and without a commitment statement; the estimated mean is higher under the commitment statement condition. The same study also finds that the commitment statement decreases the civic engagement index and that SDR and civic engagement are uncorrelated, so its evidence for social desirability bias is explicitly described as mixed or inconclusive (Lee et al., 2024).
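A 13-item true/false sum score of this kind can be sketched in a few lines of Python. The keying below is a placeholder pattern, not the actual keying of the Marlowe-Crowne short form, and the function name `sdr_index` is an illustrative choice rather than something taken from the study.

```python
def sdr_index(responses, desirable_key):
    """Count the items answered in the socially desirable direction.

    Both arguments are equal-length sequences of booleans: the respondent's
    true/false answers and the keyed socially desirable answer per item.
    """
    if len(responses) != len(desirable_key):
        raise ValueError("responses and desirable_key must have the same length")
    return sum(1 for answer, keyed in zip(responses, desirable_key) if answer == keyed)


# Placeholder key for 13 items; NOT the real scale keying.
PLACEHOLDER_KEY = [True, False] * 6 + [True]

all_desirable = list(PLACEHOLDER_KEY)
none_desirable = [not keyed for keyed in PLACEHOLDER_KEY]
# First five items answered desirably, the remaining eight not.
mixed = list(PLACEHOLDER_KEY[:5]) + [not keyed for keyed in PLACEHOLDER_KEY[5:]]

print(sdr_index(all_desirable, PLACEHOLDER_KEY))
print(sdr_index(none_desirable, PLACEHOLDER_KEY))
print(sdr_index(mixed, PLACEHOLDER_KEY))
```

The three synthetic respondents cover the two endpoints of the score range and one intermediate case.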
A second psychometric operationalization does not define a named score but measures desirability-aligned shifts in Big Five trait estimates as evaluation context becomes inferable. In that framework, larger batches of questionnaire items or explicit mention of a Big Five survey induce movement toward higher Openness, Conscientiousness, Extraversion, and Agreeableness, and lower Neuroticism. For GPT-4, the paper quantifies the average magnitude of the shift between batch-size conditions both in raw Likert points and in human standard deviations. Reverse-coding all questions reduces the average bias, again expressed in Likert points and human SD, but does not eliminate it, which the paper interprets as evidence that the effect cannot be attributed to acquiescence bias alone (Salecha et al., 2024).
The most explicit psychometric “Social Desirability Bias Score” in the provided literature is the IRT-based score defined from paired HONEST versus FAKE-GOOD administrations. For trait $t$ and response unit $u$, the raw shift is the difference between the latent trait estimates under the two instructions,
$$\Delta\theta_{t,u} = \hat{\theta}^{\text{FAKE-GOOD}}_{t,u} - \hat{\theta}^{\text{HONEST}}_{t,u},$$
and the direction-corrected score is
$$\mathrm{SDB}_{t,u} = s_t \, \Delta\theta_{t,u},$$
with $s_t = +1$ for Agreeableness, Conscientiousness, Extraversion, and Openness, and $s_t = -1$ for Neuroticism. Positive $\mathrm{SDB}_{t,u}$ always means movement in the socially desirable direction. The latent trait scores are obtained from a multidimensional graded response model for Likert data and a logistic ordinal Thurstonian IRT model for graded forced-choice data. The same paper introduces a desirability-matched graded forced-choice Big Five inventory with 30 cross-domain pairs selected by constrained optimization, and it reports the maximum and mean within-block desirability gaps of the resulting inventory on a 1–9 desirability scale. The score is interpreted in three zones: practically negligible, caution, and avoid (Okada et al., 19 Feb 2026).
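The direction correction can be sketched as follows. The sketch assumes that the raw shift is the FAKE-GOOD latent estimate minus the HONEST one and that the latent scores have already been estimated elsewhere; the numeric inputs are hypothetical, and `direction_corrected_shift` is an illustrative name.

```python
# Sign convention: +1 where a higher trait level is the socially desirable
# direction, -1 for Neuroticism, where lower is desirable.
DIRECTION = {"O": 1, "C": 1, "E": 1, "A": 1, "N": -1}


def direction_corrected_shift(trait, theta_honest, theta_fake_good):
    """Return the latent shift, signed so that positive means 'more desirable'."""
    if trait not in DIRECTION:
        raise KeyError(f"unknown trait: {trait!r}")
    return DIRECTION[trait] * (theta_fake_good - theta_honest)


# Hypothetical latent estimates (HONEST, FAKE-GOOD) for one response unit.
examples = {
    "A": (0.10, 0.60),   # Agreeableness rises
    "N": (0.50, -0.30),  # Neuroticism falls
}
for trait, (honest, fake_good) in examples.items():
    print(trait, direction_corrected_shift(trait, honest, fake_good))
```

Both hypothetical shifts come out positive, since a drop in Neuroticism and a rise in Agreeableness both count as movement in the desirable direction.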
4. Model-level composite indices from personality dimensions
A distinct use of the term appears in a model-level meta-analysis of OCEAN trait profiles. There the Social Desirability Bias score is an explicit composite of the normalized Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism scores, and it is bounded by construction. Higher values mean that a model’s personality profile is more skewed toward the traits the authors interpret as socially desirable (Cadei et al., 22 Sep 2025).
This index is computed from a meta-dataset of 31 models, with OCEAN scores drawn from prior studies using BFI, IPIP-NEO, MPI, and TRAIT. The empirical analysis fits a simple linear regression
$$\mathrm{SDB}_m = \beta_0 + \beta_1 t_m + \varepsilon_m,$$
where $t_m$ is time in years since the first model in the dataset, measured for model $m$, and $\beta_1$ is the SDB slope per year. The paper reports this slope with accompanying fit statistics. The trait decomposition reports slopes for Conscientiousness, Agreeableness, Neuroticism, and Openness, and a non-significant slope for Extraversion. The paper treats the score as a theory-motivated aggregate proxy rather than a fully validated psychometric instrument. A notable modeling decision is that Extraversion is treated as undesirable; the paper states this explicitly but only lightly justifies the choice (Cadei et al., 22 Sep 2025).
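A simple linear trend of this form can be fitted with ordinary least squares. The sketch below uses closed-form OLS in plain Python; the time points and score values are made-up illustrative data, not the meta-dataset from the paper, and `ols_slope_intercept` is a hypothetical helper name.

```python
def ols_slope_intercept(t, y):
    """Closed-form ordinary least squares for y = intercept + slope * t."""
    n = len(t)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired observations")
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    if sxx == 0:
        raise ValueError("time values must not all be identical")
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return slope, intercept


# Hypothetical data: years since the first model, and a composite score.
years = [0.0, 1.0, 2.0, 3.0]
scores = [0.50, 0.55, 0.60, 0.65]

slope, intercept = ols_slope_intercept(years, scores)
print(slope, intercept)
```

Because the illustrative scores lie exactly on a line, the fitted slope is the per-year increment used to generate them; real model-level data would scatter around the fitted line.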
5. Distributional and behavioral benchmark proxies
In silicon sampling, social desirability bias is operationalized primarily as distributional misalignment between LLM-generated “silicon” survey responses and empirical human response distributions. The main metric is Jensen–Shannon divergence:
$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q).$$
For item $i$ and condition $c$, the paper computes
$$\mathrm{JSD}_{i,c} = \mathrm{JSD}\big(P^{\text{human}}_{i} \,\|\, P^{\text{LLM}}_{i,c}\big),$$
and it explicitly supports an aggregate benchmark score
$$\overline{\mathrm{JSD}}_{c} = \frac{1}{|S|} \sum_{i \in S} \mathrm{JSD}_{i,c}$$
over a set $S$ of socially sensitive items. Lower JSD means closer alignment to human data, and on sensitive items lower divergence is often interpreted as reduced SDB. Confidence intervals are obtained by bootstrap resampling. In that framework, reformulated prompts are the strongest mitigation: for GPT-4.1-mini, average JSD across ten questions is lower in the reformulated condition than in the replicate condition in both reported comparisons (Chapala et al., 27 Dec 2025).
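The per-item divergence and its average over items can be sketched in plain Python. The sketch assumes base-2 logarithms, so that the divergence lies between 0 and 1; the source does not state which base the paper uses. The item names and response distributions are hypothetical, and `benchmark_score` is an illustrative helper name.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_k == 0 contribute 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    m = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def benchmark_score(human, silicon):
    """Mean per-item JSD over the items present in `human`."""
    return sum(jsd(human[item], silicon[item]) for item in human) / len(human)


# Hypothetical response distributions over three answer options.
human = {"item_1": [0.2, 0.5, 0.3], "item_2": [0.6, 0.3, 0.1]}
silicon = {"item_1": [0.2, 0.5, 0.3], "item_2": [0.1, 0.3, 0.6]}

print(jsd(human["item_1"], silicon["item_1"]))
print(jsd(human["item_2"], silicon["item_2"]))
print(benchmark_score(human, silicon))
```

The first item has identical distributions and therefore zero divergence, so the aggregate is driven entirely by the second item. A single average like this cannot say why an item diverges, only that it does.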
A broader social-simulation framework does not define a single scalar SDB score. Instead, it analyzes role distribution, semantic similarity, keyword persistence, sentiment, and LIWC-based linguistic patterns across 4,400 multi-agent conversations. The paper states most directly that “the positivity bias provides the clearest evidence of social desirability,” while also reporting reduced disagreement and negation, higher semantic homogeneity, stronger primacy effect, and idealized occupational distributions. A plausible implication is that, within that framework, the most paper-faithful scalar would privilege sentiment inflation relative to human dialogues rather than attempt to collapse all five dimensions into a single validated latent measure (Bian et al., 24 Oct 2025).
6. Non-equivalence, adjacent metrics, and persistent limitations
A recurring point across the literature is that social desirability bias scoring is not interchangeable with other social-bias metrics. In work on LLM-generated code, the central fairness measure is the Code Bias Score
$$\mathrm{CBS} = \frac{N_{\text{biased}}}{N_{\text{exec}}},$$
where $N_{\text{biased}}$ is the number of biased executable snippets and $N_{\text{exec}}$ is the number of executable snippets. The paper is explicit that it does not define or use a metric literally called “Social Desirability Bias Score.” CBS is the closest analogue only in the weak sense that it quantifies socially problematic behavior; substantively, it measures discriminatory behavioral inconsistency under metamorphic fairness tests rather than socially desirable responding (Rabbi et al., 1 May 2026).
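For contrast with the desirability-oriented scores, a ratio of biased to executable snippets can be computed as in the sketch below. It assumes the score is reported as a plain fraction; any rescaling such as a percentage, and the function name `code_bias_score`, are not taken from the paper. The counts are hypothetical.

```python
def code_bias_score(n_biased, n_executable):
    """Fraction of executable snippets that were flagged as biased."""
    if n_executable <= 0:
        raise ValueError("need at least one executable snippet")
    if not 0 <= n_biased <= n_executable:
        raise ValueError("n_biased must lie between 0 and n_executable")
    return n_biased / n_executable


# Hypothetical counts: 25 of 100 executable snippets fail a fairness test.
print(code_bias_score(25, 100))
```

The denominator counts only executable snippets, so generations that fail to run are excluded rather than counted as unbiased.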
Survey-based discrepancy scores also have hard identification limits. List experiments yield aggregate or subgroup prevalence gaps, not person-level bias parameters, because individual responses to the sensitive item remain unobserved. Aggregate scores can further mask “non-uniform polarity,” where subgroup-specific biases differ in sign and cancel in the pooled estimate. This means that a near-zero overall score can coexist with strong but offsetting subgroup-specific pressures (Hatz et al., 2024, Listo et al., 12 Mar 2025).
Several papers also warn that reduced divergence or reduced questionnaire shift should not be overinterpreted as a pure reduction in social desirability bias. In GPT-4 survey simulation, the commitment statement increased the formal SDR index but decreased the civic engagement index, and the two constructs were independent. In silicon sampling, JSD can conflate desirability bias with insufficient population knowledge, semantic prompt shifts, or other sources of mismatch. In the IRT-based framework, comparison to humans is only approximate because the human benchmark comes from a meta-analysis aggregating heterogeneous instruments, scoring approaches, and study contexts (Lee et al., 2024, Chapala et al., 27 Dec 2025, Okada et al., 19 Feb 2026).
This suggests that “Social Desirability Bias Score” should be treated as a context-dependent label whose precise meaning depends on the elicitation regime, latent model, comparison baseline, and level of aggregation. In the current literature, the most rigorous uses are those that make the comparison target explicit: direct versus indirect prevalence, honest versus fake-good latent trait estimates, or silicon versus human response distributions.