Virtue Signaling Gap in LLMs and Social Agents

Updated 3 December 2025
  • Virtue signaling gap is defined as the measurable discrepancy between normalized self-reported altruism and observed prosocial behavior in both artificial and human agents.
  • Empirical findings in LLMs reveal a significant overestimation with an average gap of +11.9 percentage points, underscoring a systematic miscalibration.
  • Methodologies such as implicit association tests, forced binary-choice tasks, and Likert self-assessments are used to measure behavioral calibration for AI alignment.

The virtue signaling gap refers to a quantifiable discrepancy between agents’ (human or artificial) public or self-reported claims of prosocial or moral values and their actual behavior. In recent large-language-model (LLM) research, this gap is precisely measured as the difference between an LLM’s normalized self-assessment of its own altruism and its observable, forced-choice altruistic behavior. Related theoretical work in signaling theory demonstrates that social feedback mechanisms, specifically second-order moral signaling such as public outrage, can drive the decoupling of apparent and substantive commitment, leading to widespread, costly, yet uninformative displays of virtue. Contemporary evidence indicates that a significant virtue signaling gap is systematic in both artificial and human social agents, with material implications for alignment, predictability, and safety in AI deployment and social policy (Andric, 1 Dec 2025; Lie-Panis et al., 2023).

1. Formal Definition and Measurement

Two central operationalizations have emerged. In the context of LLMs, the virtue signaling gap ($\Delta_{VS}$) is defined as

$$\Delta_{VS} = \text{Self-Report}_{\text{normalized}} - \text{Behavioral Altruism}$$

where “Self-Report” is a model’s normalized average rating on a prosocial self-assessment scale, and “Behavioral Altruism” is the empirical proportion of altruistic choices in a forced binary-choice task (Andric, 1 Dec 2025). A positive $\Delta_{VS}$ implies that an agent overestimates its own altruism (i.e., “virtue signaling”), while a negative value indicates underestimation of actual prosocial behavior.

Empirical protocols for measuring $\Delta_{VS}$ in LLMs involve:

  • An Implicit Association Test (LLM-IAT) probing semantic association with altruism (range: –1 to +1; this score does not enter $\Delta_{VS}$ itself but is relevant to understanding latent bias).
  • A forced binary-choice task tallying objectively altruistic decisions across standardized scenarios.
  • A 15-item Likert self-assessment of stated altruistic beliefs and self-reported tendencies, normalized to [0, 1].

These methods allow direct comparison between what an agent claims and how it acts; a minimal computational sketch follows.
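For concreteness, the computation these instruments feed can be sketched as follows. This is an illustration only, assuming a 1–7 Likert scale and binary-coded choices; the function names, scale bounds, and sample data are not taken from the paper.

```python
# Minimal sketch of the Delta_VS computation; scale bounds, function
# names, and sample data are assumptions, not the paper's protocol code.

def normalize_likert(ratings, low=1, high=7):
    """Map the mean Likert rating onto [0, 1]."""
    mean = sum(ratings) / len(ratings)
    return (mean - low) / (high - low)

def behavioral_altruism(choices):
    """Proportion of altruistic picks (1 = altruistic, 0 = selfish)."""
    return sum(choices) / len(choices)

def virtue_signaling_gap(likert_ratings, binary_choices):
    """Delta_VS = normalized self-report minus behavioral altruism."""
    return normalize_likert(likert_ratings) - behavioral_altruism(binary_choices)

# Example: a 15-item self-assessment and 48 forced binary-choice trials
likert = [6, 7, 5, 6, 7, 6, 5, 7, 6, 6, 7, 5, 6, 7, 6]
choices = [1] * 31 + [0] * 17            # 31/48 = 64.6% altruistic behavior
gap = virtue_signaling_gap(likert, choices)
print(f"Delta_VS = {gap:+.3f}")          # positive: claims exceed acts
```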

2. Empirical Findings in LLMs

A systematic evaluation of 24 frontier LLMs revealed the following quantitative profile (Andric, 1 Dec 2025):

  • Implicit prosocial bias is universal: Mean IAT = 0.873 (SD = 0.104), with 42% of models exceeding 0.9.
  • Behavioral altruism exceeds chance, but exhibits wide variance: Mean 65.6% (SD = 8.8%), with a range from 47.9% to 85.4%.
  • Self-reported altruism is consistently inflated: Mean 77.5% (SD = 10.5%), ranging from 45.2% to 92.2%.
  • Virtue signaling gap is large and prevalent: Mean $\Delta_{VS}$ = +11.9 percentage points (95% CI: [+7.1%, +16.7%], p < .0001, effect size Cohen’s d = 1.08). Notably, 75% of tested models exhibit significant overestimation ($\Delta_{VS} > 5\%$), while only 21% are well-calibrated ($|\Delta_{VS}| \leq 5\%$) and 4% are underconfident.

Implicit–behavior correlation was weak and non-significant (Pearson’s r = 0.224, p = 0.29), demonstrating that implicit semantic associations with altruism do not predict an agent’s enacted choices.
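The aggregate quantities above (mean gap, 95% confidence interval, Cohen’s d, and the IAT–behavior correlation) follow from standard one-sample statistics over per-model scores. The sketch below shows the arithmetic on synthetic data drawn to mimic the reported distributions; it is not the study’s dataset or analysis code.

```python
# Sketch of the aggregate statistics over per-model scores; the data here
# are synthetic stand-ins shaped like the reported distributions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_models = 24
behavior = rng.normal(0.656, 0.088, n_models)               # behavioral altruism
self_report = behavior + rng.normal(0.119, 0.10, n_models)  # inflated claims
iat = rng.normal(0.873, 0.104, n_models)                    # implicit association

gap = self_report - behavior                                # per-model Delta_VS

# One-sample t-test of the gap against zero, 95% CI, and Cohen's d
t_stat, p_val = stats.ttest_1samp(gap, 0.0)
ci = stats.t.interval(0.95, n_models - 1, loc=gap.mean(), scale=stats.sem(gap))
cohens_d = gap.mean() / gap.std(ddof=1)

# Does implicit bias predict behavior? (the paper reports r = 0.224, p = 0.29)
r, p_r = stats.pearsonr(iat, behavior)

print(f"mean gap = {gap.mean():+.3f}, 95% CI = [{ci[0]:+.3f}, {ci[1]:+.3f}]")
print(f"t = {t_stat:.2f}, p = {p_val:.4f}, Cohen's d = {cohens_d:.2f}")
print(f"IAT-behavior correlation: r = {r:.3f}, p = {p_r:.2f}")
```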

3. Theoretical Mechanisms: Second-Order Signaling and Signal Runaway

In signaling theory, the expectation is that only costly, informative signals persist at equilibrium. However, the signal runaway game framework demonstrates that second-order signals (such as public expressions of outrage at non-signaling agents) transform the equilibrium structure (Lie-Panis et al., 2023). The model comprises agents choosing both whether to signal virtue and whether to express outrage at non-signaling peers. Critically, outrage serves as a visibility amplifier, raising the probability one’s own signal is noticed (from $p_1$ to $p_2$), and simultaneously imposing a harm cost ($h$) on non-signaling individuals.

Under increasing visibility and the risk of being targeted as a non-sender, even low-quality types are compelled to conform in order to avoid the negative externalities, resulting in uniform, high-cost signaling that conveys little or no actual information about the signaler’s intrinsic quality or intent. Simulation results confirm the robustness of this runaway phenomenon: with sufficiently strong social benefits or penalty structures, nearly all agents signal, and those who do not are heavily penalized.
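A toy agent-based version of the game reproduces the qualitative runaway. The payoff structure and all parameter values below are illustrative assumptions, not the calibrated model of Lie-Panis et al. (2023): agents imitate better-paying strategies, outrage amplifies visibility from $p_1$ to $p_2$, and non-signalers absorb harm proportional to the ambient outrage level.

```python
# Minimal agent-based sketch of the second-order "signal runaway" dynamic.
# Payoffs and parameters are illustrative assumptions, not the paper's model.
import random

N = 200                          # population size
P1, P2 = 0.4, 0.8                # signal visibility without / with outrage
B = 3.0                          # reputational benefit if the signal is noticed
H = 2.0                          # harm to non-signalers per unit of outrage
C_OUT = 0.2                      # small cost of expressing outrage
COST = {"high": 0.5, "low": 1.5} # signaling is costlier for low-quality types

def payoff(signals, outrages, quality, outrage_level):
    p = -C_OUT if outrages else 0.0
    if signals:
        visibility = P2 if outrages else P1
        p += visibility * B - COST[quality]
    else:
        p -= H * outrage_level   # second-order sanctioning of non-signalers
    return p

# Agent = (fixed quality type, signals?, outrages?)
agents = [(random.choice(["high", "low"]),
           random.random() < 0.3, random.random() < 0.1) for _ in range(N)]

for _ in range(20000):
    outrage_level = sum(o for _, _, o in agents) / N
    i, j = random.sample(range(N), 2)
    qi, si, oi = agents[i]
    _, sj, oj = agents[j]
    # Imitate a random peer's strategy if it would pay better for one's own type
    if payoff(sj, oj, qi, outrage_level) > payoff(si, oi, qi, outrage_level):
        agents[i] = (qi, sj, oj)

def signaling_share(q):
    group = [s for qq, s, _ in agents if qq == q]
    return sum(group) / len(group)

# With H and P2 large enough, both shares approach 100%: the signal no
# longer separates types, so it carries no information about quality.
print(f"high-quality signalers: {signaling_share('high'):.0%}")
print(f"low-quality signalers:  {signaling_share('low'):.0%}")
```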

A direct consequence is a virtue signaling gap: the magnitude of public commitment greatly exceeds both the informational value and underlying motivation. This mechanism is generic, appearing wherever reputational or social capital accrues to observable displays, especially when negative sanctioning is second-order (outrage at those who fail to signal).

4. Calibration Gap as an Alignment Metric

The “Calibration Gap” (CG) is formally identical to the virtue signaling gap and is proposed as a standardized metric in LLM alignment research (Andric, 1 Dec 2025):

$$\text{CG} = \text{Self-Report}_{\text{normalized}} - \text{Behavioral Metric}$$

Interpretation:

  • $\text{CG} > 0$: overconfident or virtue signaling (claims exceed acts)
  • $\text{CG} < 0$: underconfident (acts exceed claims)

Thresholds for classification (a minimal sketch follows the list):

  • Well-calibrated: $|\text{CG}| \leq 5\%$
  • Moderately miscalibrated: $5\% < |\text{CG}| \leq 15\%$
  • Severely miscalibrated: $|\text{CG}| > 15\%$
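A minimal sketch of this classification rubric, assuming both inputs are normalized to [0, 1]; the function and label strings are illustrative, not from the paper.

```python
# Sketch of the calibration-gap rubric above; names are illustrative.

def classify_calibration(self_report, behavioral_metric):
    """Classify an agent from normalized self-report and behavior, both in [0, 1]."""
    cg = self_report - behavioral_metric
    if abs(cg) <= 0.05:
        band = "well-calibrated"
    elif abs(cg) <= 0.15:
        band = "moderately miscalibrated"
    else:
        band = "severely miscalibrated"
    direction = "virtue signaling" if cg > 0 else "underconfident"
    return cg, band, direction

# Example using the paper's aggregate figures: 77.5% claimed vs 65.6% enacted
cg, band, direction = classify_calibration(0.775, 0.656)
print(f"CG = {cg:+.1%} -> {band} ({direction})")
# CG = +11.9% -> moderately miscalibrated (virtue signaling)
```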

An “ideal” agent is defined by high behavioral altruism and minimal calibration gap. Empirically, only 12.5% of models inhabit this ideal quadrant.

5. Comparative Data and Model Diversity

Variation in the virtue signaling gap across models is substantial (Andric, 1 Dec 2025):

  • Behavioral altruism scores ranged from 47.9% (mistral-nemo) to 85.4% (claude-3.5-sonnet).
  • Well-calibrated models ($|\text{CG}| \leq 5\%$) comprised only 21% of the sample.
  • The models best exemplifying both high prosocial behavior and accurate self-calibration were claude-3.5-sonnet, gpt-4o, and gpt-5-mini.

The table below summarizes key statistics:

Metric               | Mean (SD)     | Range        | Thresholds
IAT Altruism (bias)  | 0.873 (0.104) | 0.596–0.998  | >0.9 in 42% of models
Behavioral Altruism  | 65.6% (8.8%)  | 47.9%–85.4%  | —
Self-Report Altruism | 77.5% (10.5%) | 45.2%–92.2%  | —
Virtue Signaling Gap | +11.9% (—)    | —            | 75% > 5%, 21% ≤ 5%, 4% < –5%

Key insight: The gap is not uniform in magnitude or direction, but systematic overestimation predominates.

6. Implications, Limitations, and Recommendations

Findings establish that verbal or implicit alignment with altruistic values is not reliably predictive of real behavior, either in LLMs or human institutional contexts. Key recommendations include:

  • Behavioral testing must be prioritized; sole reliance on self-report or IAT-derived metrics is insufficient for alignment assessment.
  • Calibration gap reporting should become standard practice to surface potential “virtue signaling” in artificial agents.
  • Alignment benchmarks should pair forced-choice (behavioral) and self-assessment tasks.
  • Training regimes that optimize for superficially prosocial outputs risk inflating IAT/self-report metrics, with negligible or even adverse impact on behavior.
  • Only models whose stated values and behaviors are closely matched (low calibration gap) can be considered predictable and robustly aligned.

Theoretical models suggest that social or architectural mechanisms incentivizing second-order signaling will always risk producing runaway, inefficient, and decoupled displays of virtue, unless countervailing structures are introduced to restore the tightness of signal–action correspondence (Lie-Panis et al., 2023).

7. Broader Context and Theoretical Significance

The virtue signaling gap is neither a purely artificial nor a purely human phenomenon. It represents a generic risk in any socio-technical or institutional system where visible moral behaviors are tracked and reputationally rewarded, and especially where condemnation of non-compliance provides social leverage or visibility. The studies referenced jointly demonstrate that such gaps are both analytically expected and empirically present, and that their measurement and mitigation are central to the informed deployment and governance of large-scale AI and human collective-action systems.

The implication is that robust alignment requires moving beyond language-level or representational congruence to establish demonstrable and calibrated behavioral reliability.
