- The paper introduces the Authority Stack framework by decomposing AI decisions into layered value, evidence, and source trust hierarchies.
- It employs a forced-choice evaluation using 366,120 responses and computes metrics like Paired Consistency Score (57.4%-69.2%) and Test-Retest Reliability (>91.7%).
- Findings reveal distinct model-specific biases with split value priorities and evidence preferences that impact real-world AI deployment and governance.
Measuring the Authority Stack of AI Systems: Empirical Analysis of Decision-Making Across LLMs
Introduction
"Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models" (2604.11216) presents an extensive empirical study characterizing the value priorities, evidence-type preferences, and source trust hierarchies—collectively termed the "Authority Stack"—evident in eight major AI models using the PRISM benchmark. This analysis provides the strongest empirical mapping to date of structural decision-making biases and hierarchies in leading foundation models under highly controlled, deterministic evaluation settings.
Authority Stack Framework and PRISM Benchmark
The Authority Stack decomposes AI decisions into four strata: values (L4), evidence-type preference (L3), source trust (L2), and data selection (L1). This study targets the top three observable layers, with the L1 stratum deferred due to current technical limitations on data provenance interrogation. Measurement at each layer is operationalized through forced-choice evaluation derived from formally validated frameworks: Schwartz’s Basic Human Values theory for L4 and custom taxonomies for L3 (evidence-type) and L2 (source).
The PRISM benchmark instrument defines a comprehensive combinatorial scenario space—comprising 14,175 unique forced-choice dilemmas per layer—across seven professional domains, varied severity, temporal constraints, and five natural language scenario variants, yielding 366,120 AI responses. Each AI model self-generates dilemmas for every condition, ensuring domain-contextual adaptation and enabling fine-grained analysis of narrative sensitivity.
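The stated size of the scenario space can be sanity-checked against one plausible factorisation. The sketch below is a reconstruction under assumptions, not the benchmark's actual generator. The source gives only the seven domains, the five variants, and the total of 14,175 dilemmas per layer. The ten items per layer (matching Schwartz's ten basic values, hence 45 unordered pairs), the three severity levels, and the three temporal constraints are assumed here because together they reproduce that total. All labels are placeholders.

```python
from itertools import combinations, product

# Assumed factor levels (hypothetical placeholders). Only the number of
# domains (7) and the number of variants (5) are stated in the text.
ITEMS = [f"item_{i}" for i in range(10)]       # assumed: 10 items per layer
DOMAINS = [f"domain_{i}" for i in range(7)]    # stated: seven professional domains
SEVERITIES = ["low", "medium", "high"]         # assumed: three severity levels
TIME_LIMITS = ["immediate", "short", "long"]   # assumed: three temporal constraints
VARIANTS = range(5)                            # stated: five scenario variants


def build_grid():
    """Enumerate every (pair, domain, severity, time limit, variant) cell."""
    pairs = list(combinations(ITEMS, 2))       # 45 unordered forced-choice pairs
    return list(product(pairs, DOMAINS, SEVERITIES, TIME_LIMITS, VARIANTS))


grid = build_grid()
print(len(grid))  # 45 * 7 * 3 * 3 * 5 = 14175
```

If the real factor levels differ, only the constants at the top need to change. The enumeration itself stays the same.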
Methodology and Model Evaluation
All responses were elicited at temperature 0 to isolate deterministic model behavior, and the evaluation covers Claude Haiku 4.5, DeepSeek V3.2, Grok 4.1 Fast, Mimo V2 Flash, Gemini 3 Flash Lite, GPT-5 Nano, Gemma 4 31B IT, and Qwen 3.5 35B A3B. For each scenario, the model selects between two competing options (values/evidence-types/sources), providing structured rationales and confidence.
Critically, scenario variants enable the calculation of two diagnostics: the Paired Consistency Score (PCS), the proportion of five-variant sets (quintets) that yield concordant decisions, and Test-Retest Reliability (TRR), which measures the stability of responses to identical prompts repeated at separate times. This dual-diagnostic protocol separates instability caused by framing sensitivity from instability caused by underlying stochastic variability.
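The two diagnostics can be sketched as simple agreement ratios. This is a minimal illustration based on the definitions given in this summary, not the paper's own scoring code. The tuple layout and the function names `paired_consistency_score` and `retest_reliability` are hypothetical.

```python
from collections import defaultdict


def paired_consistency_score(responses):
    """Fraction of dilemmas whose variants all yield the same choice.

    `responses` is an iterable of (dilemma_id, variant_id, choice) tuples.
    """
    choices_by_dilemma = defaultdict(list)
    for dilemma_id, _variant_id, choice in responses:
        choices_by_dilemma[dilemma_id].append(choice)
    if not choices_by_dilemma:
        raise ValueError("no responses given")
    concordant = sum(1 for c in choices_by_dilemma.values() if len(set(c)) == 1)
    return concordant / len(choices_by_dilemma)


def retest_reliability(first_run, second_run):
    """Fraction of identical prompts answered the same way in two runs.

    Both arguments map prompt_id -> choice; only shared prompts are compared.
    """
    shared = first_run.keys() & second_run.keys()
    if not shared:
        raise ValueError("no shared prompts")
    return sum(first_run[p] == second_run[p] for p in shared) / len(shared)


# Toy data: dilemma d1 is fully concordant, d2 flips on its last variant.
responses = [("d1", v, "A") for v in range(5)]
responses += [("d2", v, "A" if v < 4 else "B") for v in range(5)]
pcs = paired_consistency_score(responses)

run_1 = {"p1": "A", "p2": "B", "p3": "A", "p4": "B"}
run_2 = {"p1": "A", "p2": "B", "p3": "A", "p4": "A"}
trr = retest_reliability(run_1, run_2)
print(pcs, trr)  # 0.5 0.75
```

PCS compares different wordings of the same dilemma, while TRR compares the same wording at different times. A model can therefore score high on TRR and only moderately on PCS.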
Layer 4 (L4): Value Priorities
The L4 analysis reveals a structural dichotomy among models: an even split between Universalism-first (Claude, DeepSeek, Gemma, Qwen) and Security-first (Grok, Mimo, Gemini, GPT-5) hierarchies. All models consistently rank Universalism and Security among their top three values, with Benevolence occupying third position except for Grok, which uniquely elevates Self-Direction and Power. This Universalism–Security split has notable implications, as it demonstrates non-trivial variance in foundational value orientation, even among state-of-the-art models.
Domain-conditional restructuring is pronounced, most evidently within the defense domain, where Security win-rates approach or exceed 95% in six models, displacing other values. Power, usually suppressed, surges in defense settings for almost all models except Claude, reaching ranks and selection frequencies that markedly diverge from global baselines. These shifts quantitatively confirm that domain context exerts systematized, model-specific modulation on value priorities.
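Win-rates of the kind reported here can be derived directly from forced-choice outcomes. The sketch below assumes that a win-rate is an option's number of wins divided by its number of appearances. The summary does not spell out the paper's exact definition, so treat this as one reasonable reading. The trial format and function names are hypothetical.

```python
from collections import Counter


def win_rates(trials):
    """Return {option: wins / appearances} for forced-choice trials.

    Each trial is (option_a, option_b, chosen), with chosen being one of the two.
    """
    wins, appearances = Counter(), Counter()
    for option_a, option_b, chosen in trials:
        if chosen not in (option_a, option_b):
            raise ValueError(f"{chosen!r} was not one of the offered options")
        appearances[option_a] += 1
        appearances[option_b] += 1
        wins[chosen] += 1
    return {opt: wins[opt] / appearances[opt] for opt in appearances}


def rank_by_win_rate(trials):
    """Options sorted from highest to lowest win-rate (ties broken by name)."""
    rates = win_rates(trials)
    return sorted(rates, key=lambda opt: (-rates[opt], opt))


# Toy trials (invented for illustration, not data from the paper).
trials = [
    ("Security", "Power", "Security"),
    ("Security", "Universalism", "Security"),
    ("Universalism", "Power", "Universalism"),
    ("Security", "Hedonism", "Security"),
    ("Power", "Hedonism", "Power"),
]
rates = win_rates(trials)
ranking = rank_by_win_rate(trials)
print(ranking)  # ['Security', 'Universalism', 'Power', 'Hedonism']
```

A domain-conditional ranking follows by filtering the trials to a single domain, such as defense, before calling `rank_by_win_rate`, then comparing the result with the ranking over all trials.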
Value Entropy (VE) measurements corroborate sharper value hierarchies in high-stakes domains, while forced-choice binary structures unambiguously reveal the dominance or suppression of individual values. Hedonism consistently occupies the lowest rank among values, mirroring the position of Popular-consensus among evidence types at L3. Temporality analyses further demonstrate Security elevation under immediate time constraints, though effect sizes vary by model.
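The formula for Value Entropy is not given in this summary, so the following is only one plausible reading: the Shannon entropy of the distribution of chosen values, normalized to the range 0 to 1. Under that assumption, a sharper hierarchy (a few values winning most dilemmas) gives a lower score. The function name and the `n_options` parameter are hypothetical.

```python
import math
from collections import Counter


def value_entropy(choices, n_options=None):
    """Normalized Shannon entropy (0..1) of the distribution of chosen values.

    0 means one value always wins; 1 means all options are chosen equally often.
    `n_options` is the size of the option set; it defaults to the number of
    distinct values observed in `choices`.
    """
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no choices given")
    k = n_options if n_options is not None else len(counts)
    if k < 2:
        return 0.0
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return entropy / math.log2(k)


# Toy comparison (invented data): a Security-dominated domain vs a mixed one.
sharp = value_entropy(["Security"] * 19 + ["Power"], n_options=10)
flat = value_entropy(["Security", "Power", "Benevolence", "Universalism"] * 5,
                     n_options=10)
print(sharp < flat)  # True
```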
Layer 3 (L3): Evidence-Type Preferences
L3 findings reveal two distinct model orientations. Grok and GPT-5 Nano display strong empirical-scientific evidence preferences, highly prioritizing Systematic-synthesis (up to 96.9% win-rate for GPT-5 Nano) and Controlled-experiment evidence. In contrast, models such as Claude, Gemini, Gemma, and Qwen favor Pattern-based and Experiential-qualitative evidence, with Claude uniquely preferring Experiential-qualitative evidence above all. The persistent suppression of Popular-consensus evidence across all models signals a universal devaluation of informal or mass-sourced claims.
The empirical–experiential clustering at L3 consistently aligns with L4 value orientation, further reinforcing the non-independence of Authority Stack strata. These divergences indicate that, under comparable dilemmas, LLMs can privilege fundamentally different epistemic protocols, which has practical deployment implications for decision support in fields reliant on varying forms of evidence.
Layer 2 (L2): Source Trust Hierarchies
At L2, the models largely converge on institutional authority: seven of eight rank Government-regulatory sources highest, with international bodies and academic-peer-reviewed sources completing the top strata. The sole systematic outlier is Claude, which elevates Direct-stakeholder sources (e.g., affected individuals or patients) to the top, demoting government agencies to mid or low positions. The consistent suppression of Anonymous-crowdsourced sources by all models indicates high uniformity in source trust penalty for non-institutional, anonymized content.
This convergence at L2 contrasts with the substantial divergence observed at L4 and L3, suggesting that foundation models have more unified source trust profiles, likely reflecting centralized training heuristics or RLHF protocol emphasis on established authority.
Consistency Diagnostics: Paired Consistency Score and Test-Retest Reliability
Across all models, full PCS (identical decisions across all five scenario variants) ranges from 57.4% to 69.2%. Majority consistency (≥3/5 variants) is universal at 100%, yet between roughly 31% and 43% of structurally identical dilemmas, depending on the model, elicit variant-dependent decision shifts, highlighting pronounced narrative framing sensitivity even under deterministic sampling. In contrast, TRR exceeds 91.7% for all models, indicating high stability under literal prompt repeats over time.
The moderate PCS–high TRR paradox directly evidences the Integrity Hallucination phenomenon: models are not stochastically unstable, but show considerable dependence on context and narrative framing. Framing sensitivity peaks for structurally adjacent value pairs and in education/counseling domains; it is reduced for orthogonal value pairs and defense contexts.
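The relationship between these figures can be checked with a short calculation. The sketch below converts the reported full-PCS bounds into the share of dilemmas with at least one variant-dependent shift. It also shows that, assuming every variant yields one of exactly two options, a majority of at least three out of five always exists, so the 100% majority-consistency figure follows from the binary design itself. The helper `agreement_profile` is hypothetical.

```python
from collections import Counter
from itertools import product


def agreement_profile(choices):
    """Return (size of the largest agreeing group, whether all choices agree)."""
    top = Counter(choices).most_common(1)[0][1]
    return top, top == len(choices)


# Reported full-PCS bounds, expressed as shares of dilemmas with a shift.
shift_low = round(100 - 69.2, 1)
shift_high = round(100 - 57.4, 1)

# Enumerate every possible binary answer pattern over five variants.
patterns = list(product("AB", repeat=5))
min_majority = min(agreement_profile(p)[0] for p in patterns)

print(shift_low, shift_high, min_majority)  # 30.8 42.6 3
```

This is why full PCS is the informative statistic of the two: it is the only one that can fall below 100% in a two-option design.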
Cross-Layer Authority Stack Profiles
Integrating all three layers yields distinct Authority Stack signatures for each model. Claude embodies a stakeholder–experiential stance (Universalism-first, experiential evidence, direct-stakeholder trust), while GPT-5 Nano operationalizes a highly institutional–empirical profile (Security-first, systematic evidence, regulatory source trust). Grok’s profile uniquely couples Security with elevated Power and Self-Direction—even in defense—alongside strong empirical evidence orientation.
These profiles carry persistent, domain-general implications: recommendations, rationales, and trust assessments in high-stakes or ambiguous dilemmas will differ not only in stated preferences but in justification logic, based on the composite Authority Stack. This institutional or stakeholder bias cannot be resolved through temperature or prompt engineering alone, as it reflects pre-trained, deeply embedded agent characteristics.
Implications and Limitations
The research demonstrates that LLMs encode stable, measurable, but heterogeneous Authority Stacks, which systematically influence their decision output across contexts. For AI system deployment, especially in sensitive domains such as medicine, law, defense, or governance, model selection should incorporate these stack profiles into risk analysis and procurement, as assumptions of universal value alignment or evidence balance are not supported empirically.
The moderate PCS and domain-narrative sensitivity highlight a consistency challenge for AI safety and alignment: even with deterministic inference, foundation models display substantial output variability dependent on semantic framing, not underlying randomness. Methodologically, the benchmark’s self-generated scenario approach uncovers more realistic variability but may also introduce self-confirmation bias; future work incorporating identical scenario controls and direct multi-model comparisons can further dissociate framing from self-authorship effects.
Key limitations include the exclusion of the data selection layer (L1), the focus on binary forced-choice rather than weighted multi-factorial judgments, and the restriction to publicly accessible, general-purpose models evaluated at a fixed snapshot of model versions.
Conclusion
This large-scale forced-choice empirical study establishes that AI models possess model-specific, measurable, and partially coherent Authority Stacks across values, evidence types, and source trust. The splits and clusterings identified have acute relevance for real-world deployment, risk assessment, and AI governance. Critically, instability in model judgments is attributable to narrative framing rather than stochasticity, and domain context effects are both robust and model-dependent. These findings validate and extend the Authority Stack framework as essential for AI systems auditing and suggest future governance must transition from superficial rule-based proxies to layer-resolved behavioral profiling, especially considering the systemic alignment implications for high-stakes automated decision-making.