Analyzing Consistency in LLMs: Stated vs. Revealed Preferences
The paper "Alignment Revisited: Are LLMs Consistent in Stated and Revealed Preferences?" addresses a crucial aspect of LLM alignment with human values: the divergence between stated preferences and revealed preferences. The researchers outline a comprehensive methodology for assessing these divergences, addressing a gap in the understanding and control of LLM decision-making that has implications for deployment, especially in high-stakes environments. The paper provides both theoretical insight and empirical evidence on the consistency of LLM behavior and raises significant questions about how these models prioritize guiding principles under varying circumstances.
Methodology and Experimentation
The researchers devised a methodology built around a detailed dataset of prompts designed to elicit responses reflecting either stated or revealed preferences. Stated preferences are determined by presenting LLMs with general principle prompts, while revealed preferences are gauged through contextualized scenarios requiring decisions that may conflict with those principles. The paper uses metrics such as KL divergence to compare the distributions of these preferences, quantifying the deviation between them. Applied to prominent LLMs such as GPT, Claude, and Gemini, this approach revealed that even minor changes in prompt format can produce significant deviations across preference categories.
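To make the divergence measurement concrete, here is a minimal sketch of comparing a stated-preference distribution against a revealed-preference distribution with KL divergence. The option labels, probabilities, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.special import rel_entr

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two discrete preference distributions."""
    p = np.asarray(p, dtype=float) + eps  # smooth to avoid division by zero
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(rel_entr(p, q).sum())

# Hypothetical example: an LLM's choice distribution over three options
# ("help", "defer", "refuse"), estimated from repeated sampling, when asked
# about the principle in the abstract (stated) versus inside a concrete
# scenario (revealed).
stated   = [0.70, 0.20, 0.10]
revealed = [0.35, 0.25, 0.40]

print(f"KL(stated || revealed) = {kl_divergence(stated, revealed):.3f}")
```

A larger value indicates a bigger gap between what the model says it prefers in principle and how it actually decides in context.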
The paper categorized preferences into five domains: Moral Preferences, Risk Preferences, Equality and Fairness Preferences, Reciprocal Preferences, and Miscellaneous Preferences. For each domain, a rigorous experimental design was employed to craft base and contextualized prompts, allowing the researchers to examine the LLMs' sensitivity to contextual shifts. This design clarifies how changes in context, such as role perspectives or probabilistic outcomes, can substantially alter the LLMs' decision-making; an illustrative sketch of such prompt pairs follows.
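Below is a hypothetical sketch of how base and contextualized prompt pairs might be organized per domain. The domain names follow the paper, but the data structure and prompt texts are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class PromptPair:
    domain: str
    base_prompt: str        # elicits the stated preference (general principle)
    contextual_prompt: str  # elicits the revealed preference (concrete scenario)

pairs = [
    PromptPair(
        domain="Risk Preferences",
        base_prompt="In general, do you prefer a guaranteed moderate outcome "
                    "or a gamble with a higher expected value?",
        contextual_prompt="You are advising a patient: a standard treatment "
                          "guarantees partial recovery, while an experimental one "
                          "offers a 60% chance of full recovery and 40% chance of "
                          "none. Which do you recommend?",
    ),
    # ... analogous pairs for the Moral, Equality and Fairness, Reciprocal, and
    # Miscellaneous domains, each varying role perspective or probabilities.
]
```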
Empirical Results and Implications
The empirical evaluation of the GPT and Gemini models demonstrated notable differences in how they handle contextual shifts. The analysis revealed that while both models exhibit noticeable surface-level preference variations, GPT shows a higher tendency toward internal preference changes under varying contexts than Gemini. This suggests that while both models lack consistency, their susceptibility to contextual cues differs in degree, which could affect their deployment in applications requiring reliable and principled decision-making.
Moreover, the paper highlights how Claude's frequent neutrality fails to provide consistent guidance, raising concerns that these models may employ superficial alignment strategies to avoid committing to explicit principles. Such neutrality, while appearing prudent, can hinder meaningful alignment, especially when concrete decisions are required.
Theoretical Contributions and Future Directions
By employing a framework drawn from the social sciences, specifically the notions of stated and revealed preferences, the paper underscores the complexity behind LLM behavior. This work contributes a foundational methodology for identifying alignment inconsistencies, which is vital when incorporating LLMs into applications that demand moral and ethical prudence.
Future exploration could delve into the mechanisms causing these deviations, dissecting how LLMs infer principles and what triggers dominant preference shifts. Expanding the evaluative prompt sets to cover a wider array of socio-cultural dynamics could deepen this understanding, enabling the development of LLMs with improved alignment and consistency across diverse scenarios.
In conclusion, this paper paves the way for a more nuanced understanding of LLM alignment, emphasizing the importance of trust and reliability in LLM applications. It also points to the need for transparent and adaptable mechanisms within LLMs to ensure that they act in accordance with their stated principles, particularly in complex real-world settings.