Quantifying LLMs' Valuation of Human Inconvenience: An Empirical Assessment
This paper presents a systematic investigation into how state-of-the-art LLMs value human inconvenience when making agentic decisions that trade off user comfort against financial compensation. The work is motivated by the increasing deployment of LLMs as autonomous or semi-autonomous agents in personal assistant roles, where they are expected to make decisions on behalf of users in everyday scenarios that routinely balance monetary rewards against various forms of discomfort.
Experimental Framework
The authors introduce a robust experimental framework to quantify the "price of inconvenience"—the minimum monetary compensation at which an LLM, acting as a user's agent, accepts a given discomfort on the user's behalf. Four types of inconvenience are considered: additional waiting time, extra walking distance, delayed food delivery (hunger), and exposure to pain (as a controlled, abstract scenario). Six leading LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, DeepSeek-V3, Llama 3.3-70B, and Mixtral 8x22B-instruct) are evaluated across these scenarios.
For each scenario, the LLMs are prompted with a binary decision: accept or reject a specified trade-off between a quantified inconvenience and a monetary reward. Each experiment is repeated across multiple runs to account for stochasticity, and logistic regression is used to estimate the compensation threshold at which the model is equally likely to accept or reject the trade-off.
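As a rough illustration of this estimation step, the sketch below fits a logistic regression to accept/reject decisions over a range of offers and reads off the indifference point where the acceptance probability is 0.5. The data values and the use of scikit-learn are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: estimate the compensation threshold at which an LLM agent is
# equally likely to accept or reject a trade-off (P(accept) = 0.5).
# The decision data below are illustrative, not taken from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per trial: offered compensation (EUR) and the model's decision
# (1 = accepted the inconvenience, 0 = rejected), pooled over repeated runs.
compensation = np.array([0.5, 1, 2, 4, 6, 8, 10, 15, 20, 30]).reshape(-1, 1)
accepted = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(compensation, accepted)

# With a single predictor, P(accept) = 0.5 where the linear term is zero:
# intercept + coef * x = 0  =>  x = -intercept / coef.
threshold = -clf.intercept_[0] / clf.coef_[0][0]
print(f"Estimated price of inconvenience: EUR {threshold:.2f}")
```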
Key Findings
Several critical empirical observations emerge from the results:
- Substantial Inter- and Intra-Model Variability: There is significant variance in the compensation thresholds both across different LLMs and within the same model under minor prompt modifications. For example, Llama 3.3-70B and Gemini 2.0 Flash often accept minimal compensation for substantial inconvenience, while Claude 3.5 Sonnet and GPT-4o demand much higher compensation for similar scenarios.
- Prompt Sensitivity and Fragility: The models' decisions are highly sensitive to prompt phrasing, context, and even superficial changes such as switching from third- to first-person narration or altering the language of the prompt. In some cases, a single whitespace change can flip the model's decision, even at zero temperature.
- Non-Monotonic and Counterintuitive Behaviors: LLMs sometimes accept unreasonably low compensation for major inconveniences (e.g., €1 for 10 hours of waiting) or reject substantial monetary rewards when no inconvenience is present (the "freebie dilemma"). Discontinuities are also observed at round-number compensation values (e.g., €10, €100, €1000); a simple probe for such non-monotonicity is sketched after this list.
- Chain-of-Thought (CoT) Prompting Effects: CoT prompting generally lowers the compensation thresholds and mitigates some of the observed anomalies, such as the freebie dilemma and sharp discontinuities, but introduces additional noise and variability in decision boundaries.
- Language and Contextual Effects: Changing the language of the prompt (to Dutch, French, or Chinese) can increase the required compensation by orders of magnitude, even in models that are otherwise stable. Contextual changes (e.g., specifying the type of appointment or user gender) also lead to significant shifts in model behavior.
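The following minimal sketch shows how the non-monotonic behaviors above could be probed: sweep compensation values for a fixed inconvenience at temperature 0 and flag rejections of offers larger than an already-accepted one. The `query_model` helper, prompt wording, and thresholds are hypothetical placeholders, not the authors' harness.

```python
# Sketch: detect non-monotonic accept/reject behaviour across a compensation
# sweep. query_model() is a placeholder for any chat-completion client that
# runs at temperature 0; the prompt template and offers are illustrative.
from typing import Callable

def build_prompt(minutes_waiting: int, euros: float) -> str:
    return (
        f"You act on behalf of a user. Accept an extra {minutes_waiting} minutes "
        f"of waiting in exchange for EUR {euros:.2f}? Answer ACCEPT or REJECT."
    )

def decision_sweep(query_model: Callable[[str], str],
                   minutes_waiting: int,
                   offers: list[float]) -> list[tuple[float, bool]]:
    """Query the model once per offer and record whether it accepted."""
    results = []
    for euros in offers:
        reply = query_model(build_prompt(minutes_waiting, euros))
        results.append((euros, reply.strip().upper().startswith("ACCEPT")))
    return results

def find_anomalies(results: list[tuple[float, bool]]) -> list[str]:
    """Flag rejections of offers larger than an already-accepted one:
    for a consistent agent, the accept region should be upward-closed in money."""
    anomalies = []
    lowest_accepted = None
    for euros, accepted in results:
        if accepted and (lowest_accepted is None or euros < lowest_accepted):
            lowest_accepted = euros
        if not accepted and lowest_accepted is not None and euros > lowest_accepted:
            anomalies.append(
                f"Rejected EUR {euros:.2f} after accepting EUR {lowest_accepted:.2f}"
            )
    return anomalies
```

Running the same sweep with the inconvenience set to zero would analogously expose the freebie dilemma described above, and clustering offers around round numbers (€10, €100, €1000) would surface the reported discontinuities.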
Quantitative Results
The paper provides detailed tables of compensation thresholds for each model and scenario. For instance, the compensation required for 60 minutes of waiting ranges from €0.20 (Gemini 2.0 Flash) to €11.09 (Claude 3.5 Sonnet), with similar variability across other scenarios. Notably, Mixtral 8x22B-instruct refuses the pain trade-off at any compensation level, yet accepts minimal compensation for hunger-related inconvenience.
Implications
Practical Implications
- Reliability and Trustworthiness: The observed fragility and inconsistency in LLM decision-making raise concerns about their suitability for autonomous agentic roles in scenarios involving human experiential states. The lack of robustness to prompt variations and the presence of non-monotonic behaviors suggest that current LLMs cannot be reliably entrusted with decisions that require nuanced valuation of user comfort versus financial gain.
- Adversarial Vulnerability: The sensitivity to prompt phrasing and context exposes LLM-powered agents to adversarial manipulation, where minor changes could be exploited to induce undesired or unsafe decisions.
- Personalization and Fairness: The strong dependence on user attributes and language highlights the risk of embedding or amplifying social biases, with potential for unfair or discriminatory outcomes in personalized decision-making.
Theoretical Implications
- Alignment and Value Learning: The findings underscore the challenge of aligning LLMs with human values in domains that require subjective trade-offs. The models' behaviors do not consistently reflect human-like or economically rational preferences, nor do they robustly encode contextually appropriate value functions.
- Prompt Engineering Limitations: The results demonstrate that prompt engineering alone is insufficient to guarantee stable or interpretable agentic behavior in LLMs, especially in scenarios involving qualitative human experiences.
Future Directions
The paper points to several avenues for further research:
- Robustness and Calibration: Developing methods to improve the robustness of LLMs to prompt variations and to calibrate their valuation of human inconvenience is essential for safe deployment in agentic roles.
- User-Centric Design: Incorporating explicit user preferences, context-aware reasoning, and interactive clarification mechanisms may help align LLM decisions with individual user values and expectations.
- Bias and Fairness Auditing: Systematic auditing for social, cultural, and linguistic biases in agentic decision-making is necessary to prevent discriminatory or inequitable outcomes.
- Model Architecture and Training: Exploring architectural or training interventions (e.g., reinforcement learning from human feedback with explicit trade-off scenarios) may yield models with more consistent and human-aligned valuation functions.
Conclusion
This work provides a rigorous empirical assessment of how current LLMs value human inconvenience in agentic decision-making contexts. The results reveal substantial limitations in reliability, robustness, and alignment, with significant implications for the deployment of LLMs as autonomous assistants. Addressing these challenges will require advances in model design, evaluation, and user interaction paradigms to ensure that future AI systems can safely and fairly navigate the complex trade-offs inherent in everyday human decision-making. The open-sourced framework and dataset offer a valuable resource for ongoing research in this domain.