- The paper challenges the reliance on scalar reward functions by demonstrating their limitations in capturing complex and context-specific human values.
- It endorses resource rationality as a more accurate framework than traditional rational choice theory for modeling bounded human decision-making.
- The work advocates for contractualist, socially-informed approaches that align AI systems with dynamic, task-specific normative standards.
Beyond Preferences in AI Alignment: A Critique and Reframing
This essay explores the paper "Beyond Preferences in AI Alignment," authored by Tan Zhi-Xuan, Micah Carroll, Matija Franklin, and Hal Ashton, which critically examines the prevalent preferentist approach in AI alignment. The authors challenge the foundational assumptions about aligning AI with human preferences and argue for more enriched and context-sensitive models of alignment rooted in human values, normative standards, and social roles.
The Preferentist Approach and Its Limitations
The paper starts by characterizing the preferentist approach to AI alignment, encapsulated by four main theses:
- Rational Choice Theory (RCT) as a Descriptive Framework
- Expected Utility Theory (EUT) as a Normative Standard
- Single-Principal Alignment as Preference Matching
- Multi-Principal Alignment as Preference Aggregation
These theses imply that human values can be adequately represented through preferences, which can be optimized to ensure AI alignment. However, the authors meticulously dismantle these assumptions, revealing significant conceptual and technical limitations.
A Critical Evaluation of Rational Choice Theory
First, the paper critiques the descriptive adequacy of RCT for modeling human decision-making. Human behavior is often too complex and resource-bounded for RCT's assumption of optimal utility maximization to hold. The authors endorse resource rationality as a more fitting alternative, on which people approximate rational decision-making within computational and informational bounds. This framing offers a more flexible and principled way to model the systematic biases and heuristics that humans employ.
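To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of a Boltzmann-rational choice rule, a simple bounded-agent model in the spirit of resource rationality: an inverse-temperature parameter stands in for the agent's limited computational precision, and strict utility maximization is recovered only in the high-precision limit.

```python
import math
import random

def boltzmann_choice(utilities, beta=1.0, rng=random):
    """Sample an option under a Boltzmann (softmax) choice rule.

    beta is an inverse temperature: as beta grows the rule approaches
    strict utility maximization (the RCT idealization), while small beta
    models an agent whose limited computation yields noisier choices.
    """
    m = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(beta * (u - m)) for u in utilities]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample an option index according to the choice probabilities.
    r, cumulative = rng.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i, probs
    return len(probs) - 1, probs

if __name__ == "__main__":
    utilities = [1.0, 0.9, 0.2]  # hypothetical option values
    for beta in (0.5, 2.0, 10.0):
        idx, probs = boltzmann_choice(utilities, beta=beta)
        print(f"beta={beta}: probs={[round(p, 3) for p in probs]}, sampled option {idx}")
```

At low beta the model predicts systematic deviations from maximization that a pure RCT account would have to treat as unexplained error.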
Moving Beyond Scalar Reward Functions
The work then critiques the representation of human preferences as scalar utility or reward functions, pointing out their limitations in capturing time-extended preferences and the incommensurability of values. It suggests adopting richer representations, such as temporal logics, reward machines, and vector-valued or interval-valued utilities, to model the nuanced structure of human preferences more accurately.
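To make "reward machine" concrete, the sketch below shows a minimal version of the idea: a small automaton whose internal state tracks task history and whose transitions emit reward, so reward can depend on what has happened so far rather than on the current observation alone. The two-subgoal task and state names are hypothetical, not an example from the paper.

```python
class RewardMachine:
    """A minimal reward machine: reward depends on an internal automaton
    state (task history), not only on the latest environment event."""

    def __init__(self, transitions, initial_state):
        # transitions: {(state, event): (next_state, reward)}
        self.transitions = transitions
        self.state = initial_state

    def step(self, event):
        """Advance on an observed event; unmatched events leave the state
        unchanged and emit zero reward."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return reward

# Hypothetical task: get coffee (u0 -> u1), then return to the desk (u1 -> u2).
rm = RewardMachine(
    transitions={
        ("u0", "got_coffee"): ("u1", 0.0),
        ("u1", "at_desk"): ("u2", 1.0),  # reward only once both subgoals are done
    },
    initial_state="u0",
)

for event in ["at_desk", "got_coffee", "at_desk"]:
    print(event, "-> reward", rm.step(event), "| state", rm.state)
```

Because the automaton remembers that the coffee has been fetched, being at the desk is rewarded only after that subgoal, something a scalar reward defined over the current state alone cannot express.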
Reevaluating Expected Utility Theory's Normativity
Next, the authors challenge the normative foundations of EUT, arguing that rational agents need not satisfy all of its axioms, completeness in particular, in order to avoid money-pump inconsistencies. They argue for the feasibility of designing AI systems that exhibit local rather than global coherence, thereby circumventing some of the pathological incentives associated with globally coherent utility maximization. This shift allows AI systems to better reflect the context-specific and bounded nature of human values.
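One way to see how incomplete preferences can resist a money pump: represent each option's value as an interval and trade only when the offered option is strictly better across the entire interval. The toy sketch below works under that assumption (the option names, intervals, and fee are invented); the agent simply declines every step of the cyclic, fee-laden trade sequence a money pump requires.

```python
def strictly_better(a, b):
    """a and b are (low, high) utility intervals. a is strictly better than b
    only when every value in a's interval exceeds every value in b's;
    otherwise the two options are treated as incomparable."""
    return a[0] > b[1]

def accept_trade(current, offered, fee=0.05):
    """Trade only when the offer is strictly better even after paying the fee;
    incomparability defaults to keeping what you already have."""
    discounted = (offered[0] - fee, offered[1] - fee)
    return strictly_better(discounted, current)

# Hypothetical options with overlapping (incommensurable) value intervals.
options = {
    "career_move": (0.4, 0.9),
    "family_time": (0.5, 0.8),
    "sabbatical":  (0.3, 1.0),
}

holding = "career_move"
for offer in ["family_time", "sabbatical", "career_move"]:
    trade = accept_trade(options[holding], options[offer])
    print(f"holding {holding}, offered {offer}: trade={trade}")
    if trade:
        holding = offer
# Every offer is declined, so no sequence of small fees can be extracted:
# an agent with incomplete, interval-valued preferences need not be money-pumped.
```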
Task-Specific Normative Criteria
Significantly, the paper argues against using static, asocial human preferences as the target for alignment. Instead, it posits that AI systems should align with informed, socially contextualized evaluations. For narrow tasks, this implies aligning with task-specific norms; for more general-purpose systems, such as AI assistants, alignment should be grounded in socially agreed-upon standards that define the normative duties of the role the AI occupies.
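One simple way to operationalize role-specific normative standards, offered here as an illustrative sketch rather than the paper's formalism, is to treat norms as hard constraints that filter the action set before preferences are consulted at all; the norm checks and candidate actions below are hypothetical.

```python
def choose_action(candidate_actions, norm_checks, preference_score):
    """Pick the most preferred action among those satisfying every role norm;
    norms act as constraints, not as terms traded off against preference."""
    permissible = [
        a for a in candidate_actions
        if all(check(a) for check in norm_checks)
    ]
    if not permissible:
        return None  # defer or escalate rather than violate a norm
    return max(permissible, key=preference_score)

# Hypothetical norms attached to an "AI assistant" role.
norm_checks = [
    lambda a: not a.get("deceives_user", False),
    lambda a: not a.get("exceeds_authorized_scope", False),
]

actions = [
    {"name": "flattering_half_truth", "deceives_user": True, "user_score": 0.9},
    {"name": "honest_summary", "user_score": 0.7},
    {"name": "unauthorized_purchase", "exceeds_authorized_scope": True, "user_score": 0.8},
]

best = choose_action(actions, norm_checks, lambda a: a["user_score"])
print(best["name"] if best else "defer to the user")
```

Here the most preferred actions are ruled out by the role's norms, so the assistant selects the honest, in-scope option even though it scores lower on raw user preference.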
Contractualist Approaches to Multi-Principal Alignment
For multi-principal alignment, the paper critically scrutinizes preference aggregation, highlighting its theoretical and practical inadequacies: impossibility results in social choice theory show that no aggregation rule can satisfy even a modest set of fairness and consistency criteria at once, and naive aggregation of human preferences can produce socially and ethically contentious outcomes. The authors advocate a shift towards contractualist approaches, in which AI systems are aligned with norms mutually agreed upon by stakeholders, thereby accommodating a plurality of uses for AI while mitigating collective conflict and harm.
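A standard illustration of the theoretical trouble with naive aggregation is Condorcet's paradox: with three stakeholders and three options, pairwise majority voting can yield a cycle, leaving no majority-consistent collective preference for an AI system to optimize. The sketch below constructs such a cycle; the stakeholder profiles are hypothetical.

```python
from itertools import combinations

# Hypothetical stakeholder rankings over three AI deployment policies.
rankings = {
    "stakeholder_1": ["A", "B", "C"],
    "stakeholder_2": ["B", "C", "A"],
    "stakeholder_3": ["C", "A", "B"],
}

def majority_prefers(x, y):
    """Return True if a strict majority ranks x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings.values())
    return votes > len(rankings) / 2

for x, y in combinations(["A", "B", "C"], 2):
    if majority_prefers(x, y):
        print(f"majority prefers {x} over {y}")
    elif majority_prefers(y, x):
        print(f"majority prefers {y} over {x}")

# The pairwise majorities form a cycle (A over B, B over C, C over A),
# so no single option is a stable "collective preference" to hand to an optimizer.
```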
Implications and Future Directions
The implications of this work are manifold. Practically, it necessitates rethinking AI training paradigms to embed richer, context-sensitive models of human decision-making and value structures. Theoretically, it calls for the development of formal frameworks that unite game theory, social choice theory, and normative reasoning with AI design. The authors also emphasize the importance of political and social scaffolding, advocating for participatory frameworks and democratic oversight to ensure the fair and free elicitation of stakeholder values.
Conclusion
By challenging the preferentist framework and advocating for a more nuanced, value-oriented approach, the paper "Beyond Preferences in AI Alignment" contributes significantly to the discourse on ethically aligning AI with human interests. It emphasizes that AI alignment should not merely be about optimizing preferences but should be rooted in a deeper understanding of human values, norms, and the roles AI is intended to fulfill.