- The paper challenges the reliance on scalar reward functions by demonstrating their limitations in capturing complex and context-specific human values.
- It endorses resource rationality as a more accurate framework than traditional rational choice theory for modeling bounded human decision-making.
- The work advocates for contractualist, socially-informed approaches that align AI systems with dynamic, task-specific normative standards.
Beyond Preferences in AI Alignment: A Critique and Reframing
This essay explores the paper "Beyond Preferences in AI Alignment," authored by Tan Zhi-Xuan, Micah Carroll, Matija Franklin, and Hal Ashton, which critically examines the prevalent preferentist approach in AI alignment. The authors challenge the foundational assumptions about aligning AI with human preferences and argue for more enriched and context-sensitive models of alignment rooted in human values, normative standards, and social roles.
The Preferentist Approach and Its Limitations
The paper starts by characterizing the preferentist approach to AI alignment, encapsulated by four main theses:
- Rational Choice Theory (RCT) as a Descriptive Framework
- Expected Utility Theory (EUT) as a Normative Standard
- Single-Principal Alignment as Preference Matching
- Multi-Principal Alignment as Preference Aggregation
These theses imply that human values can be adequately represented through preferences, which can be optimized to ensure AI alignment. However, the authors meticulously dismantle these assumptions, revealing significant conceptual and technical limitations.
A Critical Evaluation of Rational Choice Theory
First, the paper critiques the descriptive adequacy of RCT for modeling human decision-making. Human behavior is often too complex and resource-bounded for RCT's assumption of optimal utility maximization to hold. The authors endorse resource rationality as a more fitting alternative, on which people approximate rational decision-making within computational and informational bounds. This framing offers a more flexible and principled way to model the systematic biases and heuristics that humans employ.
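To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of a Boltzmann-rational choice rule, a simple bounded-agent model in the spirit of resource rationality: an inverse-temperature parameter stands in for the agent's limited computational precision, and strict utility maximization is recovered only in the high-precision limit.

```python
import math
import random

def boltzmann_choice(utilities, beta=1.0, rng=random):
    """Sample an option under a Boltzmann (softmax) choice rule.

    beta is an inverse temperature: as beta grows the rule approaches
    strict utility maximization (the RCT idealization), while small beta
    models an agent whose limited computation yields noisier choices.
    """
    m = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(beta * (u - m)) for u in utilities]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample an option index according to the choice probabilities.
    r, cumulative = rng.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i, probs
    return len(probs) - 1, probs

if __name__ == "__main__":
    utilities = [1.0, 0.9, 0.2]  # hypothetical option values
    for beta in (0.5, 2.0, 10.0):
        idx, probs = boltzmann_choice(utilities, beta=beta)
        print(f"beta={beta}: probs={[round(p, 3) for p in probs]}, sampled option {idx}")
```

At low beta the model predicts systematic deviations from maximization that a pure RCT account would have to treat as unexplained error.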
Moving Beyond Scalar Reward Functions
The work then critiques the representation of human preferences as scalar utility or reward functions, pointing out their limitations in capturing time-extended preferences and the incommensurability of values. It suggests adopting richer representations, such as temporal logics, reward machines, and vector-valued or interval-valued utilities, to model the nuanced structure of human preferences more accurately.
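To make "reward machine" concrete, the sketch below shows a minimal version of the idea: a small automaton whose internal state tracks task history and whose transitions emit reward, so reward can depend on what has happened so far rather than on the current observation alone. The two-subgoal task and state names are hypothetical, not an example from the paper.

```python
class RewardMachine:
    """A minimal reward machine: reward depends on an internal automaton
    state (task history), not only on the latest environment event."""

    def __init__(self, transitions, initial_state):
        # transitions: {(state, event): (next_state, reward)}
        self.transitions = transitions
        self.state = initial_state

    def step(self, event):
        """Advance on an observed event; unmatched events leave the state
        unchanged and emit zero reward."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return reward

# Hypothetical task: get coffee (u0 -> u1), then return to the desk (u1 -> u2).
rm = RewardMachine(
    transitions={
        ("u0", "got_coffee"): ("u1", 0.0),
        ("u1", "at_desk"): ("u2", 1.0),  # reward only once both subgoals are done
    },
    initial_state="u0",
)

for event in ["at_desk", "got_coffee", "at_desk"]:
    print(event, "-> reward", rm.step(event), "| state", rm.state)
```

Because the automaton remembers that the coffee has been fetched, being at the desk is rewarded only after that subgoal, something a scalar reward defined over the current state alone cannot express.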
Reevaluating Expected Utility Theory's Normativity
Next, the authors challenge the normative foundations of EUT, arguing that rational agents need not satisfy all of its axioms, completeness in particular, in order to avoid money-pump inconsistencies. They argue for the feasibility of designing AI systems that exhibit local rather than global coherence, thereby circumventing some of the pathological incentives associated with globally coherent utility maximization. This shift allows AI systems to better reflect the context-specific and bounded nature of human values.
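One way to see how incomplete preferences can resist a money pump: represent each option's value as an interval and trade only when the offered option is strictly better across the entire interval. The toy sketch below works under that assumption (the option names, intervals, and fee are invented); the agent simply declines every step of the cyclic, fee-laden trade sequence a money pump requires.

```python
def strictly_better(a, b):
    """a and b are (low, high) utility intervals. a is strictly better than b
    only when every value in a's interval exceeds every value in b's;
    otherwise the two options are treated as incomparable."""
    return a[0] > b[1]

def accept_trade(current, offered, fee=0.05):
    """Trade only when the offer is strictly better even after paying the fee;
    incomparability defaults to keeping what you already have."""
    discounted = (offered[0] - fee, offered[1] - fee)
    return strictly_better(discounted, current)

# Hypothetical options with overlapping (incommensurable) value intervals.
options = {
    "career_move": (0.4, 0.9),
    "family_time": (0.5, 0.8),
    "sabbatical":  (0.3, 1.0),
}

holding = "career_move"
for offer in ["family_time", "sabbatical", "career_move"]:
    trade = accept_trade(options[holding], options[offer])
    print(f"holding {holding}, offered {offer}: trade={trade}")
    if trade:
        holding = offer
# Every offer is declined, so no sequence of small fees can be extracted:
# an agent with incomplete, interval-valued preferences need not be money-pumped.
```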
Task-Specific Normative Criteria
Significantly, the paper argues against using static, asocial human preferences as the target for alignment. Instead, it posits that AI systems should align with informed, socially contextualized evaluations. For narrow tasks, this implies aligning with task-specific norms; for more general-purpose systems, such as AI assistants, alignment should be grounded in socially agreed-upon standards that define the normative duties of the role the AI occupies.
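One simple way to operationalize role-specific normative standards, offered here as an illustrative sketch rather than the paper's formalism, is to treat norms as hard constraints that filter the action set before preferences are consulted at all; the norm checks and candidate actions below are hypothetical.

```python
def choose_action(candidate_actions, norm_checks, preference_score):
    """Pick the most preferred action among those satisfying every role norm;
    norms act as constraints, not as terms traded off against preference."""
    permissible = [
        a for a in candidate_actions
        if all(check(a) for check in norm_checks)
    ]
    if not permissible:
        return None  # defer or escalate rather than violate a norm
    return max(permissible, key=preference_score)

# Hypothetical norms attached to an "AI assistant" role.
norm_checks = [
    lambda a: not a.get("deceives_user", False),
    lambda a: not a.get("exceeds_authorized_scope", False),
]

actions = [
    {"name": "flattering_half_truth", "deceives_user": True, "user_score": 0.9},
    {"name": "honest_summary", "user_score": 0.7},
    {"name": "unauthorized_purchase", "exceeds_authorized_scope": True, "user_score": 0.8},
]

best = choose_action(actions, norm_checks, lambda a: a["user_score"])
print(best["name"] if best else "defer to the user")
```

Here the most preferred actions are ruled out by the role's norms, so the assistant selects the honest, in-scope option even though it scores lower on raw user preference.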
Contractualist Approaches to Multi-Principal Alignment
For multi-principal alignment, the paper critically scrutinizes preference aggregation, highlighting its theoretical and practical inadequacies: impossibility results in social choice theory show that no aggregation rule can satisfy even a modest set of fairness and consistency criteria at once, and naive aggregation of human preferences can produce socially and ethically contentious outcomes. The authors advocate a shift towards contractualist approaches, in which AI systems are aligned with norms mutually agreed upon by stakeholders, thereby accommodating a plurality of uses for AI while mitigating collective conflict and harm.
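A standard illustration of the theoretical trouble with naive aggregation is Condorcet's paradox: with three stakeholders and three options, pairwise majority voting can yield a cycle, leaving no majority-consistent collective preference for an AI system to optimize. The sketch below constructs such a cycle; the stakeholder profiles are hypothetical.

```python
from itertools import combinations

# Hypothetical stakeholder rankings over three AI deployment policies.
rankings = {
    "stakeholder_1": ["A", "B", "C"],
    "stakeholder_2": ["B", "C", "A"],
    "stakeholder_3": ["C", "A", "B"],
}

def majority_prefers(x, y):
    """Return True if a strict majority ranks x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings.values())
    return votes > len(rankings) / 2

for x, y in combinations(["A", "B", "C"], 2):
    if majority_prefers(x, y):
        print(f"majority prefers {x} over {y}")
    elif majority_prefers(y, x):
        print(f"majority prefers {y} over {x}")

# The pairwise majorities form a cycle (A over B, B over C, C over A),
# so no single option is a stable "collective preference" to hand to an optimizer.
```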
Implications and Future Directions
The implications of this work are manifold. Practically, it necessitates rethinking AI training paradigms to embed richer, context-sensitive models of human decision-making and value structures. Theoretically, it calls for the development of formal frameworks that unite game theory, social choice theory, and normative reasoning with AI design. The authors also emphasize the importance of political and social scaffolding, advocating for participatory frameworks and democratic oversight to ensure the fair and free elicitation of stakeholder values.
Conclusion
By challenging the preferentist framework and advocating for a more nuanced, value-oriented approach, the paper "Beyond Preferences in AI Alignment" contributes significantly to the discourse on ethically aligning AI with human interests. It emphasizes that AI alignment should not merely be about optimizing preferences but should be rooted in a deeper understanding of human values, norms, and the roles AI is intended to fulfill.