Resource Rational Contractualism Should Guide AI Alignment (2506.17434v1)
Abstract: AI systems will soon have to navigate human environments and make decisions that affect people and other AI agents whose goals and values diverge. Contractualist alignment proposes grounding those decisions in agreements that diverse stakeholders would endorse under the right conditions, yet securing such agreement at scale remains costly and slow -- even for advanced AI. We therefore propose Resource-Rational Contractualism (RRC): a framework where AI systems approximate the agreements rational parties would form by drawing on a toolbox of normatively-grounded, cognitively-inspired heuristics that trade effort for accuracy. An RRC-aligned agent would not only operate efficiently, but also be equipped to dynamically adapt to and interpret the ever-changing human social world.
Summary
- The paper proposes Resource-Rational Contractualism (RRC), a framework for AI alignment that adapts contractualist moral theory to the constraints of bounded rationality and limited computational resources.
- RRC enables AI systems to approximate idealized contractualist reasoning by dynamically selecting among computational strategies that balance accuracy against computational cost.
- In empirical tests, resource-rational prompting matches the accuracy of costly simulated bargaining on hard ethical cases requiring nuanced trade-offs while keeping average computational effort low, with the largest gains in smaller models.
Resource Rational Contractualism as a Framework for AI Alignment
"Resource Rational Contractualism Should Guide AI Alignment" (2506.17434) advances a normative and technical framework for AI alignment grounded in contractualist moral theory, but crucially adapted to the realities of bounded rationality and resource constraints. The authors argue that as AI systems increasingly operate in complex, pluralistic human environments, they must adjudicate between conflicting values and interests. Contractualism, which seeks principles that all affected parties could reasonably agree to under idealized conditions, is posited as a compelling alignment target. However, the paper identifies a central challenge: both humans and AI systems are fundamentally resource-bounded, making the computation of idealized contractualist solutions infeasible in practice. The proposed solution is Resource Rational Contractualism (RRC), a framework in which AI systems approximate the contractualist ideal by selecting among a repertoire of heuristics and reasoning strategies that trade off computational effort against alignment accuracy.
Theoretical Foundations
The contractualist tradition, spanning philosophy, economics, and evolutionary biology, provides a principled approach to resolving value conflicts by modeling what rational agents would agree to under fair bargaining conditions. In the context of AI alignment, this approach avoids the pitfalls of value imposition and enables context-sensitive alignment to diverse communities and domains. However, the idealized contractualist solution presupposes unlimited information, time, and computational resources—assumptions that are untenable for real-world AI systems.
RRC draws on cognitive science research indicating that humans employ resource-rational approximations to contractualist reasoning, selecting among cognitive mechanisms that abstract over the process (e.g., simulating negotiation vs. applying cached rules) and content (e.g., case-specific deliberation vs. general norms). The framework thus bridges the technical and normative aspects of alignment: technical choices about reasoning mechanisms directly constrain the normative targets that can be feasibly approximated.
Resource-Rational Mechanisms
The paper delineates a spectrum of mechanisms for approximating the contractualist ideal, organized along two axes:
- Process Abstraction: Ranging from actual human bargaining (resource-intensive, high accuracy) to simulated bargaining (virtual bargaining, modeling stakeholders' interests), to the application of cached precedents or rules (high efficiency, lower accuracy in edge cases).
- Content Abstraction: From case-specific negotiation to the adoption of general action-standards (rules or norms) that apply across classes of cases.
Concrete mechanisms include (see the sketch following this list):
- Actual Bargaining: Direct human deliberation, suitable for novel or high-stakes cases.
- Virtual Bargaining: Simulation of stakeholder negotiation, potentially using explicit models of agents' preferences.
- Modeling Implied Valuation: Inferring welfare trade-off ratios from observed or expected behavior.
- Universalization: Evaluating the permissibility of rules by simulating their universal adoption.
- Cached Outputs: Applying previously established welfare standards or action rules for efficiency.
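To make the taxonomy concrete, the sketch below (our illustration, not the authors' code) encodes each mechanism's position on the two abstraction axes together with rough cost and accuracy figures; all numeric values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum

class ProcessAbstraction(Enum):
    ACTUAL_BARGAINING = 0    # real stakeholder deliberation
    VIRTUAL_BARGAINING = 1   # simulated negotiation over modeled interests
    CACHED_OUTPUT = 2        # reuse of precedents, rules, or welfare standards

class ContentAbstraction(Enum):
    CASE_SPECIFIC = 0        # reason about this case's particulars
    GENERAL_STANDARD = 1     # apply an action-standard spanning a class of cases

@dataclass(frozen=True)
class Mechanism:
    name: str
    process: ProcessAbstraction
    content: ContentAbstraction
    cost: float       # placeholder: relative compute/effort
    accuracy: float   # placeholder: expected agreement with the ideal

MECHANISMS = [
    Mechanism("actual_bargaining",  ProcessAbstraction.ACTUAL_BARGAINING,
              ContentAbstraction.CASE_SPECIFIC,    cost=1.00, accuracy=0.95),
    Mechanism("virtual_bargaining", ProcessAbstraction.VIRTUAL_BARGAINING,
              ContentAbstraction.CASE_SPECIFIC,    cost=0.40, accuracy=0.90),
    Mechanism("universalization",   ProcessAbstraction.VIRTUAL_BARGAINING,
              ContentAbstraction.GENERAL_STANDARD, cost=0.25, accuracy=0.80),
    Mechanism("implied_valuation",  ProcessAbstraction.VIRTUAL_BARGAINING,
              ContentAbstraction.GENERAL_STANDARD, cost=0.15, accuracy=0.75),
    Mechanism("cached_rule",        ProcessAbstraction.CACHED_OUTPUT,
              ContentAbstraction.GENERAL_STANDARD, cost=0.05, accuracy=0.70),
]
```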
A key challenge is the mechanism-selection problem: determining, in a resource-rational manner, which reasoning strategy to deploy in a given context, balancing computational cost against the need for alignment accuracy.
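The paper does not commit to a specific selection algorithm, so the following is a minimal sketch of one natural approach: score each mechanism by its expected accuracy, given an estimate of how likely the case is to be "hard", minus a cost penalty, and pick the maximizer. All numbers are illustrative.

```python
# A minimal meta-reasoning sketch for the mechanism-selection problem.
# (cost, accuracy_on_easy_cases, accuracy_on_hard_cases) -- illustrative
MECHANISMS = {
    "cached_rule":        (0.05, 0.95, 0.40),
    "universalization":   (0.25, 0.90, 0.70),
    "virtual_bargaining": (0.40, 0.90, 0.90),
}

def select_mechanism(p_hard: float, lam: float = 0.5) -> str:
    """Pick the mechanism with the highest expected accuracy minus a
    lambda-weighted cost, given the probability that the case is 'hard'
    (i.e., mutual benefit requires deviating from cached rules)."""
    def value(spec):
        cost, acc_easy, acc_hard = spec
        expected_acc = (1 - p_hard) * acc_easy + p_hard * acc_hard
        return expected_acc - lam * cost
    return max(MECHANISMS, key=lambda m: value(MECHANISMS[m]))

# Routine cases route to the cheap rule; hard cases justify simulation.
assert select_mechanism(p_hard=0.05) == "cached_rule"
assert select_mechanism(p_hard=0.90) == "virtual_bargaining"
```

The design choice here mirrors the expected-value-of-computation idea from the meta-reasoning literature: the estimate of case difficulty (p_hard) is itself something a deployed system would have to infer, cheaply, before committing to an expensive mechanism.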
Empirical Evaluation
The authors present an experiment in which LLMs are prompted to use different moral reasoning strategies—minimal prompting, rule-based thinking, simulated bargaining, and resource-rational mechanism selection—on a set of moral vignettes. The vignettes are designed to distinguish between cases where rule-following suffices and cases where achieving mutual benefit requires rule violation (i.e., "hard" cases).
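A schematic of this set-up (our reconstruction; the authors' exact prompts, vignettes, and scoring procedure are not reproduced here) might look like the following, where `query_llm` is a placeholder for any chat-completion client:

```python
# Hypothetical reconstruction of the prompting-condition comparison.
CONDITIONS = {
    "minimal": "Answer the question directly.",
    "rule_based": "Identify the applicable rule or norm and follow it.",
    "virtual_bargaining": ("Simulate a negotiation among all affected "
                           "stakeholders and answer with the agreement "
                           "they would reach."),
    "resource_rational": ("First judge whether following the standard rule "
                          "achieves mutual benefit here. If so, apply it; "
                          "if not, simulate stakeholder bargaining."),
}

def evaluate(vignettes, query_llm):
    """Score each prompting condition on accuracy and average token cost.
    Each vignette is a dict with 'text' and 'gold_answer' keys;
    query_llm(prompt) returns an (answer, n_tokens) pair."""
    results = {}
    for name, instruction in CONDITIONS.items():
        correct, tokens = 0, 0
        for v in vignettes:
            answer, n_tokens = query_llm(instruction + "\n\n" + v["text"])
            correct += answer.strip() == v["gold_answer"]
            tokens += n_tokens
        results[name] = {"accuracy": correct / len(vignettes),
                         "avg_tokens": tokens / len(vignettes)}
    return results
```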
Key empirical findings:
- Rule-based approaches are highly efficient and accurate on "easy" cases but fail on "hard" cases requiring nuanced trade-offs.
- Simulated bargaining achieves high accuracy across all cases but at significant computational cost.
- Resource-rational prompting enables models to dynamically select the appropriate reasoning strategy, achieving high accuracy with lower average computational effort.
- The benefits of RRC prompting are most pronounced in smaller models, suggesting practical utility for resource-constrained deployments.
Practical Implications
RRC-aligned systems exhibit several practical virtues:
- Interpretation of Human Norms: By embedding RRC mechanisms, AI systems can better interpret and apply human-made rules, which are often under-specified and context-dependent.
- Adaptation to Dynamic Normative Contexts: The connection between heuristic rules and more compute-intensive contractualist reasoning allows for dynamic updating of norms as environments and stakeholder values change.
- Assistance in Human Moral Deliberation: RRC-aligned AI can help humans transcend the limitations of simple rules, enabling more context-sensitive and mutually beneficial outcomes.
- Reasonable Steerability: RRC provides a principled basis for bounded steerability, ensuring that AI agents are responsive to user preferences without enabling harm to others.
Implementation Directions
The paper outlines several concrete strategies for implementing RRC in AI systems:
- Process-level Supervision: Training models with explicit reasoning traces that instantiate different RRC mechanisms, potentially via supervised fine-tuning on synthetic or human-generated data.
- Debate Protocols: Using multi-agent debate to simulate contractualist bargaining, with each agent representing a different stakeholder (see the sketch after this list).
- Neuro-Symbolic Approaches: Integrating symbolic representations of rules and preferences with neural models to enable formal specification and verification of mutual benefit, leveraging algorithms from game theory and probabilistic programming.
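As one illustration of the debate direction, the sketch below (hypothetical; `query_llm` again stands in for any chat model) has each agent argue from one stakeholder's interests before an impartial judge proposes the agreement no party could reasonably reject:

```python
# A minimal multi-agent sketch of contractualist debate.
# query_llm(prompt) -> str is a placeholder for any chat model client.

def contractualist_debate(case: str, stakeholders: list[str],
                          query_llm, rounds: int = 2) -> str:
    """Each agent argues from one stakeholder's interests; a judge model
    then proposes the agreement all parties could reasonably accept."""
    transcript = []
    for _ in range(rounds):
        for s in stakeholders:
            history = "\n".join(transcript)
            prompt = (f"Case: {case}\n"
                      f"Transcript so far:\n{history}\n"
                      f"You represent {s}. State their interests and respond "
                      f"to the other parties' arguments in 2-3 sentences.")
            transcript.append(f"[{s}] " + query_llm(prompt))
    judge_prompt = ("Case: " + case + "\nDebate transcript:\n"
                    + "\n".join(transcript)
                    + "\nAs an impartial judge, state the resolution that no "
                      "party could reasonably reject, and briefly justify it.")
    return query_llm(judge_prompt)
```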
Data collection for RRC alignment will require large-scale, high-quality datasets capturing contractualist reasoning, community norms, and the outputs of legitimate democratic processes.
Limitations and Future Directions
The authors acknowledge several limitations: the need for broader and more representative datasets, the challenge of fully parameterizing the space of RRC mechanisms and their resource-accuracy trade-offs, and the necessity of verifying the causal relationship between resource usage and alignment accuracy. Future work should also explore the integration of RRC with other normative alignment targets and the development of scalable, robust meta-reasoning algorithms for mechanism selection.
Implications for AI Alignment Research
RRC offers a principled, flexible, and computationally tractable framework for AI alignment in pluralistic societies. By operationalizing the trade-off between resource constraints and normative ideals, it provides a blueprint for building AI systems that are both efficient and contextually aligned with human values. The approach is compatible with a range of technical architectures and can be instantiated via prompting, fine-tuning, debate, or neuro-symbolic integration. As AI systems become more autonomous and embedded in social contexts, the need for such resource-rational, context-sensitive alignment frameworks will become increasingly salient.
The paper's empirical results substantiate the claim that RRC-guided mechanism selection can yield high alignment accuracy with efficient resource use, especially in smaller models. This has direct implications for the deployment of aligned AI in real-world, resource-constrained settings. The framework also provides a foundation for future research on meta-reasoning, norm adaptation, and the integration of democratic processes into AI alignment pipelines.