Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Published 29 May 2025 in cs.CL and cs.AI (arXiv:2505.23729v2)

Abstract: Aligning LLMs with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds for our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset, with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for the helpfulness reward while adhering to the threshold on harmlessness.

Summary

The paper "Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time" presents an approach to aligning LLMs with human preferences that adopts principles from bounded rationality, particularly satisficing strategies. While traditional methods rely heavily on multi-objective optimization, the authors argue that these approaches overlook how humans actually decide: people typically optimize a primary goal while ensuring that secondary objectives meet acceptable thresholds. This is a pivotal shift away from maximizing all preference dimensions simultaneously, a strategy that can be both computationally intensive and impractical in real-world scenarios.

Satisficing Alignment Framework

To address these challenges, the authors introduce SITAlign, a framework that operationalizes satisficing alignment at inference time. SITAlign maximizes a primary objective, such as helpfulness, while ensuring that secondary attributes like harmlessness stay above thresholds set by user preferences. The approach is theoretically grounded: the authors derive suboptimality bounds for the proposed alignment strategy, offering practical insight into its behavior.
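To make the satisficing idea concrete, here is a minimal best-of-N-style sketch: score a pool of candidate responses, keep those whose secondary reward clears the threshold, and pick the feasible candidate with the highest primary reward. This is an illustration of the satisficing principle under assumed reward functions, not the paper's actual decoding algorithm.

```python
from typing import Callable, List


def satisficing_select(
    candidates: List[str],
    primary_reward: Callable[[str], float],
    secondary_reward: Callable[[str], float],
    threshold: float,
) -> str:
    """Pick the candidate with the highest primary reward among those
    whose secondary reward meets the threshold.  If no candidate is
    feasible, fall back to the least-violating one."""
    feasible = [c for c in candidates if secondary_reward(c) >= threshold]
    if feasible:
        return max(feasible, key=primary_reward)
    # No candidate satisfies the constraint: return the one closest
    # to satisfying it rather than failing outright.
    return max(candidates, key=secondary_reward)
```

Note that a satisficing selector does not trade the constraint away for more primary reward: a candidate that scores higher on the primary objective but misses the threshold is simply excluded.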

Empirical results indicate that SITAlign outperforms existing state-of-the-art methods, particularly when helpfulness is the primary objective. On the PKU-SafeRLHF dataset, for instance, SITAlign beat the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in GPT-4 win-tie rate for the helpfulness reward while adhering to the harmlessness threshold. This strong numerical performance showcases the efficacy of satisficing alignment and indicates its viability as an alternative to conventional multi-objective approaches that rely on a weighted scalar objective.

Theoretical Insights and Implications

From a theoretical standpoint, the paper analyzes the suboptimality of SITAlign and derives performance bounds in terms of primal and dual variables. The approach avoids the computational demands typically associated with model fine-tuning, instead enabling adaptive control of LLM outputs directly at inference time. This has significant implications for practical applications where fine-tuning is prohibitive due to resource constraints or user-specific customization needs.
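In generic notation (illustrative here, not necessarily the paper's exact symbols), the satisficing alignment problem can be written as a constrained policy optimization:

```latex
\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r_{1}(x, y) \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r_{i}(x, y) \right] \ge b_{i},
\qquad i = 2, \dots, m,
```

where $r_{1}$ is the primary reward (e.g., helpfulness), $r_{2}, \dots, r_{m}$ are secondary rewards (e.g., harmlessness), and $b_{i}$ are user-chosen thresholds. The suboptimality bounds then quantify how far the inference-time policy can fall short of the optimum of this constrained problem.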

The theoretical framework rests on duality theory, which allows the satisficing problem to be cast as a convex optimization problem solvable by managing dual variables. This adaptability lets the model align responses to user-defined thresholds dynamically, without altering its underlying architecture, which significantly simplifies deployment.
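A minimal sketch of how dual variables can mediate such a constraint: a Lagrangian-style score trades the primary reward off against the thresholded secondary reward, and a projected-gradient step raises the dual variable when the constraint is violated and relaxes it otherwise. Function names, the single-constraint setup, and the step size are illustrative assumptions, not the paper's implementation.

```python
def combined_score(
    primary_value: float,
    secondary_value: float,
    threshold: float,
    lmbda: float,
) -> float:
    """Lagrangian-style score: primary reward plus a dual-weighted
    penalty/bonus for the secondary constraint E[r2] >= b."""
    return primary_value + lmbda * (secondary_value - threshold)


def dual_ascent_step(
    lmbda: float,
    secondary_value: float,
    threshold: float,
    step_size: float = 0.1,
) -> float:
    """One projected subgradient update on the dual variable: the
    multiplier grows while the constraint is violated, shrinks when it
    is slack, and is clipped at zero to stay feasible for the dual."""
    return max(0.0, lmbda - step_size * (secondary_value - threshold))
```

Iterating updates of this kind drives the multiplier toward a value at which the constraint is met, which is what allows threshold satisfaction to be enforced at decoding time rather than baked in by fine-tuning.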

Future Directions and Considerations

The research opens several avenues for further exploration, notably the application of satisficing alignment in contexts where over-optimization on single rewards may lead to undesirable outputs. The implications of this method in addressing ethical alignment, bias reduction, and latency improvements are promising. Additionally, investigating threshold determination processes, either via empirical methods such as GPT-4 evaluations or through iterative human feedback, could foster more nuanced alignment configurations.

This paper represents a thoughtful approach towards LLM alignment by leveraging bounded rationality to focus on practical satisficing rather than exhaustive optimization. The insights provided could lead to substantial advancements in AI deployment strategies that prioritize human-centric design, adaptability, and efficiency. Moving forward, researchers should consider expanding the breadth of satisficing principles to encapsulate wider contexts of alignment challenges, ultimately striving to achieve models that are inherently more reliable, ethical, and responsive to diverse operational requirements.
