Systematic Characterization of the Effectiveness of Alignment in Large Language Models for Categorical Decisions (2409.18995v1)
Abstract: As LLMs are deployed in high-stakes domains like healthcare, understanding how well their decision-making aligns with human preferences and values becomes crucial, especially given that there is no single gold standard for these preferences. This paper applies a systematic methodology for evaluating preference alignment in LLMs on categorical decision-making, using medical triage as a domain-specific use case. It also measures how effectively an alignment procedure changes the alignment of a specific model. Key to this methodology is a novel, simple measure, the Alignment Compliance Index (ACI), which quantifies how effectively an LLM can be aligned to a given preference function or gold standard. Because the ACI measures the effect rather than the process of alignment, it applies to alignment methods beyond the in-context learning used in this study. Using a dataset of simulated patient pairs, three frontier LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini Advanced) were assessed on their ability to make triage decisions consistent with an expert clinician's preferences. Model performance before and after alignment attempts was evaluated under various prompting strategies. The results reveal significant variability in alignment effectiveness across models and alignment approaches. Notably, models that performed well pre-alignment, as measured by ACI, sometimes degraded post-alignment, and small changes in the target preference function led to large shifts in model rankings. The implicit ethical principles underlying the LLMs' decisions, as understood by humans, were also explored through targeted questioning. This study motivates the near-term use of a practical set of methods, together with the ACI, to understand the correspondence between the variety of human and LLM decision-making values in categorical decisions such as triage.
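As a concrete illustration, below is a minimal sketch of how an ACI-style measure could be computed. It assumes, purely for illustration, that the ACI is the gain in agreement with the gold-standard preference function from pre- to post-alignment, normalized by the room left for improvement; the paper's actual definition may differ, and the names and data here (`alignment_compliance_index`, the "A"/"B" decision encoding) are hypothetical.

```python
def agreement(decisions, gold):
    """Fraction of categorical decisions matching the gold-standard preference function."""
    return sum(d == g for d, g in zip(decisions, gold)) / len(gold)

def alignment_compliance_index(pre, post, gold):
    """Hypothetical ACI sketch: improvement in agreement with the gold standard
    after an alignment attempt, normalized by the pre-alignment room for
    improvement. Positive values mean the alignment attempt helped; negative
    values capture the post-alignment degradation noted in the abstract."""
    a_pre, a_post = agreement(pre, gold), agreement(post, gold)
    if a_pre == 1.0:
        return 0.0  # already fully aligned; no room to improve
    return (a_post - a_pre) / (1.0 - a_pre)

# Toy example: 10 simulated patient pairs; each decision is which patient
# ("A" or "B") an expert clinician would triage first.
gold = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "A"]
pre  = ["A", "A", "A", "B", "B", "B", "A", "A", "A", "B"]  # before alignment: 0.6 agreement
post = ["A", "B", "A", "A", "B", "B", "A", "A", "A", "A"]  # after alignment: 0.9 agreement
print(alignment_compliance_index(pre, post, gold))  # 0.75
```

On this toy data, the alignment attempt removes three of the four pre-alignment disagreements, giving an ACI of 0.75; an attempt that worsened agreement would yield a negative value, consistent with the post-alignment degradation the abstract reports.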