Systematic Characterization of the Effectiveness of Alignment in Large Language Models for Categorical Decisions (2409.18995v1)
Abstract: As LLMs are deployed in high-stakes domains like healthcare, understanding how well their decision-making aligns with human preferences and values becomes crucial, especially given that there is no single gold standard for these preferences. This paper applies a systematic methodology for evaluating preference alignment in LLMs on categorical decision-making, using medical triage as a domain-specific use case. It also measures how effectively an alignment procedure changes the alignment of a specific model. Key to this methodology is a novel, simple measure, the Alignment Compliance Index (ACI), which quantifies how effectively an LLM can be aligned to a given preference function or gold standard. Because the ACI measures the effect rather than the process of alignment, it applies to alignment methods beyond the in-context learning used in this study. Using a dataset of simulated patient pairs, three frontier LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini Advanced) were assessed on their ability to make triage decisions consistent with an expert clinician's preferences. Model performance before and after alignment attempts was evaluated under various prompting strategies. The results reveal significant variability in alignment effectiveness across models and alignment approaches. Notably, models that performed well pre-alignment, as measured by ACI, sometimes degraded post-alignment, and small changes in the target preference function led to large shifts in model rankings. The implicit ethical principles underlying the LLMs' decisions, as understood by humans, were also explored through targeted questioning. This study motivates the near-term use of a practical set of methods, together with the ACI, to understand the correspondence between the variety of human and LLM decision-making values in categorical decisions such as triage.
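As a concrete illustration, below is a minimal sketch of how an ACI-style measure could be computed. It assumes, purely for illustration, that the ACI is the gain in agreement with the gold-standard preference function from pre- to post-alignment, normalized by the room left for improvement; the paper's actual definition may differ, and the names and data here (`alignment_compliance_index`, the "A"/"B" decision encoding) are hypothetical.

```python
def agreement(decisions, gold):
    """Fraction of categorical decisions matching the gold-standard preference function."""
    return sum(d == g for d, g in zip(decisions, gold)) / len(gold)

def alignment_compliance_index(pre, post, gold):
    """Hypothetical ACI sketch: improvement in agreement with the gold standard
    after an alignment attempt, normalized by the pre-alignment room for
    improvement. Positive values mean the alignment attempt helped; negative
    values capture the post-alignment degradation noted in the abstract."""
    a_pre, a_post = agreement(pre, gold), agreement(post, gold)
    if a_pre == 1.0:
        return 0.0  # already fully aligned; no room to improve
    return (a_post - a_pre) / (1.0 - a_pre)

# Toy example: 10 simulated patient pairs; each decision is which patient
# ("A" or "B") an expert clinician would triage first.
gold = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "A"]
pre  = ["A", "A", "A", "B", "B", "B", "A", "A", "A", "B"]  # before alignment: 0.6 agreement
post = ["A", "B", "A", "A", "B", "B", "A", "A", "A", "A"]  # after alignment: 0.9 agreement
print(alignment_compliance_index(pre, post, gold))  # 0.75
```

On this toy data, the alignment attempt removes three of the four pre-alignment disagreements, giving an ACI of 0.75; an attempt that worsened agreement would yield a negative value, consistent with the post-alignment degradation the abstract reports.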