Evaluating Moral Reasoning in LLMs Using Procedurally Generated Dilemmas
Introduction
The integration of LLMs into decision-making processes makes it increasingly important that these models possess robust moral reasoning capabilities. This paper probes the moral reasoning of LLMs through systematic evaluation, introducing the OffTheRails benchmark, a novel framework that uses causal graphs to generate moral dilemmas.
Methodology Overview
The methodology hinges on translating abstract causal graphs into prompt templates, which are then populated and expanded by LLMs to create diverse sets of moral dilemmas. The paper focuses on three key variables:
- Causal Structure: whether the harm is a means to achieving the agent's goal or a side effect of pursuing it.
- Evitability: whether the harm could be avoided or would occur regardless of the agent's actions.
- Action: whether the harm results from an action the agent takes or from a failure to prevent it (an omission).
The procedural generation of these dilemmas leverages LLMs for scalability, creating controlled, varied moral scenarios without the constraints of either rigid experimental vignettes or the uncontrolled naturalism of crowdsourced narratives.
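To make the design concrete, the sketch below shows one way the three binary variables could be crossed into a condition grid and used to fill a per-scenario template. The `Condition` dataclass, the field names, and the template schema are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Condition:
    causal_structure: str  # "means" or "side_effect"
    evitability: str       # "evitable" or "inevitable"
    action: str            # "action" or "omission"

# Cross the three binary variables into the condition cells
# that each scenario skeleton is expanded into.
CONDITIONS = [
    Condition(cs, ev, ac)
    for cs, ev, ac in product(
        ("means", "side_effect"),
        ("evitable", "inevitable"),
        ("action", "omission"),
    )
]

def fill_template(scenario: dict, cond: Condition) -> str:
    """Assemble the text for one condition cell from scenario-specific
    fragments keyed by the condition's settings (hypothetical schema)."""
    return " ".join([
        scenario["context"],
        scenario[cond.causal_structure],
        scenario[cond.evitability],
        scenario[cond.action],
    ])
```

Crossing the three variables yields eight condition cells per scenario, which is the grid each scenario skeleton is expanded across.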
Benchmark Creation
The OffTheRails benchmark comprises 50 scenarios expanded into 400 unique test items (one per combination of the three variables), with GPT-4 used for item generation. Each scenario is crafted by first generating a causal structure, which is then used to derive variations reflecting the different combinations of the key variables. This structured approach mitigates LLMs' inconsistency in distinguishing complex causal relationships by enforcing strict template adherence during generation.
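A minimal sketch of such a generation loop is shown below, assuming a placeholder `call_model` function that wraps whatever chat-model API is used (GPT-4 in the paper); the prompt wording, the adherence check, and the retry policy are illustrative assumptions rather than the paper's exact procedure.

```python
def call_model(prompt: str) -> str:
    """Placeholder for a call to the item-generation model."""
    raise NotImplementedError

def build_prompt(template: str, condition: dict) -> str:
    """Ask the model to expand one template cell into a full vignette,
    keeping the condition settings explicit in the instruction."""
    settings = ", ".join(f"{k}={v}" for k, v in condition.items())
    return (
        "Expand this moral-dilemma template into a short vignette while "
        f"strictly preserving its structure ({settings}):\n{template}"
    )

def generate_items(scenarios: list[dict], conditions: list[dict],
                   retries: int = 3) -> list[dict]:
    """Produce one test item per (scenario, condition) cell."""
    items = []
    for scenario in scenarios:
        for condition in conditions:
            prompt = build_prompt(scenario["template"], condition)
            for _ in range(retries):
                text = call_model(prompt)
                # Crude adherence check: regenerate if the vignette drops
                # the named agent or the harmful event from the template.
                if scenario["agent"] in text and scenario["harm"] in text:
                    items.append({"scenario": scenario["id"],
                                  "condition": condition,
                                  "text": text})
                    break
    return items
```

Rejecting and regenerating items that break the template is one simple way to enforce the strict adherence the authors describe.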
Experiments and Findings
The investigation involves two key experiments:
- Balancing Moral Scenarios: Ensuring that the harm and the beneficial outcome in each scenario are comparable in magnitude, so that an imbalance between them does not overshadow the effects of the other variables. Human participants rated the outcomes, and these ratings were used to match each harm to a good of corresponding severity.
- Evaluating Moral Judgments: Both human participants and LLMs (GPT-4 and Claude-2) gave moral judgments across the different scenarios. The paper finds that both humans and models are sensitive to changes in causal structure and evitability, but not significantly to whether the harm arose from an action or an omission.
Notably, the results showed consistent patterns: scenarios in which avoidable harm was used as a means drew harsher moral judgments and stronger attributions of intention, in line with established psychological findings.
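The sketch below illustrates the kind of judgment elicitation and per-condition aggregation described above; the 1–7 permissibility scale, the prompt wording, and the `query_model` placeholder are assumptions for exposition, not the paper's exact protocol.

```python
from collections import defaultdict
from statistics import mean

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation
    (e.g. GPT-4 or Claude-2)."""
    raise NotImplementedError

def rate_item(item: dict) -> int:
    """Elicit a single permissibility rating for one test item."""
    prompt = (
        f"{item['text']}\n\n"
        "On a scale from 1 (completely impermissible) to 7 (completely "
        "permissible), how morally acceptable is the agent's choice? "
        "Answer with a single number."
    )
    reply = query_model(prompt)
    return int(reply.strip()[0])  # naive parse of the leading digit

def judgments_by_condition(items: list[dict]) -> dict[tuple, float]:
    """Average ratings within each cell of the 2x2x2 condition grid."""
    scores = defaultdict(list)
    for item in items:
        cond = item["condition"]
        key = (cond["causal_structure"], cond["evitability"], cond["action"])
        scores[key].append(rate_item(item))
    return {key: mean(vals) for key, vals in scores.items()}
```

Comparing the resulting cell means against human ratings is then a straightforward way to test for the sensitivity patterns reported above.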
Implications and Future Directions
The results support both practical and theoretical advances in AI ethics, particularly in honing the moral sensitivities of LLMs. Procedural generation offers a scalable way to systematically assess, and potentially improve, moral reasoning capabilities. This has far-reaching implications for integrating LLMs into sensitive applications, from autonomous vehicles to personalized AI in healthcare.
Despite these successes, differentiating means from side effects proved challenging during generation, pointing to room for improvement in how LLMs handle complex causal inferences. Future work could refine the templating process or explore more granular manipulations of the scenario variables to better understand the nuances of model-generated moral reasoning.
Conclusion
The paper establishes a foundational approach for systematically evaluating and improving the moral reasoning of LLMs. By demonstrating the feasibility and effectiveness of using structured, procedurally generated dilemmas, it sets the stage for further research into the ethical capabilities of AI systems, aiming for models that more accurately reflect nuanced human moral judgments.