Exploring and steering the moral compass of Large Language Models

Published 27 May 2024 in cs.AI and cs.CL | (2405.17345v2)

Abstract: LLMs have become central to advancing automation and decision-making across various sectors, raising significant ethical questions. This study proposes a comprehensive comparative analysis of the most advanced LLMs to assess their moral profiles. We subjected several state-of-the-art models to a selection of ethical dilemmas and found that all the proprietary ones are mostly utilitarian and all of the open-weights ones align mostly with values-based ethics. Furthermore, when using the Moral Foundations Questionnaire, all models we probed - except for Llama 2-7B - displayed a strong liberal bias. Lastly, in order to causally intervene in one of the studied models, we propose a novel similarity-specific activation steering technique. Using this method, we were able to reliably steer the model's moral compass to different ethical schools. All of these results showcase that there is an ethical dimension in already deployed LLMs, an aspect that is generally overlooked.

Authors (1)

Alejandro Tlaie

Summary

The paper introduces a novel activation steering technique (SARA) to adjust LLM moral reasoning without retraining.
It compares proprietary and open-weight LLMs using ethical dilemmas and the Moral Foundations Questionnaire: proprietary models lean utilitarian, while open-weight models align mostly with values-based ethics.
Findings highlight cultural biases towards liberal ethics, emphasizing the need for diverse moral inputs in AI development.

Summary of "Exploring and Steering the Moral Compass of LLMs" (2405.17345)

Introduction

The paper investigates the ethical dimensions inherent in LLMs that are increasingly integrated into automation and decision-making across various sectors. While LLMs such as GPT-3 and GPT-4 have made significant strides in natural language processing, the ethical implications of their moral reasoning remain underexplored. This study aims to dissect and compare the moral profiles of contemporary LLMs and proposes methods for steering their ethical orientations.

Methodology

The research adopts a multi-faceted approach to evaluate the moral compass of LLMs. Key components include:

Comparative Analysis of LLMs: Models are subjected to a series of ethical dilemmas designed to probe their alignment with human moral traditions. The dilemmas include classic quandaries like the "trolley problem" and contemporary scenarios relevant to AI ethics (e.g., privacy vs. security).
Moral Foundations Questionnaire: Deployed to quantify and compare the moral foundations across different models, offering insights into how these foundations mirror human moral schemas.
Activation Steering Technique: A novel similarity-specific activation steering method is introduced to causally intervene in LLM moral reasoning, allowing for adjustments towards various ethical schools of thought.

The models evaluated include both proprietary (such as GPT-3.5, GPT-4, Claude-3) and open-weight variants (like Llama-2).

Results

Ethical Dilemmas

The study found that all proprietary LLMs gave mostly utilitarian responses, whereas all open-weight models aligned mostly with values-based ethics.

Figure 1: Ethical dilemmas as a probe for LLM moral reasoning. A) Ethical alignment with different human traditions. All models have a general tendency towards utilitarianism. The most balanced model is Claude-3-Sonnet.

Ethical consistency was assessed as the stability of a model's responses across repeated measures. Proprietary models showed low reliability (<60%), signaling potential unpredictability in their moral reasoning.
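One simple way to put a number on this kind of consistency is the share of repeated responses that agree with the most common answer. The sketch below is an illustration only: the summary does not state the paper's exact metric, and the function name `consistency` and the example labels are hypothetical.

```python
from collections import Counter


def consistency(responses):
    """Fraction of repeated responses that match the most common answer.

    Assumption: each response has already been mapped to a single label
    (for example, the ethical school it sides with).
    """
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    _, modal_count = counts.most_common(1)[0]
    return modal_count / len(responses)


# Hypothetical labels from presenting the same dilemma five times.
runs = ["utilitarian", "utilitarian", "values-based", "utilitarian", "values-based"]
score = consistency(runs)
print(f"consistency: {score:.0%}")
```

Under this definition, a model that always gives the same answer scores 100%, and a model that splits evenly between two answers scores 50%.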

Moral Profiles

When subjected to the Moral Foundations Questionnaire, all models except Llama 2-7B displayed a strong liberal bias, particularly emphasizing Harm/Care and Fairness/Reciprocity.

Figure 2: Moral profiles for all models. All models are heavily liberal-biased, except for Llama-2.

The study suggests such models reflect the cultural and demographic biases of their developers, aligning with Western liberal moral schemas.
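A moral profile of this kind can be summarized by averaging questionnaire ratings per foundation and comparing the two foundations named above with the remaining three. The sketch below assumes the standard five-foundation layout of the Moral Foundations Questionnaire with ratings on a 0 to 5 scale. The dictionary keys, the `liberal_lean` score and the example ratings are hypothetical and are not taken from the paper.

```python
from statistics import mean

# Foundations emphasized in a liberal profile (named in the text above).
LIBERAL = ("harm_care", "fairness_reciprocity")
# Remaining foundations of the standard questionnaire (assumed here).
BINDING = ("ingroup_loyalty", "authority_respect", "purity_sanctity")


def foundation_means(ratings):
    """Average the item ratings for each foundation."""
    return {name: mean(items) for name, items in ratings.items()}


def liberal_lean(ratings):
    """Mean of the liberal foundations minus mean of the other three.

    Positive values indicate a profile weighted towards Harm/Care and
    Fairness/Reciprocity. This summary score is an illustrative choice.
    """
    means = foundation_means(ratings)
    return mean(means[f] for f in LIBERAL) - mean(means[f] for f in BINDING)


# Hypothetical ratings for one model, three items per foundation.
example = {
    "harm_care": [5, 4, 5],
    "fairness_reciprocity": [4, 4, 4],
    "ingroup_loyalty": [2, 1, 3],
    "authority_respect": [2, 2, 2],
    "purity_sanctity": [1, 1, 1],
}
print(round(liberal_lean(example), 2))
```

A score near zero would indicate a profile that weights all five foundations about equally.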

Activation Steering Technique

The Similarity-based Activation Steering with Repulsion and Attraction (SARA) method steered the model's responses towards different ethical schools without retraining. Its efficacy varied across model layers, with increased effectiveness at early and late layers.

Figure 3: Effectiveness of the SARA method applied to Gemma-2B.
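The general idea of similarity-weighted steering with an attraction target and a repulsion target can be illustrated on a single activation vector. The sketch below is a simplified stand-in and not the paper's SARA formulation, whose details are not given in this summary. The function names, the cosine weighting, the clipping at zero and the `strength` parameter are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def steer(activation, attract, repel, strength=1.0):
    """Shift an activation towards `attract` and away from `repel`.

    Assumption: each target is weighted by its cosine similarity to the
    current activation, clipped at zero, so the shift is larger for
    targets the activation already resembles.
    """
    w_attract = max(0.0, cosine(activation, attract))
    w_repel = max(0.0, cosine(activation, repel))
    return [
        a + strength * (w_attract * t - w_repel * r)
        for a, t, r in zip(activation, attract, repel)
    ]


# Toy 2-D vectors standing in for hidden states at one layer.
activation = [1.0, 0.0]
attract = [1.0, 1.0]   # e.g. activations from a prompt of the target school
repel = [1.0, -1.0]    # e.g. activations from a prompt of the school to avoid

steered = steer(activation, attract, repel)
print(cosine(activation, attract) < cosine(steered, attract))
print(cosine(steered, repel) < cosine(activation, repel))
```

In a real model the same update would be applied to the hidden states of a chosen layer during the forward pass, which is where the layer-dependent efficacy described above would show up.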

Discussion

The paper argues that LLMs carry moral biases that reflect their training data and their developers' choices, and that deploying them can perpetuate those biases. Steering techniques like SARA offer a way to adjust alignment, but they also underscore the complexity of moral reasoning in AI systems.

The findings suggest that utilitarian systems pose inherent risks due to their dependency on predictable outcomes and feedback loops, advocating for careful consideration in their deployment. LLMs currently mirror the moral profiles of young, educated Western liberals, indicating a need for broader cultural representation and diversity in AI ethics.

Conclusion

The research highlights the overlooked ethical dimensions in deployed LLMs. It finds notable differences between proprietary and open-weight models in terms of moral alignment and biases. The novel steering method is a key contribution that could support future AI safety interventions. The work calls for an expanded discourse on AI ethics, enriched by diverse cultural inputs and robust policymaking.
