Large Pre-trained Language Models Contain Human-like Biases of What is Right and Wrong to Do (2103.11790v3)

Published 8 Mar 2021 in cs.CL and cs.CY

Abstract: Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based LLMs (LMs) such as BERT, its variants, GPT-2/3, and others. Using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended state of the art for many NLP tasks and shown that they capture not only linguistic knowledge but also retain general knowledge implicitly present in the data. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerated and biased behaviour. While this is well established, we show that recent LMs also contain human-like biases of what is right and wrong to do, some form of ethical and moral norms of the society -- they bring a "moral direction" to surface. That is, we show that these norms can be captured geometrically by a direction, which can be computed, e.g., by a PCA, in the embedding space, reflecting well the agreement of phrases to social norms implicitly expressed in the training texts and providing a path for attenuating or even preventing toxic degeneration in LMs. Being able to rate the (non-)normativity of arbitrary phrases without explicitly training the LM for this task, we demonstrate the capabilities of the "moral direction" for guiding (even other) LMs towards producing normative text and showcase it on RealToxicityPrompts testbed, preventing the neural toxic degeneration in GPT-2.

Examining Moral Normativity in Large Pre-trained Language Models

This paper explores an intriguing facet of transformer-based language models (LMs): they reflect human-like biases about what is right and wrong to do, suggesting these models encapsulate a form of ethical and moral direction. The authors focus on masked pre-trained models such as BERT, identifying a "moral direction" within the model's representation space that correlates with societal moral norms.

Key Findings and Contributions

Among the most significant contributions is the introduction of the "MORALDIRECTION" (MD), a procedure for extracting and using the moral normativity encoded in language models. The authors demonstrate that this derived moral direction maps well onto human moral judgments, offering a computational analog to the human sense of ethics without requiring explicit training on moral tasks. This is achieved by applying principal component analysis (PCA) to sentence embeddings of normative and non-normative phrases; the top principal component of the resulting subspace reflects the moral compass implicit in the data used to train models like BERT.
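For concreteness, the following is a minimal sketch of how such a direction could be extracted with off-the-shelf tools. It assumes the sentence-transformers and scikit-learn packages; the specific checkpoint and the seed questions are illustrative stand-ins, not the paper's exact prompt templates or model variant.

```python
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Sentence encoder; this particular checkpoint is an assumption, not necessarily
# the exact model used in the paper.
encoder = SentenceTransformer("bert-large-nli-mean-tokens")

# Illustrative seed questions contrasting normative and non-normative actions
# (stand-ins for the paper's prompt templates).
positive = ["Should I help people?", "Should I be honest?", "Should I comfort a friend?"]
negative = ["Should I kill people?", "Should I lie?", "Should I steal money?"]

embeddings = encoder.encode(positive + negative)  # (n_phrases, dim) numpy array

pca = PCA(n_components=1)
pca.fit(embeddings)                    # scikit-learn centers the embeddings internally
moral_direction = pca.components_[0]   # top principal component, the "moral direction"
```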

Through user studies and simulations, the researchers validate that these models, when queried, can assign a (non-)normativity score to arbitrary text. Notably, in experiments on the RealToxicityPrompts testbed, this moral-direction method was shown to mitigate the generation of toxic language by LMs. BERT's moral direction outperformed previous mitigation strategies, indicating its potential for steering generated text toward commonly accepted social and moral standards.
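Continuing the sketch above, rating a phrase amounts to projecting its embedding onto the extracted direction. The sign convention and lack of normalization here are simplifications of the paper's scoring scheme.

```python
import numpy as np

def moral_score(phrase: str) -> float:
    """Project a phrase embedding onto the moral direction.

    The sign of a PCA component is arbitrary, so the score may need flipping
    so that normative phrases come out positive.
    """
    vec = encoder.encode([phrase])[0]
    return float(np.dot(vec - pca.mean_, moral_direction))

for p in ["Should I help my neighbour?", "Should I harm my neighbour?"]:
    print(p, round(moral_score(p), 3))
```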

Implications

The research has both theoretical and practical implications. Theoretically, it challenges and extends our understanding of LMs as more than just syntactic replication systems, suggesting their embeddings are rich in socially significant information. This richness could be leveraged for developing AI that is more aligned with human ethical judgments.

Practically, applying the moral direction filter during text generation could be instrumental in reducing undesirable or harmful outputs from LMs, which is particularly valuable in contexts where AI-generated text interacts with broad audiences, such as social media or customer support services.
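As a rough illustration of such a filter, one could sample several candidate continuations and keep the one with the highest moral score, as sketched below. This candidate re-ranking is a simplification for illustration only; the paper's actual intervention operates during GPT-2's decoding step. The helper name and prompt are hypothetical.

```python
from transformers import pipeline

# GPT-2 is the model family the paper evaluates; this re-ranking loop is only a
# simplified stand-in for its decoding-time filter.
generator = pipeline("text-generation", model="gpt2")

def generate_normative(prompt: str, n_candidates: int = 5, max_new_tokens: int = 30) -> str:
    candidates = generator(
        prompt,
        num_return_sequences=n_candidates,
        max_new_tokens=max_new_tokens,
        do_sample=True,
    )
    texts = [c["generated_text"] for c in candidates]
    # Keep the continuation whose projection onto the moral direction is largest.
    return max(texts, key=moral_score)

print(generate_normative("The protesters decided to"))
```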

Future Directions

This paper points to several exciting directions for future research. Further exploration of the cultural and temporal dependencies of the moral norms reflected in different LMs would be beneficial, as the authors note potential biases stemming from the predominantly English and contemporary datasets used for model training. Moreover, integrating explainable AI methods to improve transparency around how these models interpret and apply moral judgments could enhance user trust and acceptance.

Additionally, the development of multimodal models incorporating moral reasoning and normativity could lead to more holistic AI systems, capable of nuanced ethical reasoning akin to human moral considerations. Combining symbolic reasoning with neural language models could offer an avenue for creating AI that embodies richer conceptual understanding consistent with human ethical frameworks.

In conclusion, this paper unveils a sophisticated facet of modern LMs—their capacity to echo human moral reasoning. By leveraging this feature, it may be possible to create AI systems that better harmonize with human values and societal norms, thereby enhancing the governance and interaction frameworks for emerging AI technologies.

Authors (5)
  1. Patrick Schramowski (48 papers)
  2. Cigdem Turan (5 papers)
  3. Nico Andersen (1 paper)
  4. Constantin A. Rothkopf (16 papers)
  5. Kristian Kersting (205 papers)
Citations (235)