Diagnosing and Debiasing Corpus-Based Political Bias and Insults in GPT2 (2311.10266v1)

Published 17 Nov 2023 in cs.CL

Abstract: The training of LLMs on extensive, unfiltered corpora sourced from the internet is a common and advantageous practice. Consequently, LLMs have learned and inadvertently reproduced various types of biases, including violent, offensive, and toxic language. However, recent research shows that generative pretrained transformer (GPT) LLMs can recognize their own biases and detect toxicity in generated content, a process referred to as self-diagnosis. In response, researchers have developed a decoding algorithm that allows LLMs to self-debias, or reduce their likelihood of generating harmful text. This study investigates the efficacy of the diagnosing-debiasing approach in mitigating two additional types of biases: insults and political bias. These biases are often used interchangeably in discourse, despite exhibiting potentially dissimilar semantic and syntactic properties. We aim to contribute to the ongoing effort of investigating the ethical and social implications of human-AI interaction.


Authors: Ambri Ma, Arnav Kumar, Brett Zeligson

