Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions
The paper "Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions" by Ruizhe Li and Yanjun Gao investigates a critical yet enigmatic issue in the GPT-2 family: anchored bias in multiple-choice questions (MCQs). This bias refers to the model's tendency to prefer the first choice ('A') regardless of the question context, potentially skewing the integrity of its decision-making process.
The authors conduct a mechanistic interpretability analysis to dissect the internal workings of GPT-2 models, focusing on the Multi-Layer Perceptron (MLP) layers and attention heads. Using the "logit lens" technique, they trace the origin of the bias to specific value vectors in the MLP modules that inherently favour the first-choice option. The paper shows that anchored bias is not uniformly distributed across all layers but is concentrated in particular layers close to the model's output; for instance, layer 9 in GPT2-Small and layer 34 in GPT2-Large exhibit significant bias.
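The sketch below illustrates this kind of logit-lens inspection under the common "key-value memory" view of MLP layers: each row of a layer's down-projection matrix is treated as a value vector and projected through GPT-2's unembedding to see which tokens it promotes. The layer index and the focus on the " A" token are illustrative assumptions rather than the authors' exact code, and the final LayerNorm is ignored for simplicity.

```python
# Sketch: project MLP value vectors through the unembedding (logit lens) and
# rank them by how strongly they push the logit of the " A" answer token.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

layer = 9                                              # a late layer in GPT2-Small
W_V = model.transformer.h[layer].mlp.c_proj.weight     # (3072, 768): one value vector per row
W_U = model.lm_head.weight                             # (50257, 768): unembedding, tied to wte

with torch.no_grad():
    vocab_logits = W_V @ W_U.T                         # (3072, 50257): logit lens per value vector

a_id = tok.encode(" A")[0]
scores = vocab_logits[:, a_id]                         # contribution of each value vector to " A"
top = torch.topk(scores, k=5)
for idx, s in zip(top.indices.tolist(), top.values.tolist()):
    print(f"value vector {idx}: logit-lens score toward ' A' = {s:.2f}")
```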
To counteract this bias, the authors propose a minimal yet effective intervention: they directly update the value vectors responsible for the bias so that the favoured choice is de-emphasized. This alteration yields a noticeable improvement in MCQ prediction accuracy across multiple datasets, including IOI and ARC, demonstrating the effectiveness of the approach in different settings.
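One way such a direct value-vector edit could look is sketched below, assuming the biased rows have already been identified (for example, via the logit-lens ranking above). It removes the component of each flagged vector along the unembedding direction of the " A" token; the row indices are hypothetical and this is an illustration, not the authors' exact update rule.

```python
# Sketch: neutralise identified value vectors by removing their component along
# the unembedding direction of the " A" token, so they no longer push toward "A".
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

layer = 9
biased_rows = [2137, 845]                 # hypothetical indices of biased value vectors
a_id = tok.encode(" A")[0]
u_a = model.lm_head.weight[a_id]          # (768,) unembedding direction of " A"

with torch.no_grad():
    W_V = model.transformer.h[layer].mlp.c_proj.weight   # (3072, 768)
    for r in biased_rows:
        v = W_V[r]
        # subtract the projection of v onto u_a, removing its push toward "A"
        W_V[r] = v - (v @ u_a) / (u_a @ u_a) * u_a
```

Because only a handful of weight rows are touched, the rest of the model is left intact, which is what makes this style of intervention attractive compared with retraining or prompt-level workarounds.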
Beyond addressing performance, the paper explores the implications of these findings for the robustness and integrity of LLM outputs. By identifying and rectifying bias at the level of individual internal neurons rather than at the preprocessing stage, this research opens up new avenues for mitigating biases in LLMs without extensive prompt engineering or dataset alterations.
Furthermore, the authors speculate on potential future developments, particularly in enhancing model interpretability and fairness. They suggest that further investigations could examine similar biases in other model families or extend these strategies to other natural language understanding and generation tasks.
In conclusion, the paper provides a comprehensive analysis of positional bias in GPT-2 models, offering a systematic approach to uncovering and mitigating such biases. The implications are significant, laying a foundation for fairer, less biased language technologies. The findings not only deepen our understanding of model biases but also contribute to the wider discourse on ensuring fairness and accuracy in AI-driven decision-making systems.