Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions
The paper "Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions" by Ruizhe Li and Yanjun Gao investigates a critical yet enigmatic issue in the GPT-2 family: anchored bias in multiple-choice questions (MCQs). This bias refers to the model's tendency to prefer the first choice ('A') regardless of the question context, potentially skewing the integrity of its decision-making process.
The authors conduct a mechanistic interpretability analysis to dissect the internal workings of GPT-2 models, focusing on the Multi-Layer Perceptron (MLP) layers and attention heads. Using the "logit lens" technique, they trace the origin of the bias to specific value vectors in the MLP modules that inherently favour the first-choice option. The paper shows that anchored bias is not uniformly distributed across all layers but is concentrated in particular layers close to the model's output; for instance, layer 9 in GPT2-Small and layer 34 in GPT2-Large exhibit significant bias.
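The sketch below illustrates this kind of logit-lens inspection under the common "key-value memory" view of MLP layers: each row of a layer's down-projection matrix is treated as a value vector and projected through GPT-2's unembedding to see which tokens it promotes. The layer index and the focus on the " A" token are illustrative assumptions rather than the authors' exact code, and the final LayerNorm is ignored for simplicity.

```python
# Sketch: project MLP value vectors through the unembedding (logit lens) and
# rank them by how strongly they push the logit of the " A" answer token.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

layer = 9                                              # a late layer in GPT2-Small
W_V = model.transformer.h[layer].mlp.c_proj.weight     # (3072, 768): one value vector per row
W_U = model.lm_head.weight                             # (50257, 768): unembedding, tied to wte

with torch.no_grad():
    vocab_logits = W_V @ W_U.T                         # (3072, 50257): logit lens per value vector

a_id = tok.encode(" A")[0]
scores = vocab_logits[:, a_id]                         # contribution of each value vector to " A"
top = torch.topk(scores, k=5)
for idx, s in zip(top.indices.tolist(), top.values.tolist()):
    print(f"value vector {idx}: logit-lens score toward ' A' = {s:.2f}")
```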
To counteract this bias, the authors propose a minimal yet effective intervention: they directly update the value vectors responsible for the bias so that the favoured choice is de-emphasized. This alteration yields a noticeable improvement in MCQ prediction accuracy across multiple datasets, including IOI and ARC, demonstrating the effectiveness of the approach in different settings.
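One way such a direct value-vector edit could look is sketched below, assuming the biased rows have already been identified (for example, via the logit-lens ranking above). It removes the component of each flagged vector along the unembedding direction of the " A" token; the row indices are hypothetical and this is an illustration, not the authors' exact update rule.

```python
# Sketch: neutralise identified value vectors by removing their component along
# the unembedding direction of the " A" token, so they no longer push toward "A".
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

layer = 9
biased_rows = [2137, 845]                 # hypothetical indices of biased value vectors
a_id = tok.encode(" A")[0]
u_a = model.lm_head.weight[a_id]          # (768,) unembedding direction of " A"

with torch.no_grad():
    W_V = model.transformer.h[layer].mlp.c_proj.weight   # (3072, 768)
    for r in biased_rows:
        v = W_V[r]
        # subtract the projection of v onto u_a, removing its push toward "A"
        W_V[r] = v - (v @ u_a) / (u_a @ u_a) * u_a
```

Because only a handful of weight rows are touched, the rest of the model is left intact, which is what makes this style of intervention attractive compared with retraining or prompt-level workarounds.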
Beyond addressing performance, the paper explores the implications of these findings for the robustness and integrity of LLM outputs. By identifying and rectifying bias at the level of individual internal neurons rather than at the preprocessing stage, this research opens up new avenues for mitigating biases in LLMs without extensive prompt engineering or dataset alterations.
Furthermore, the authors speculate on potential future developments, particularly in enhancing model interpretability and fairness. They suggest that further investigations could examine similar biases in other model families or extend these strategies to other natural language understanding and generation tasks.
In conclusion, the paper provides a comprehensive analysis of positional bias in GPT-2 models, offering a systematic approach to uncovering and mitigating such biases. The implications are significant, laying a foundation for fairer, less biased language technologies. The findings not only deepen our understanding of model biases but also contribute to the wider discourse on ensuring fairness and accuracy in AI-driven decision-making systems.