An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment
In Information Retrieval (IR), the study of human cognitive biases has been pivotal in improving the design and evaluation of search systems. Extending this concern to AI, the paper "AI Can Be Cognitively Biased: An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment" investigates whether LLMs exhibit human-like cognitive biases, specifically the threshold priming effect, during batch relevance assessments. The authors, Nuo Chen et al., systematically examine the presence and impact of this bias across several LLMs: GPT-3.5, GPT-4, LLaMa2-13B, and LLaMa2-70B.
Methodology and Experiment Design
The authors conducted experiments using 10 topics from the TREC 2019 Deep Learning passage track collection, ensuring a diverse range of domains and relevance levels. Relevance judgments were tested under varying conditions: the relevance level of the preceding documents, the lengths of the document batches, and the LLM used. The setup involved creating high-threshold (HT) and low-threshold (LT) prologues and then measuring each LLM's relevance assessments of identical epilogue documents. This design isolates the threshold priming effect: because the epilogue documents are the same in both conditions, any difference in their scores can be attributed to the HT versus LT context.
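To make the batch setup concrete, here is a minimal Python sketch of how such a prompt might be assembled. This is an illustration under assumptions: the function name, the 0-3 rating scale, and the prompt wording are hypothetical and are not the authors' actual prompts.

```python
# Hypothetical sketch of assembling one batch prompt: a prologue of
# pre-characterized passages followed by the shared epilogue passages.
# The rating scale and wording are illustrative, not the paper's exact prompt.

def build_batch_prompt(topic: str,
                       prologue_passages: list[str],
                       epilogue_passages: list[str]) -> str:
    """Concatenate prologue and epilogue passages into a single batch
    to be judged in one LLM call."""
    passages = prologue_passages + epilogue_passages
    lines = [
        f"Query: {topic}",
        "Rate the relevance of each passage to the query on a 0-3 scale.",
    ]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"Passage {i}: {passage}")
    return "\n".join(lines)

# HT condition: the prologue contains highly relevant passages;
# LT condition: the prologue contains marginally relevant ones.
# The epilogue passages are identical in both conditions, so any score
# difference on them is attributable to the preceding context.
ht_prompt = build_batch_prompt(
    "flu symptoms",
    prologue_passages=["<highly relevant 1>", "<highly relevant 2>"],
    epilogue_passages=["<shared passage 1>", "<shared passage 2>"],
)
```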
Key Findings
The empirical results demonstrated that LLMs are indeed susceptible to threshold priming:
- Influence of Prologue Length: When both the prologue and epilogue were short (PL = 4, EL = 4), all tested models exhibited significant threshold priming: higher relevance scores assigned to earlier (prologue) documents led to lower relevance assessments of the subsequent documents. (A minimal way to quantify this comparison is sketched after this list.)
- Model-specific Observations: GPT-3.5 and GPT-4 showed pronounced threshold priming effects across most conditions. In contrast, LLaMa2-70B exhibited threshold priming primarily in configurations with shorter prologues, and showed an inversion of the effect at longer prologue lengths (PL = 8).
- Topic Sensitivity: The extent of threshold priming varied across topics, indicating that certain queries are more prone to biased judgment by LLMs. In particular, topics with smaller differences between the LT and HT conditions hinted that other cognitive biases, such as the anchoring effect, may also influence the models' judgments.
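One simple way to quantify the priming effect is to compare the scores a model assigns to the identical epilogue passages under the HT and LT conditions. The sketch below uses a paired t-test as one reasonable choice of significance test; the paper's own statistical analysis may differ, and the sample scores are invented for illustration.

```python
# Minimal sketch: quantify threshold priming by comparing scores assigned to
# the *same* epilogue passages under HT vs. LT prologues. The paired t-test is
# one reasonable choice of test, not necessarily the one used in the paper.
from statistics import mean
from scipy.stats import ttest_rel

def priming_effect(ht_scores: list[float], lt_scores: list[float]):
    """ht_scores[i] and lt_scores[i] are the model's judgments of the same
    epilogue passage under high- and low-threshold prologues."""
    # A negative difference indicates priming: highly relevant prologue
    # documents depress the scores of the documents that follow.
    diff = mean(ht_scores) - mean(lt_scores)
    _, p_value = ttest_rel(ht_scores, lt_scores)
    return diff, p_value

# Invented example scores for an EL = 4 epilogue:
ht = [1.0, 1.0, 2.0, 1.0]   # epilogue scores after an HT prologue
lt = [2.0, 2.0, 2.0, 2.0]   # epilogue scores after an LT prologue
print(priming_effect(ht, lt))
```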
Implications and Future Directions
The implications of these findings are both practical and theoretical. Practically, they show that, despite their advanced capabilities, LLMs are not immune to cognitive biases. Awareness of such biases means that IR system designers and evaluators should incorporate countermeasures during model training and evaluation. Moreover, biases in AI could compound when these models interact with human users, potentially amplifying errors and misjudgments.
Theoretically, the paper underscores the need for a deeper investigation into the "bounded rationality" of LLMs. Borrowing this notion from economics and cognitive science, the AI research community can draw parallels with human cognitive limitations to better understand and mitigate such biases in machine learning models. Future research should focus on:
- Broader Dataset Evaluation: Testing a wider range of topics and scenarios to confirm the generalizability of these findings.
- Prompt Engineering: Exploring prompt modifications and structured queries to minimize the impact of cognitive biases (a hypothetical example follows this list).
- Bias Mitigation Techniques: Developing and integrating anti-bias protocols in training regimes of LLMs.
- Interdisciplinary Approaches: Leveraging psychological and behavioral insights to inform AI development, ensuring better-aligned AI-augmented decision systems.
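As a concrete illustration of the prompt-engineering direction above, one might prepend an instruction telling the model to judge each passage independently of earlier scores in the batch. This is a hypothetical mitigation, not one evaluated in the paper, and its effectiveness would need to be tested empirically.

```python
# Hypothetical prompt-engineering mitigation: instruct the model to judge each
# passage on its own merits, independent of earlier scores in the batch.
# This direction was not validated by the paper.

DEBIAS_INSTRUCTION = (
    "Judge each passage strictly on its own merits against the query. "
    "Do not let the relevance of previously rated passages raise or lower "
    "your standard for the passages that follow."
)

def debiased_prompt(batch_prompt: str) -> str:
    """Prepend an independence instruction to an existing batch prompt."""
    return DEBIAS_INSTRUCTION + "\n\n" + batch_prompt
```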
Conclusion
This exploratory paper initiates a crucial dialogue on the presence of human-like cognitive biases in LLMs, specifically within the context of threshold priming. As LLMs increasingly influence decision-making processes, understanding and addressing such biases become imperative. By revealing these biases, the research paves the way for developing more robust, unbiased AI systems that can fairly and effectively augment human judgments in diverse applications.