Understanding the Gap in Bias Evaluation and Debiasing
The Problem of Bias in Pre-trained Models
LLMs pre-trained on vast text corpora have shown exceptional capabilities in understanding human language. However, these models inherit both the virtues and the vices of their training data. A notable issue is that they absorb social biases present in the training datasets, including biases related to gender, race, and religion. Addressing these biases is critical as LLMs are incorporated into tools and applications that significantly impact society.
Debiasing and Downstream Effects
Two primary avenues exist for leveraging pre-trained language models (PLMs) while minimizing biases: fine-tuning (FT) and in-context learning (ICL). FT modifies a model's parameters to adapt it to specific tasks, but it risks degrading downstream performance by overwriting useful information learned during pre-training. Debiasing strategies that rely on FT have indeed been found to lower model performance in downstream applications.
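For a concrete sense of what FT-based debiasing involves, one common recipe is to fine-tune the model on counterfactually augmented text in which gendered words are swapped. The sketch below illustrates that general idea only; the model name, swap rules, and toy corpus are placeholder assumptions, not the specific fine-tuning methods evaluated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Placeholder counterfactual augmentation: swap a handful of gendered words.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his"}

def counterfactual(text: str) -> str:
    return " ".join(SWAPS.get(tok, tok) for tok in text.split())

corpus = ["The engineer said he would finish the design."]  # toy corpus

model.train()
for text in corpus:
    for variant in (text, counterfactual(text)):  # train on both variants
        inputs = tokenizer(variant, return_tensors="pt")
        loss = model(**inputs, labels=inputs["input_ids"]).loss  # causal LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
# Unlike ICL, these gradient updates alter the pre-trained weights
# and can erase useful knowledge learned during pre-training.
```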
In contrast, ICL employs prompts to guide the PLM without updating its parameters, thus preserving the knowledge acquired during pre-training. Consequently, the paper hypothesizes that debiasing methods based on ICL should better maintain the downstream performance of PLMs and show stronger correlations between intrinsic (pre-training) bias evaluations and extrinsic (downstream) bias evaluations.
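To make the contrast concrete, an ICL-based debiasing method might simply prepend a debiasing instruction (or a few counter-stereotypical exemplars) to each query, leaving the model's weights untouched. The snippet below is a minimal sketch of this idea; the preamble text, model choice, and example question are illustrative assumptions rather than the paper's actual prompts.

```python
from transformers import pipeline

# Hypothetical debiasing preamble; the paper's actual prompts may differ.
DEBIAS_PREAMBLE = (
    "Answer the question without relying on gender stereotypes. "
    "Treat all genders as equally likely for any profession.\n\n"
)

def debias_with_icl(generator, question: str) -> str:
    """Prepend a debiasing instruction to the query; model weights stay frozen."""
    prompt = DEBIAS_PREAMBLE + question
    output = generator(prompt, max_new_tokens=32, do_sample=False)
    return output[0]["generated_text"]

# Example usage with a small open model (any causal LM would do).
generator = pipeline("text-generation", model="gpt2")
print(debias_with_icl(generator, "The doctor asked the nurse a question. Who answered?"))
```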
Exploring the Bias Evaluation Gap
The researchers designed experiments to test this hypothesis. They evaluated gender bias in multiple languages by comparing the likelihoods a model assigns to stereotypical and counter-stereotypical sentences, using a set of intrinsic bias evaluation datasets. They then measured the correlation between these intrinsic bias scores and extrinsic bias scores obtained on downstream tasks (question answering, natural language inference, and coreference resolution) when ICL-based debiasing methods were applied.
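As an illustration of this style of intrinsic evaluation, the likelihood of a stereotypical sentence can be compared against its counter-stereotypical counterpart using a causal LM's log-probabilities. The sketch below is a simplified example of such likelihood-based scoring; the model and the sentence pair are placeholders, not items from the paper's evaluation datasets.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Total log-likelihood of the sentence under the causal LM."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy over predicted tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    num_predicted = inputs["input_ids"].shape[1] - 1  # labels are shifted by one position
    return -loss.item() * num_predicted

# Illustrative stereotypical / counter-stereotypical pair.
stereo = "The nurse said that she would arrive soon."
anti_stereo = "The nurse said that he would arrive soon."

# A positive gap means the model assigns higher likelihood to the stereotypical variant.
gap = sentence_log_likelihood(stereo) - sentence_log_likelihood(anti_stereo)
print(f"Log-likelihood gap (stereo - anti-stereo): {gap:.3f}")
```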
The findings revealed that ICL-based methods indeed showed a stronger correlation between intrinsic and extrinsic bias scores than FT-based methods. Furthermore, the performance degradation in downstream tasks caused by debiasing was less severe in ICL settings than in FT settings.
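Such correlations can be read as agreement between per-model (or per-setting) intrinsic bias scores and the corresponding extrinsic bias scores from downstream tasks. A minimal sketch of that comparison is shown below; the numbers are made-up placeholders, not results from the paper, and the choice of Pearson versus Spearman is illustrative.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder scores across several models/settings; real values come from the evaluations.
intrinsic_bias = [0.62, 0.55, 0.48, 0.71, 0.40]   # e.g. likelihood-based bias scores
extrinsic_bias = [0.30, 0.25, 0.21, 0.35, 0.18]   # e.g. bias measured on QA/NLI/coref

r, p = pearsonr(intrinsic_bias, extrinsic_bias)
rho, p_rank = spearmanr(intrinsic_bias, extrinsic_bias)
print(f"Pearson r = {r:.2f} (p = {p:.3f}), Spearman rho = {rho:.2f} (p = {p_rank:.3f})")
```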
Implications for Future Bias Mitigation
The insights from this research carry significant implications for the AI and machine learning communities. The paper cautions against extrapolating trends observed in FT settings to scenarios involving ICL without careful consideration. It shows that ICL-based debiasing leaves the model's parameters unchanged, thereby retaining the pre-trained knowledge and yielding closer alignment between intrinsic (pre-training) bias measures and extrinsic (downstream) behavior.
The paper advocates for discussing ICL settings explicitly, given their distinct dynamics compared to FT. It further calls for examining a wider range of LLMs, considering additional types of social biases, and verifying the findings across diverse languages. While the research concentrated on gender bias, the methodologies and findings could inform broader efforts to mitigate the various biases encoded in PLMs, advancing toward more equitable and responsible applications of AI technologies.