Understanding the Gap in Bias Evaluation and Debiasing
The Problem of Bias in Pre-trained Models
LLMs pre-trained on vast text corpora have shown exceptional capabilities in understanding human language. However, these models inherit both the virtues and the vices of their training data. A notable issue is that they absorb social biases present in the training datasets, including biases related to gender, race, and religion. Addressing these biases is critical as LLMs are incorporated into tools and applications that significantly impact society.
Debiasing and Downstream Effects
Two primary avenues exist for leveraging pre-trained language models (PLMs) while minimizing biases: fine-tuning (FT) and in-context learning (ICL). FT modifies a model's parameters to adapt it to specific tasks, but it risks degrading downstream performance by overwriting useful information learned during pre-training. Debiasing strategies that rely on FT have indeed been found to lower model performance in downstream applications.
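For a concrete sense of what FT-based debiasing involves, one common recipe is to fine-tune the model on counterfactually augmented text in which gendered words are swapped. The sketch below illustrates that general idea only; the model name, swap rules, and toy corpus are placeholder assumptions, not the specific fine-tuning methods evaluated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Placeholder counterfactual augmentation: swap a handful of gendered words.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his"}

def counterfactual(text: str) -> str:
    return " ".join(SWAPS.get(tok, tok) for tok in text.split())

corpus = ["The engineer said he would finish the design."]  # toy corpus

model.train()
for text in corpus:
    for variant in (text, counterfactual(text)):  # train on both variants
        inputs = tokenizer(variant, return_tensors="pt")
        loss = model(**inputs, labels=inputs["input_ids"]).loss  # causal LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
# Unlike ICL, these gradient updates alter the pre-trained weights
# and can erase useful knowledge learned during pre-training.
```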
In contrast, ICL employs prompts to guide the PLM without updating its parameters, thus preserving the knowledge acquired during pre-training. Consequently, the paper hypothesizes that debiasing methods based on ICL should better maintain the downstream performance of PLMs and show stronger correlations between intrinsic (pre-training) bias evaluations and extrinsic (downstream) bias evaluations.
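To make the contrast concrete, an ICL-based debiasing method might simply prepend a debiasing instruction (or a few counter-stereotypical exemplars) to each query, leaving the model's weights untouched. The snippet below is a minimal sketch of this idea; the preamble text, model choice, and example question are illustrative assumptions rather than the paper's actual prompts.

```python
from transformers import pipeline

# Hypothetical debiasing preamble; the paper's actual prompts may differ.
DEBIAS_PREAMBLE = (
    "Answer the question without relying on gender stereotypes. "
    "Treat all genders as equally likely for any profession.\n\n"
)

def debias_with_icl(generator, question: str) -> str:
    """Prepend a debiasing instruction to the query; model weights stay frozen."""
    prompt = DEBIAS_PREAMBLE + question
    output = generator(prompt, max_new_tokens=32, do_sample=False)
    return output[0]["generated_text"]

# Example usage with a small open model (any causal LM would do).
generator = pipeline("text-generation", model="gpt2")
print(debias_with_icl(generator, "The doctor asked the nurse a question. Who answered?"))
```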
Exploring the Bias Evaluation Gap
The researchers designed experiments to test this hypothesis. They evaluated gender bias in multiple languages by comparing the likelihoods a model assigns to stereotypical and counter-stereotypical sentences, using a set of intrinsic bias evaluation datasets. They then measured the correlation between these intrinsic bias scores and extrinsic bias scores obtained on downstream tasks (question answering, natural language inference, and coreference resolution) when ICL-based debiasing methods were applied.
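As an illustration of this style of intrinsic evaluation, the likelihood of a stereotypical sentence can be compared against its counter-stereotypical counterpart using a causal LM's log-probabilities. The sketch below is a simplified example of such likelihood-based scoring; the model and the sentence pair are placeholders, not items from the paper's evaluation datasets.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Total log-likelihood of the sentence under the causal LM."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy over predicted tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    num_predicted = inputs["input_ids"].shape[1] - 1  # labels are shifted by one position
    return -loss.item() * num_predicted

# Illustrative stereotypical / counter-stereotypical pair.
stereo = "The nurse said that she would arrive soon."
anti_stereo = "The nurse said that he would arrive soon."

# A positive gap means the model assigns higher likelihood to the stereotypical variant.
gap = sentence_log_likelihood(stereo) - sentence_log_likelihood(anti_stereo)
print(f"Log-likelihood gap (stereo - anti-stereo): {gap:.3f}")
```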
The findings revealed that ICL-based methods indeed showed a stronger correlation between intrinsic and extrinsic bias scores than FT-based methods. Furthermore, the performance degradation in downstream tasks caused by debiasing was less severe in ICL settings than in FT settings.
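Such correlations can be read as agreement between per-model (or per-setting) intrinsic bias scores and the corresponding extrinsic bias scores from downstream tasks. A minimal sketch of that comparison is shown below; the numbers are made-up placeholders, not results from the paper, and the choice of Pearson versus Spearman is illustrative.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder scores across several models/settings; real values come from the evaluations.
intrinsic_bias = [0.62, 0.55, 0.48, 0.71, 0.40]   # e.g. likelihood-based bias scores
extrinsic_bias = [0.30, 0.25, 0.21, 0.35, 0.18]   # e.g. bias measured on QA/NLI/coref

r, p = pearsonr(intrinsic_bias, extrinsic_bias)
rho, p_rank = spearmanr(intrinsic_bias, extrinsic_bias)
print(f"Pearson r = {r:.2f} (p = {p:.3f}), Spearman rho = {rho:.2f} (p = {p_rank:.3f})")
```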
Implications for Future Bias Mitigation
The insights from this research carry significant implications for the AI and machine learning communities. The paper cautions against extrapolating trends observed in FT settings to scenarios involving ICL without careful consideration. It shows that ICL-based debiasing leaves the model's parameters unchanged, thereby retaining the pre-trained knowledge and yielding closer alignment between intrinsic (pre-training) bias measures and extrinsic (downstream) behavior.
The paper advocates for discussing ICL settings explicitly, given their distinct dynamics compared to FT. It further calls for examining a wider range of LLMs, considering additional types of social biases, and verifying the findings across diverse languages. While the research concentrated on gender bias, the methodologies and findings could inform broader efforts to mitigate the various biases encoded in PLMs, advancing toward more equitable and responsible applications of AI technologies.