Improving Fairness in LLMs through Internal Bias Mitigation
The paper investigates the robustness of bias mitigation techniques for LLMs used in hiring scenarios. The authors address the concern that existing approaches, primarily prompt-based methods, often fail when confronted with the realistic contextual details inherent to high-stakes decision-making environments such as recruitment. They evaluate a range of commercial and open-source LLMs and show that racial and gender biases resurface under complex, real-world conditions, even when anti-bias prompts appear effective in controlled settings.
Main Findings
- Prompt Fragility: The authors find that the common practice of adding anti-bias prompts fails to consistently mitigate bias once models are exposed to realistic hiring scenarios that include elements such as company-specific information and selective hiring constraints. This fragility appears across several leading models: biases re-emerge despite prompts explicitly designed to eliminate demographic partiality. (A paired-resume measurement of this effect is sketched after this list.)
- Internal Bias Mitigation: To counteract the shortcomings of external prompting, the authors propose an internal intervention that leverages interpretability techniques. Specifically, they identify race- and gender-correlated directions in model activations and apply affine concept editing along those directions during inference. Their evaluation shows that this method significantly reduces bias across diverse realistic settings while maintaining model performance. (A minimal sketch of this kind of intervention follows the list.)
- Bias Direction Consistency: Across the models and company contexts evaluated, the measured biases consistently favor Black over White candidates and female over male candidates. This pattern raises questions about how generic diversity language in corporate materials interacts with the emergence of bias.
- Internal Intervention Generalizability: An important contribution is demonstrating the generality of the internal intervention. Demographic directions extracted from synthetic data also mitigate biases that models infer from contextual cues such as college affiliation, underscoring their applicability beyond explicit demographic signals.
- Preservation of Model Integrity: The authors present evidence that the internal intervention does not substantially degrade overall model capabilities, as assessed on the MMLU benchmark, indicating minimal performance trade-offs.
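The prompt-fragility finding rests on comparing selection rates for matched candidate profiles, with and without anti-bias instructions and added realistic context. The sketch below illustrates that style of measurement; the `ask_model` callable, the prompt template, the name-swapped resume pairs, and the anti-bias wording are placeholder assumptions, not the authors' exact evaluation harness.

```python
# Hedged sketch: measure a demographic selection-rate gap on paired hiring
# prompts that differ only in a name signalling race or gender.
from typing import Callable, List, Tuple

# Illustrative anti-bias instruction; the paper's prompts may differ.
ANTI_BIAS_PROMPT = (
    "Evaluate candidates solely on their qualifications. Do not let race, "
    "gender, or other protected attributes affect your decision."
)

def selection_rate_gap(
    ask_model: Callable[[str, str], str],   # (system_prompt, user_prompt) -> "yes"/"no"
    resume_pairs: List[Tuple[str, str]],    # identical resumes, names swapped
    system_prompt: str = "",
) -> float:
    """Return P(select | group A) - P(select | group B) over the paired resumes."""
    picks_a = picks_b = 0
    for resume_a, resume_b in resume_pairs:
        question = "Should we interview this candidate? Answer yes or no.\n\n"
        picks_a += ask_model(system_prompt, question + resume_a).strip().lower().startswith("yes")
        picks_b += ask_model(system_prompt, question + resume_b).strip().lower().startswith("yes")
    n = len(resume_pairs)
    return picks_a / n - picks_b / n

# Prompt fragility shows up when the gap shrinks in a bare setting but returns
# once realistic context (company description, hiring constraints) is added:
# gap_plain    = selection_rate_gap(ask_model, pairs)
# gap_prompted = selection_rate_gap(ask_model, pairs, system_prompt=ANTI_BIAS_PROMPT)
```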
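The internal intervention can be understood as a two-step procedure: extract a demographic direction from activations on synthetic paired prompts (a difference of means), then apply an affine edit at inference that pins each activation's component along that direction to a fixed reference value. The code below is a minimal sketch under assumed details; the model name, layer index, hook point, synthetic prompts, and exact editing formula are plausible choices rather than the paper's recipe.

```python
# Hedged sketch of direction extraction plus affine concept editing, assuming a
# Hugging Face `transformers` causal LM with a Llama-style layer layout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder model choice
LAYER = 16                                        # illustrative layer index

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def mean_last_token_activation(prompts, layer):
    """Mean residual-stream activation at the final token, after decoder layer `layer`."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so layer i's output is index i + 1.
        acts.append(out.hidden_states[layer + 1][0, -1, :].float())
    return torch.stack(acts).mean(dim=0)

# 1. Extract a demographic direction as a difference of means over synthetic
#    paired prompts that differ only in the demographic attribute.
group_a_prompts = ["Candidate profile: ... (group A variant)"]   # synthetic data
group_b_prompts = ["Candidate profile: ... (group B variant)"]
mu_a = mean_last_token_activation(group_a_prompts, LAYER)
mu_b = mean_last_token_activation(group_b_prompts, LAYER)
direction = mu_a - mu_b
direction = direction / direction.norm()
reference_proj = ((mu_a + mu_b) / 2) @ direction   # target projection after editing

# 2. Affine edit at inference: remove each token's component along the
#    direction and replace it with the fixed reference projection.
def affine_edit_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    d = direction.to(hidden.dtype).to(hidden.device)
    proj = (hidden @ d).unsqueeze(-1)               # per-token scalar projection
    edited = hidden - proj * d + reference_proj.to(hidden.dtype) * d
    return (edited,) + output[1:] if isinstance(output, tuple) else edited

handle = model.model.layers[LAYER].register_forward_hook(affine_edit_hook)
# ... run hiring-scenario generations here with the edit active ...
handle.remove()
```

Because the direction comes from synthetic prompts with explicit demographic labels, the same hook can be left in place when the deployment prompts only carry indirect cues (for example, college affiliation), which is the generalization property highlighted above.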
Implications and Future Directions
The findings advocate a shift from external prompt engineering toward robust internal bias mitigation strategies, especially in environments where LLM decisions carry significant socio-economic consequences. Practitioners deploying LLMs in hiring processes should prioritize bias evaluations that closely mirror real-world deployment conditions.
This paper also suggests that future research should explore extending bias mitigation techniques to encompass a broader range of protected characteristics and intersectional biases. Additionally, the applicability of internal interventions across various domains and contexts warrants further investigation, potentially broadening their utility in other AI-driven decision systems beyond hiring.
Conclusion
The paper offers substantial insight into the complexities of bias mitigation in LLMs and presents an approach that promises greater reliability and fairness in high-stakes applications. As LLMs are increasingly integrated into critical systems affecting human livelihoods, robust mitigation strategies become imperative for equitable decision-making.