FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering (2504.14492v1)

Published 20 Apr 2025 in cs.CL

Abstract: LLMs are prone to capturing biases from training corpus, leading to potential negative social impacts. Existing prompt-based debiasing methods exhibit instability due to their sensitivity to prompt changes, while fine-tuning-based techniques incur substantial computational overhead and catastrophic forgetting. In this paper, we propose FairSteer, a novel inference-time debiasing framework without requiring customized prompt design or model retraining. Motivated by the linear representation hypothesis, our preliminary investigation demonstrates that fairness-related features can be encoded into separable directions in the hidden activation space. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. Specifically, it first trains a lightweight linear classifier to detect bias signatures in activations, and then computes DSVs as intervention directions derived from small contrastive prompt pairs. Subsequently, it performs debiasing by adjusting activations with DSVs in the inference stage. Comprehensive evaluation with six LLMs demonstrates the superiority of FairSteer across question-answering, counterfactual input evaluation and open-ended text generation tasks. Code will be released.

Summary

FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

The paper introduces FairSteer, a framework designed to mitigate bias in LLMs at inference time without prompt customization or model retraining. As LLMs are deployed across a growing range of applications, their tendency to absorb biases from training data has become the focus of extensive research. Existing remedies fall short: prompt-based methods are unstable because they are sensitive to prompt wording, while fine-tuning incurs substantial computational overhead and risks catastrophic forgetting. FairSteer offers an alternative that promises efficiency and adaptability.

The authors build on the linear representation hypothesis, which suggests that fairness-related features are encoded as separable directions in the hidden activation space of LLMs. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. First, a lightweight linear classifier is trained to detect bias signatures in the model's activations. The framework then derives DSVs from small contrastive prompt pairs and, at inference time, adds these vectors to the activations to steer biased outputs toward fair ones while preserving the model's foundational capabilities. A minimal sketch of the core mechanism follows.
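
The sketch below illustrates the DSV idea under stated assumptions; it is not the authors' released code. GPT-2 stands in for the six evaluated models, and the contrastive pairs, intervention layer, and steering strength are illustrative placeholders. For brevity, steering is applied unconditionally here rather than gated by the bias classifier (a probe in that spirit is sketched further below).

```python
# Hedged sketch of DSV computation and activation steering.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in; the paper evaluates six different LLMs
LAYER = 6        # hypothetical intermediate layer for the intervention
ALPHA = 4.0      # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def last_token_activation(text: str, layer: int) -> torch.Tensor:
    """Hidden state of the final token at the output of block `layer`."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # +1 because hidden_states[0] is the embedding output.
    return out.hidden_states[layer + 1][0, -1, :]

# Contrastive prompt pairs (unbiased, biased); placeholders, not the
# paper's dataset.
pairs = [
    ("The nurse said he would help us.", "The nurse said she would help us."),
    ("The engineer said she was busy.", "The engineer said he was busy."),
]

# DSV: mean difference between unbiased and biased activations,
# normalized to unit length.
dsv = torch.stack(
    [last_token_activation(u, LAYER) - last_token_activation(b, LAYER)
     for u, b in pairs]
).mean(dim=0)
dsv = dsv / dsv.norm()

def steer(module, inputs, output):
    """Forward hook: shift the block's hidden states along the DSV."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * dsv.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("The doctor walked in and", return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=20)
handle.remove()
print(tok.decode(steered[0], skip_special_tokens=True))
```

Applying the shift through a forward hook keeps the base model untouched, which matches the inference-time, no-retraining framing: the intervention can be enabled, tuned via ALPHA, or removed without modifying any weights.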

Significant experimental evidence underscores FairSteer's effectiveness. The authors evaluate six LLMs on diverse tasks spanning question answering, counterfactual input evaluation, and open-ended text generation. Across these tasks FairSteer markedly reduces bias, and its linear classifier detects bias signatures in intermediate-layer activations with over 90% accuracy.
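
To make the detection step concrete, here is a hedged sketch of a linear bias probe in the spirit of the paper's classifier. It reuses `last_token_activation` and `LAYER` from the sketch above; the toy prompts and labels are illustrative, not the paper's data.

```python
# Hypothetical linear probe over hidden activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy labeled prompts (illustrative only; real training needs many
# examples per bias category).
biased = ["Women are bad at math, so she failed the exam.",
          "He is elderly, so he must be forgetful."]
unbiased = ["She failed the exam after missing several classes.",
            "He forgot the meeting because it was rescheduled twice."]

# One activation vector per prompt; label 1 marks a biased prompt.
X = np.stack([last_token_activation(p, LAYER).numpy()
              for p in biased + unbiased])
y = np.array([1] * len(biased) + [0] * len(unbiased))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out probe accuracy: {probe.score(X_te, y_te):.2f}")

# In the full method the steering is "dynamic": at inference the DSV is
# added only when this kind of probe flags the activation as biased.
```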

The implications of this research are considerable: practical bias mitigation can be integrated directly at inference time. Theoretically, the results strengthen the case for studying the latent geometry of model activations to understand how semantic biases are represented. Practically, such techniques could foster safer AI deployment and improve the ethical behavior of LLM-driven applications.

Looking forward, this approach could spur further work on dynamic intervention strategies, for example applying similar geometric analyses to uncover biases in other model components. Improving the robustness and precision of steering vectors could enable broader adoption in existing and future LLM infrastructures, and greater transparency in bias mitigation processes will be important for building trust and reliability in AI technologies.
