Affine Steering of LLM Representations
- The paper presents a systematic framework for affine steering of LLM representations to match ideal, fair distributions via closed-form transformation parameters.
- It employs KL divergence minimization to compute group-specific affine maps that align internal activations with fairness constraints such as demographic parity and equal opportunity.
- Empirical validations on synthetic and real-world tasks demonstrate that the approach reduces fairness gaps while preserving or enhancing predictive performance.
Affine steering of LLM representations encompasses a suite of techniques for directly intervening in the internal activations of models, with the objective of causally guiding model behavior along semantically meaningful axes. In the context of LLMs, affine steering refers not only to shifting activations by a fixed vector (additive control) but, more generally, to applying affine transformations (linear scaling plus translation) grounded in clear statistical or behavioral targets. Among key applications, one prominent example is the imposition of group fairness constraints: by mapping internal representations toward distributions that provably ensure group-fair outcomes, it is possible to guarantee that the Bayes-optimal classifier operating on these representations achieves exact equal opportunity or demographic parity, with minimal sacrifice to predictive utility (Sharma et al., 19 Sep 2025). This article systematically outlines the theoretical formulation, algorithms, empirical validation, and practical implications of affine steering for fairness in LLM internal representation spaces.
1. Exact Fairness via Representation Steering
The foundational notion is that of an “ideal” distribution: a joint distribution over features $X$, sensitive attribute $A$, and label $Y$ on which, for any (cost-sensitive) risk, the optimal classifier automatically satisfies a group-fair criterion such as demographic parity or equal opportunity. In the parametric setting, for example where the group- and class-conditional feature distributions are Gaussian, this ideal distribution can be analytically characterized via explicit constraints on the means, covariances, and mixture weights:
- For binary $A$ and $Y$, the features are modeled per group–class pair as $X \mid (A=a, Y=y) \sim \mathcal{N}(\mu_{a,y}, \sigma_{a,y}^2)$ with mixture weights $\pi_{a,y} = P(A=a, Y=y)$.
- Constraints for ideality require, e.g., $\mu_{1,1} - \mu_{1,0} = \mu_{0,1} - \mu_{0,0}$ and $\sigma_{1,y} = \sigma_{0,y}$, ensuring that conditional class separations and variances are matched across groups.
An optimization program is defined to seek, among all possible ideal distributions $Q \in \mathcal{I}$, the one nearest to the empirical (observed) data distribution $P$ in Kullback–Leibler (KL) divergence, $Q^\star = \arg\min_{Q \in \mathcal{I}} \mathrm{KL}(P \,\Vert\, Q)$, where $\mathcal{I}$ denotes the set of all ideal (fair) distributions. This program anchors all subsequent affine steering steps.
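To make the empirical side of this program concrete, the observed distribution $P$ is summarized by per-(group, class) moments of the LLM representations. Below is a minimal sketch of that estimation step, assuming NumPy arrays of representations and integer codes for $A$ and $Y$; `fit_group_class_gaussians` is a hypothetical helper name, not an interface from the paper:

```python
import numpy as np

def fit_group_class_gaussians(reps, groups, labels):
    """Estimate per-(group, class) parameters of the empirical distribution P:
    mean, per-coordinate variance (diagonal-covariance assumption), and the
    mixture weight pi_{a,y} = P(A=a, Y=y).

    reps: (n, d) array of LLM representations
    groups, labels: (n,) integer arrays encoding A and Y
    """
    params = {}
    for a in np.unique(groups):
        for y in np.unique(labels):
            mask = (groups == a) & (labels == y)
            x = reps[mask]
            params[(a, y)] = {
                "mu": x.mean(axis=0),   # empirical mean mu_{a,y}
                "var": x.var(axis=0),   # empirical variance sigma_{a,y}^2
                "pi": mask.mean(),      # empirical mixture weight pi_{a,y}
            }
    return params
```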
2. Affine Transformation of LLM Internal Representations
Once target (ideal) distributional parameters $(\mu^\star_{a,y}, \sigma^\star_{a,y})$ are determined by constrained KL minimization, the central affine steering step is to shift the empirical LLM representation $x$ to a new variable via a group- and class-specific affine map $T_{a,y}(x) = s_{a,y}\,(x - \mu_{a,y}) + \mu^\star_{a,y}$, with the scale $s_{a,y} = \sigma^\star_{a,y}/\sigma_{a,y}$ explicitly selected to ensure that the transformed variable matches the ideal mean and variance for each group–class pair.
The practical steering is implemented by interpolating between the original and transformed representations, $x_{\text{steered}} = (1-\lambda)\,x + \lambda\,T_{a,y}(x)$, where $\lambda \in [0,1]$ gives the steering intensity.
This affine transformation guarantees that the first two moments of the internal representation distributions per group–class pair match those of the ideal (fair) distribution. The approach is applicable in any setting where the representation distributions can be meaningfully modeled as coming from a known parametric family (Gaussian or log-normal), but also generalizes to other families for which closed-form KL divergences and transformation rules are tractable.
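A minimal sketch of this steering step, reusing the hypothetical parameter dictionaries from the estimation sketch above and keeping the diagonal-covariance simplification (a full-covariance version via matrix square roots is sketched in Section 5 below):

```python
import numpy as np

def steer(x, a, y, emp, ideal, lam=1.0):
    """Group- and class-specific affine steering with interpolation.

    emp[(a, y)] and ideal[(a, y)] hold 'mu' and 'var' for the empirical and
    KL-nearest ideal distributions; lam in [0, 1] is the steering intensity.
    Coordinate-wise version of T_{a,y}(x) = s * (x - mu) + mu_star.
    """
    mu, var = emp[(a, y)]["mu"], emp[(a, y)]["var"]
    mu_star, var_star = ideal[(a, y)]["mu"], ideal[(a, y)]["var"]
    s = np.sqrt(var_star / var)         # per-coordinate scale sigma*/sigma
    t = s * (x - mu) + mu_star          # moment-matching affine map T_{a,y}(x)
    return (1.0 - lam) * x + lam * t    # interpolate toward the ideal
```

With `lam=1.0` the first two moments of the steered representation match the ideal distribution exactly; smaller values trade fidelity to the original activations against proximity to the fair target.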
3. Algorithms for KL-Optimal Fair Steering
The optimization for the ideal distribution (and thus the transformation parameters) is made tractable by closed-form expressions for KL divergence between mixtures of Gaussians. For instance, in the univariate case, $\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2)\,\Vert\,\mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\tfrac{\sigma_2}{\sigma_1} + \tfrac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \tfrac{1}{2}$, minimized subject to the ideality constraints among the $(\mu_{a,y}, \sigma_{a,y})$ detailed above. For the “affirmative action” scenario (only underprivileged groups are transformed), the minimization is convex and admits closed-form or efficient 1D search solutions for the scaling parameter $s = \sigma^\star/\sigma$ linking ideal and original standard deviations. More general, multivariate extensions follow similar logic with matrix square roots.
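To illustrate the univariate machinery, here is a sketch of the closed-form Gaussian KL together with a simple grid search over the scaling parameter; the grid search stands in for the efficient 1D search mentioned above and is not the paper's exact procedure:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) ) for univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

def best_scale(mu, sigma, mu_star, grid=None):
    """1D search over the scale s linking ideal and original standard
    deviations (sigma_star = s * sigma): pick the s whose implied ideal
    component is KL-nearest to the empirical one."""
    grid = np.linspace(0.1, 3.0, 2000) if grid is None else grid
    kls = [kl_gauss(mu, sigma, mu_star, s * sigma) for s in grid]
    return grid[int(np.argmin(kls))]
```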
4. Empirical Validation: Synthetic and Real-World Tasks
Empirical analysis demonstrates the efficacy of the approach:
- Synthetic Experiments: Experiments on simulated data with known group-dependent disparities verify that affine steering can map observed representations to the nearest ideal distribution, achieving near-zero fairness gaps (differences in TPR, FPR, etc.; a minimal sketch of the gap metric follows this list) for a variety of plausible class priors and variances. In several cases, downstream classifier error (Bayes error) actually decreases, since distributional bias left unchecked may harm both fairness and utility.
- Real-World Benchmark—Bias in Bios: In multi-class occupation classification, LLM representations (e.g., from Llama-2) are steered using the derived affine map. The classifier trained on these representations achieves marked reductions in fairness gaps (TPR-gap across gender/profession slices) with almost no decrease in overall accuracy (e.g., maintained around 0.77–0.79). This result holds even in comparison to prior baselines such as MiMiC or LEACE, with several cases showing improved utility after steering (Sharma et al., 19 Sep 2025).
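For reference, a sketch of the TPR-gap metric reported in these evaluations, in its standard equal-opportunity form (binarized predictions per slice are assumed):

```python
import numpy as np

def tpr_gap(y_true, y_pred, groups, positive=1):
    """Equal-opportunity gap: spread of true-positive rates across groups,
    computed on examples whose true label is the positive class."""
    tprs = []
    for a in np.unique(groups):
        mask = (groups == a) & (y_true == positive)
        tprs.append((y_pred[mask] == positive).mean())
    return max(tprs) - min(tprs)
```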
5. Role of Parametric Families and Algorithmic Extension
The framework flexibly extends to various parametric families. For multivariate normal and log-normal families, fairness constraints can be re-expressed as relationships among group- and class-wise means and covariances, and the convex structure of trace and log-determinant terms in KL divergence ensures the optimization over means and variances remains tractable. Closed-form updates are available for the univariate case, and either convex optimization or efficient line search suffices for higher dimensions or more intricate subgroup structures.
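One natural multivariate construction, sketched below, is a whitening-then-coloring map built from matrix square roots; it matches the first two moments as described above, though the paper's exact multivariate map may take a different (e.g., optimal-transport) form:

```python
import numpy as np
from scipy.linalg import sqrtm

def multivariate_steer_map(mu, cov, mu_star, cov_star):
    """Affine map T(x) = W x + b with W = cov_star^{1/2} cov^{-1/2} and
    b = mu_star - W mu, so that X ~ N(mu, cov) implies T(X) ~ N(mu_star, cov_star)."""
    w = np.real(sqrtm(cov_star)) @ np.linalg.inv(np.real(sqrtm(cov)))
    b = mu_star - w @ mu
    return w, b
```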
A summary of technical steps:
| Step | Operation | Mathematical Object(s) |
|---|---|---|
| Parametric ideality setup | Specify class/group-conditional forms | $X \mid (A=a, Y=y) \sim \mathcal{N}(\mu_{a,y}, \sigma_{a,y}^2)$ |
| KL minimization | Optimize nearest $Q^\star \in \mathcal{I}$ | $\min_{Q \in \mathcal{I}} \mathrm{KL}(P \,\Vert\, Q)$ under group constraints |
| Affine steering map | Compute $T_{a,y}$ | $s_{a,y} = \sigma^\star_{a,y}/\sigma_{a,y}$, $T_{a,y}(x) = s_{a,y}(x - \mu_{a,y}) + \mu^\star_{a,y}$ |
| Activation update | Apply $x \leftarrow (1-\lambda)\,x + \lambda\,T_{a,y}(x)$ | $\lambda$ (steering strength parameter) |
The closed-form and convexity guarantees enable robust and efficient deployment for a range of fairness-sensitive tasks.
6. Theoretical Significance and Practical Implications
By explicitly tying the distributional parameters (first and second moments) of internal LLM representations to provably fair “ideal” versions, the affine steering framework ensures that downstream classifiers—without further supervision—inherit group fairness properties by construction. Unlike post-hoc regularization or adversarial correction schemes, this approach does not mandate any compromise in predictive performance: in several scenarios, utility is unchanged or increased after steering. Notably, the method respects the original data distribution as closely as possible (in KL sense), minimizing excess distortion.
Practical applications highlighted include:
- Fairness interventions in multi-class classification (e.g., occupational prediction from biographies), sharply reducing TPR/FPR disparities for protected groups without retraining (Sharma et al., 19 Sep 2025).
- Systematic redistribution of latent representations in LLMs to guarantee parity or opportunity constraints across demographics, with efficient closed-form updates for widespread use in real-time systems or batch pipelines.
A plausible implication is that, by extending beyond additive shifts to general affine transformations rooted in the distributional moment structure, similar techniques may apply to other group-conditional alignment objectives: for instance, equalizing calibration or compositionality properties across groups.
7. Future Directions and Limitations
Current methods assume the feasibility of modeling internal representations within chosen parametric families; extending to highly non-Gaussian or heavy-tailed data may require nonparametric generalizations. While all closed-form formulas and convex programs presented are tractable for common LLM settings, handling large numbers of sensitive groups or labels may require more scalable implementations. Finally, while empirical results indicate no substantial utility-fairness trade-off, further validation on diverse LLM architectures and more complex real-world classification/regression tasks is warranted.
Overall, affine steering offers a rigorous, algorithmically efficient path to provable group fairness in downstream LLM outputs, and its underlying methodology provides a foundation for future developments in fair and interpretable representation engineering.