Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering (2505.15038v1)
Published 21 May 2025 in cs.CL and cs.AI
Abstract: Linear concept vectors have proven effective for steering LLMs. While existing approaches such as linear probing and difference-in-means derive these vectors from LLM hidden representations, diverse data introduces noise (i.e., irrelevant features) that undermines steering robustness. To address this, we propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which use Sparse Autoencoders to filter noisy features out of hidden representations. When applied to linear probing and difference-in-means, our method improves their steering success rates. We validate our noise hypothesis through counterfactual experiments and feature visualizations.
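
The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It computes a difference-in-means concept vector twice: once from raw hidden states, and once after passing them through a sparse autoencoder with all but a few features zeroed out. The abstract does not say how SDCV chooses which features to keep, so the selection rule here (largest gap in mean activation between positive and negative examples) is an assumption, as are the toy identity-weight autoencoder, the data, and every function name.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def mean_rows(rows: Sequence[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def diff_in_means(pos: Sequence[Vector], neg: Sequence[Vector]) -> Vector:
    """Concept vector: mean(positive states) minus mean(negative states)."""
    return [p - q for p, q in zip(mean_rows(pos), mean_rows(neg))]


def sae_encode(x: Vector, W_enc: Matrix, b_enc: Vector) -> Vector:
    """Sparse feature activations: ReLU(W_enc @ x + b_enc)."""
    return [max(0.0, a + b) for a, b in zip(matvec(W_enc, x), b_enc)]


def sae_decode(f: Vector, W_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct a hidden state from feature activations."""
    return [a + b for a, b in zip(matvec(W_dec, f), b_dec)]


def top_k_features(pos, neg, W_enc, b_enc, k: int) -> List[int]:
    """ASSUMED selection rule (not from the paper): keep the k features
    whose mean activation differs most between the two example sets."""
    f_pos = mean_rows([sae_encode(x, W_enc, b_enc) for x in pos])
    f_neg = mean_rows([sae_encode(x, W_enc, b_enc) for x in neg])
    gaps = [abs(p - q) for p, q in zip(f_pos, f_neg)]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:k]


def denoise(x, keep, W_enc, b_enc, W_dec, b_dec) -> Vector:
    """Encode, zero every feature not in `keep`, then decode."""
    f = sae_encode(x, W_enc, b_enc)
    masked = [v if i in keep else 0.0 for i, v in enumerate(f)]
    return sae_decode(masked, W_dec, b_dec)


# Toy setup: 3-dim hidden states and an identity "SAE" (hypothetical weights).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [0.0, 0.0, 0.0]

# Dimension 0 carries the concept; dimension 2 is unrelated variation.
pos = [[2.0, 0.0, 1.0], [2.0, 0.0, 3.0]]
neg = [[0.0, 0.0, 1.5], [0.0, 0.0, 1.5]]

raw_vector = diff_in_means(pos, neg)

keep = top_k_features(pos, neg, identity, zeros, k=1)
pos_clean = [denoise(x, keep, identity, zeros, identity, zeros) for x in pos]
neg_clean = [denoise(x, keep, identity, zeros, identity, zeros) for x in neg]
denoised_vector = diff_in_means(pos_clean, neg_clean)

print("raw:     ", raw_vector)
print("denoised:", denoised_vector)
```

In this toy case the raw vector picks up a nonzero component along the unrelated dimension, while the denoised vector points only along the concept dimension. A real setting would use a pretrained sparse autoencoder and hidden states taken from a specific LLM layer, and the same denoised states could feed a linear probe instead of difference-in-means.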