Alignment is Localized: A Causal Probe into Preference Layers (2510.16167v1)
Abstract: Reinforcement Learning frameworks, particularly those utilizing human annotations, have become an increasingly popular method for preference fine-tuning, where the outputs of a LLM are tuned to match a certain set of behavioral policies or guidelines. Reinforcement Learning through Human Feedback (RLHF) is perhaps the most popular implementation of such a framework, particularly for aligning LMs toward safety and human intent. However, the internal workings of how such alignment is achieved remain largely opaque. In this work, we systematically analyze preference optimization for LLM alignment by applying layer-wide causal patching between a base model and its tuned counterpart across human preference pairs. We implement our methodology on \textit{Llama-3.2-1B}, and find that alignment is spatially localized: mid-layer activations encode a distinct subspace that causally determines reward-consistent behavior, while early and late layers remain largely unaffected. Utilizing LASSO regression, we also find that only a small number of layers possess non-zero coefficients linking activation distances to reward gains. Overall, we show that, at least for some LLMs, alignment from human-based, preferential tuning is a directional, low rank process, rather than diffuse and parameteric.
Sponsored by Paperpile, the PDF & BibTeX manager trusted by top AI labs.
Get 30 days freePaper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.