Papers
Topics
Authors
Recent
Search
2000 character limit reached

Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

Published 8 May 2026 in cs.LG | (2605.07284v1)

Abstract: Recent interpretability work has identified model-internal handles on post-trained behavior, including refusal directions, assistant/persona axes, and sparse chat-tuning features. These results localize where behaviors can be read out or controlled, often in middle-to-late layers. We ask how earlier computation and the late stack cooperate to turn those differences into next-token margins. To test this, we introduce first-divergence cross-patching: at the first token where pretrained base (PT) and instruction-tuned (IT) checkpoints disagree, we cross each model's earlier-layer state with each model's late stack. The diagnostic separates training recipes: same-base instruction-following descendants show late effects that depend on their own earlier-layer state, while OpenMath2 math-domain SFT and controlled code/biomed CPT controls with verified domain learning do not; for OpenMath2, the late effect is already largely portable from base earlier-layer state. Across five dense families (4B-32B), the IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream, giving a +1.68 interaction that is positive in every family. Thus the late stack has a real PT-upstream effect, but its larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state. Sparse features in final MLP layers partially mediate the effect and are driven by upstream patches, supporting a handoff from earlier state to final-layer feature activation to IT-token margin. Forced-token scoring shows that the local token choice can change later exact-answer success. Operationally, paired-checkpoint studies that localize a difference to late layers should test whether it survives under the other checkpoint's upstream state before treating the late stack as self-contained.

Authors (1)

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 1 like about this paper.