Mapping of compressed vocabulary at steering strength 3.0 to activation dynamics
Determine whether the compressed introspective vocabulary that emerges in Llama 3.1 at steering strength 3.0 (e.g., "stall," "hiccup," "blip," "spark") corresponds more directly to activation dynamics than the vocabulary observed at lower steering strengths, by testing vocabulary–activation mappings at strength 3.0 and comparing them with the mappings at lower strengths.
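One way such a comparison could be set up is sketched below. This is a minimal illustration, not the paper's method: it assumes each generated sample has been reduced to a triple of steering strength, a count of one vocabulary term, and a scalar activation metric, and it uses Spearman rank correlation as one possible reading of "maps more directly". The function names, the input layout, and the choice of statistic are all hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def correspondence_by_strength(samples):
    """Group (strength, term_count, activation_metric) triples by strength
    and return the Spearman correlation within each group.

    The triple layout is an assumption made for this sketch.
    """
    grouped = {}
    for strength, count, metric in samples:
        xs, ys = grouped.setdefault(strength, ([], []))
        xs.append(count)
        ys.append(metric)
    return {s: spearman(xs, ys) for s, (xs, ys) in grouped.items()}
```

Under these assumptions, a larger correlation magnitude at 3.0 than at lower strengths would be consistent with a more direct mapping, although a real test would also need confidence intervals and a control for the smaller vocabulary at 3.0.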
References
The dose-response curve (Section~\ref{sec:direction_properties}) raises an open question about steering strength 3.0. Whether the compressed vocabulary at 3.0 maps to activation dynamics more directly than vocabulary at lower strengths remains untested.
— When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
(2602.11358 - Dadfar, 11 Feb 2026) in Section 6.4 Layer Localisation and the 3.0 Question