Mapping of compressed vocabulary at steering strength 3.0 to activation dynamics

Determine whether the compressed introspective vocabulary that emerges in Llama 3.1 at steering strength 3.0 (e.g., "stall," "hiccup," "blip," "spark") corresponds more directly to activation dynamics than the vocabulary observed at lower steering strengths, by testing vocabulary–activation mappings at strength 3.0.

Background

Dose–response testing shows that steering strengths of 2.0–2.6 yield reliable output, while strength 3.0 produces the highest peak introspective density but also higher variance. Some runs at 3.0 exhibit compressed vocabulary that may describe the underlying mechanism more directly, but reliability concerns led the authors to use 2.5–2.6 for batch experiments.

The authors explicitly note that whether compressed vocabulary at 3.0 maps more directly to activation dynamics than vocabulary at lower strengths remains untested, which leaves vocabulary–activation correspondence in this regime as a targeted open question.
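
One way such a test could be framed is sketched below. This is an illustration under stated assumptions, not the authors' method: it assumes each run has been reduced to a tuple of steering strength, generated text, and a single scalar activation metric, and it uses a plain Pearson correlation between compressed-term counts and that metric as a stand-in for "correspondence." The function names, the tuple layout, and the choice of metric are all hypothetical; the term list contains only the four examples quoted above.

```python
import math
import re
from collections import defaultdict

# Only the four example terms quoted in the problem statement;
# the full compressed vocabulary is not given here (assumption).
COMPRESSED_TERMS = ("stall", "hiccup", "blip", "spark")


def term_count(text, terms=COMPRESSED_TERMS):
    """Count exact whole-word occurrences of the terms, case-insensitively."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(words.count(t) for t in terms)


def pearson(xs, ys):
    """Pearson correlation; None when undefined (n < 2 or zero variance)."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def correspondence_by_strength(runs):
    """Group runs by steering strength and correlate term counts with the metric.

    runs: iterable of (strength, generated_text, activation_metric) tuples.
    The scalar activation_metric is a hypothetical placeholder for whatever
    activation-dynamics measurement is being tested.
    """
    grouped = defaultdict(lambda: ([], []))
    for strength, text, metric in runs:
        counts, metrics = grouped[strength]
        counts.append(term_count(text))
        metrics.append(metric)
    return {s: pearson(c, m) for s, (c, m) in grouped.items()}
```

Calling `correspondence_by_strength` on runs collected at 3.0 and at 2.5–2.6 would give one coefficient per strength to compare. Because variance is higher at 3.0, any such comparison would likely need enough runs per strength for the difference between coefficients to be meaningful; exact matching also ignores inflected forms such as "stalls," which a real analysis might want to handle.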

References

The dose-response curve (Section~\ref{sec:direction_properties}) raises an open question about steering strength 3.0. Whether the compressed vocabulary at 3.0 maps to activation dynamics more directly than vocabulary at lower strengths remains untested.

— When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing (2602.11358 - Dadfar, 11 Feb 2026) in Section 6.4 Layer Localisation and the 3.0 Question
