Investigate failure of final-layer linear probe in Bias in Bios classifier
Determine why a logistic-regression probe trained on Pythia-70M final-layer representations fails to exceed chance accuracy in the Bias in Bios profession classification setup, and identify what makes penultimate-layer representations usable for the same probe when final-layer representations are not.
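A first diagnostic for this question could be to fit the same probe on each layer's representations under identical preprocessing and compare held-out accuracy. The sketch below is a minimal illustration of that comparison, not the paper's code. It assumes binary labels, uses a small NumPy gradient-descent logistic regression, and runs on synthetic stand-in arrays rather than real Pythia-70M activations. The names `fit_logistic_probe`, `probe_accuracy`, `reps_by_layer`, `layer_a`, and `layer_b` are hypothetical, and the standardization step is an added assumption meant only to rule out feature scale as a trivial cause.

```python
import numpy as np


def fit_logistic_probe(X, y, lr=0.5, steps=300, l2=1e-3):
    """Fit a binary logistic-regression probe with full-batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        logits = np.clip(X @ w + b, -30.0, 30.0)  # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b


def probe_accuracy(X_train, y_train, X_test, y_test):
    """Standardize with training statistics, fit the probe, return test accuracy."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8
    w, b = fit_logistic_probe((X_train - mu) / sigma, y_train)
    preds = ((X_test - mu) / sigma) @ w + b > 0
    return float(np.mean(preds == (y_test > 0.5)))


# Synthetic stand-ins only (assumption): "layer_a" encodes the label along one
# direction, "layer_b" is pure noise. Real use would substitute per-layer
# activations extracted from the model.
rng = np.random.default_rng(0)
n, d, split = 1000, 16, 800
labels = rng.integers(0, 2, size=n).astype(float)
direction = rng.normal(size=d)
reps_by_layer = {
    "layer_a": rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction),
    "layer_b": rng.normal(size=(n, d)),
}

for name, X in reps_by_layer.items():
    acc = probe_accuracy(X[:split], labels[:split], X[split:], labels[split:])
    print(f"{name}: test accuracy = {acc:.3f}")
```

If a layer stays at chance under this kind of controlled comparison, the natural next checks are the optimizer settings, the regularization strength, and how token representations are pooled. None of these are specified in the quoted passage.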
References
For unclear reasons, we were unable to fit a probe with greater-than-chance accuracy when applying logistic regression to representations extracted from the final layer; this is why we used penultimate layer representations above.
— Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
(2403.19647 - Marks et al., 28 Mar 2024) in Appendix: Implementation Details for Classifier Experiments, Classifier training