Investigate failure of final-layer linear probe in Bias in Bios classifier

Determine why a logistic-regression linear probe trained on Pythia-70M final-layer representations fails to achieve above-chance accuracy in the Bias in Bios profession classification setup, and identify the factors that make penultimate-layer representations usable for this probe while final-layer representations are not.

Background

In the Bias in Bios experiments, the classifier is a logistic-regression probe trained on mean-pooled residual stream activations from the penultimate layer of Pythia-70M. The authors report that they could not fit a probe with above-chance accuracy on final-layer representations, which is why they used penultimate-layer features instead.
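The pipeline described above (mean-pool per-token activations, then fit a logistic-regression probe) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the activations are synthetic stand-ins instead of real Pythia-70M outputs, binary labels are assumed, and the names `mean_pool`, `fit_logistic_probe`, and `probe_accuracy`, along with all hyperparameters, are hypothetical.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average token activations over non-padding positions.

    hidden: (batch, seq, d_model) activations from one layer.
    mask:   (batch, seq), 1 for real tokens and 0 for padding.
    """
    mask = mask[..., None].astype(hidden.dtype)
    token_counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return (hidden * mask).sum(axis=1) / token_counts


def fit_logistic_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Binary logistic regression fit by full-batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)  # clip logits for numerical stability
        p = 1.0 / (1.0 + np.exp(-z))
        err = p - y
        w -= lr * (X.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples whose predicted label matches y."""
    preds = ((X @ w + b) > 0).astype(int)
    return float((preds == y).mean())


# Synthetic stand-in for one layer's activations (assumption: not real model data).
rng = np.random.default_rng(0)
batch, seq, d_model = 200, 12, 16
labels = rng.integers(0, 2, size=batch)
direction = rng.normal(size=d_model)
hidden = rng.normal(size=(batch, seq, d_model))
hidden += 0.5 * (2 * labels - 1)[:, None, None] * direction  # inject a label signal

# Random sequence lengths simulate right-padded inputs.
lengths = rng.integers(6, seq + 1, size=batch)
mask = (np.arange(seq)[None, :] < lengths[:, None]).astype(int)

X = mean_pool(hidden, mask)
w, b = fit_logistic_probe(X[:150], labels[:150])
acc = probe_accuracy(X[150:], labels[150:], w, b)
print(f"held-out accuracy: {acc:.3f}")
```

Running the same routine on penultimate-layer and final-layer activations, and comparing each held-out accuracy against the majority-class rate, gives a direct check of the reported discrepancy.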

Understanding this discrepancy would clarify representational properties across layers and guide best practices for layer selection in linear probing and downstream classification.

References

"For unclear reasons, we were unable to fit a probe with greater-than-chance accuracy when applying logistic regression to representations extracted from the final layer; this is why we used penultimate layer representations above."

Source: Marks et al., Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (arXiv:2403.19647, 2024), Appendix: Implementation Details for Classifier Experiments, Classifier training
