Unsupervised correction for lexically-adjusted polysemanticity metrics
Develop an unsupervised version of the lexically-adjusted polysemanticity score that identifies and discounts lexical-identity contributions to neuron-level polysemanticity in transformer MLP activations without using sense annotations, so that the correction can be applied beyond controlled, sense-labeled evaluations.
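One possible label-free starting point, offered only as an illustrative sketch and not as the method of Hou et al., is to treat token identity as a proxy for lexical identity and subtract each token type's mean activation from its occurrences. The function names `remove_lexical_identity` and `lexical_variance_fraction` are hypothetical, and the array layout (occurrences by neurons) is an assumption. This sketch does not compute the lexically-adjusted polysemanticity score, which the text here does not define. It only shows how a lexical-identity component could be estimated and removed without sense annotations.

```python
import numpy as np


def remove_lexical_identity(activations, token_ids):
    """Subtract each token type's mean activation from its occurrences.

    activations: (n_occurrences, n_neurons) array of MLP activations (assumed layout).
    token_ids:   (n_occurrences,) array of token ids, used as a label-free
                 stand-in for lexical identity (an assumption of this sketch).
    """
    activations = np.asarray(activations, dtype=float)
    token_ids = np.asarray(token_ids)
    # Map arbitrary token ids to dense group indices 0..K-1.
    _, groups = np.unique(token_ids, return_inverse=True)
    groups = groups.ravel()
    n_types = groups.max() + 1
    # Accumulate per-type sums, then divide by per-type counts to get means.
    sums = np.zeros((n_types, activations.shape[1]))
    np.add.at(sums, groups, activations)
    counts = np.bincount(groups, minlength=n_types)[:, None]
    type_means = sums / counts
    return activations - type_means[groups]


def lexical_variance_fraction(activations, token_ids):
    """Per-neuron share of activation variance explained by token identity."""
    activations = np.asarray(activations, dtype=float)
    residual = remove_lexical_identity(activations, token_ids)
    total_var = activations.var(axis=0)
    resid_var = residual.var(axis=0)
    # Guard against neurons with zero variance (fraction reported as 0).
    safe_total = np.where(total_var > 0, total_var, 1.0)
    return np.where(total_var > 0, 1.0 - resid_var / safe_total, 0.0)
```

A downstream polysemanticity metric could then be computed on the residual activations. Token identity cannot separate senses of the same word form, so this is a coarse proxy for what sense labels provide.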
References
The current implementation requires sense labels, limiting it to controlled evaluations; extending this to an unsupervised correction is an open problem (see the Discussion section).
— Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics
(2604.00443 - Hou et al., 1 Apr 2026) in Section 6 (Results), Subsection "The confound is correctable"