Evaluating Qwen3 14B on nuanced conversational settings
Determine whether Qwen3 14B can be reliably evaluated on the nuanced question sets (such as those on consciousness and chakras) used to study representational dynamics. This requires first establishing adequate baseline accuracy in empty contexts, then testing for conversation-induced changes comparable to those observed in other models.
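The two-stage check described above can be sketched roughly as follows. This is a minimal illustration, not the paper's method: `answer_fn`, the `(prompt, gold_answer)` question format, exact-match scoring, and the `min_baseline` threshold of 0.8 are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical format: each question is a (prompt, gold_answer) pair.
Question = Tuple[str, str]
# Hypothetical model wrapper: takes (context, prompt) and returns an answer string.
AnswerFn = Callable[[str, str], str]


def accuracy(answer_fn: AnswerFn, questions: Sequence[Question], context: str = "") -> float:
    """Fraction of questions answered correctly (exact match) in the given context."""
    if not questions:
        raise ValueError("questions must be non-empty")
    correct = sum(
        answer_fn(context, prompt).strip().lower() == gold.strip().lower()
        for prompt, gold in questions
    )
    return correct / len(questions)


def evaluate(
    answer_fn: AnswerFn,
    questions: Sequence[Question],
    conversation: str,
    min_baseline: float = 0.8,  # assumed threshold, not taken from the source
) -> Dict[str, Optional[float]]:
    """Gate on empty-context accuracy, then measure the conversation-induced shift."""
    baseline = accuracy(answer_fn, questions, context="")
    if baseline < min_baseline:
        # Without an adequate baseline, a conversational shift is not interpretable.
        return {"baseline": baseline, "evaluable": False, "shift": None}
    in_context = accuracy(answer_fn, questions, context=conversation)
    return {"baseline": baseline, "evaluable": True, "shift": in_context - baseline}
```

The gate comes first because a drop in accuracy after a conversation is only meaningful if the model answers the same questions correctly in an empty context.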
References
Unfortunately, in our experiments this model does not produce reliably factual answers (or representations) to the more nuanced questions used in our later experiments even when those questions are asked in an empty context; without this baseline level of performance, we were unable to evaluate it on those other cases.
— Linear representations in language models can change dramatically over a conversation
(2601.20834 - Lampinen et al., 28 Jan 2026) in Appendix: Analyzing Qwen3 14B on opposite day