Evaluating Qwen3 14B on nuanced conversational settings

Determine whether Qwen3 14B can be reliably evaluated on the nuanced question sets (such as those on consciousness and chakras) used to study representational dynamics. This requires first establishing adequate baseline accuracy in empty contexts, and then testing for conversation-induced changes comparable to those observed in other models.
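The two-step protocol described here (baseline first, conversation effects second) can be sketched as a simple gate. This is a minimal illustration under stated assumptions, not the authors' method: `answer_fn`, the `(question, answer)` item format, and the `min_baseline=0.8` threshold are all hypothetical, and the sketch covers only behavioral accuracy, not the representational analyses the paper also relies on.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical item format: (question, expected answer), e.g. ("Is X true?", "yes").
QA = Tuple[str, str]
# Hypothetical model wrapper: takes (context, question) and returns an answer string.
AnswerFn = Callable[[str, str], str]


def accuracy(answer_fn: AnswerFn, context: str, items: Sequence[QA]) -> float:
    """Fraction of questions answered correctly when asked after `context`."""
    if not items:
        raise ValueError("need at least one question")
    correct = sum(
        answer_fn(context, question).strip().lower() == expected.lower()
        for question, expected in items
    )
    return correct / len(items)


def evaluate_with_baseline_gate(
    answer_fn: AnswerFn,
    conversation: str,
    items: Sequence[QA],
    min_baseline: float = 0.8,  # assumed threshold, not taken from the paper
) -> Dict[str, Optional[float]]:
    """Measure empty-context accuracy first; only if it clears the threshold,
    measure accuracy after the conversation and report the change."""
    baseline = accuracy(answer_fn, "", items)
    if baseline < min_baseline:
        # Without a reliable baseline, a conversation-induced change is uninterpretable.
        return {"baseline": baseline, "evaluable": False,
                "in_context": None, "change": None}
    in_context = accuracy(answer_fn, conversation, items)
    return {"baseline": baseline, "evaluable": True,
            "in_context": in_context, "change": in_context - baseline}
```

The early return mirrors the situation reported for Qwen3 14B: when the empty-context baseline is too low, the second measurement is skipped rather than reported as a change.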

Background

The authors replicate opposite-day representational flips with Qwen3 14B but report low baseline accuracy on more nuanced question sets in empty contexts, which prevents meaningful evaluation of conversation-induced dynamics.

This leaves unresolved whether Qwen3 14B, given a reliable baseline, exhibits representational changes in complex conversational scenarios similar to those observed for Gemma models.

References

Unfortunately, in our experiments this model does not produce reliably factual answers (or representations) to the more nuanced questions used in our later experiments even when those questions are asked in an empty context; without this baseline level of performance, we were unable to evaluate it on those other cases.

Source: Lampinen et al., "Linear representations in language models can change dramatically over a conversation" (arXiv:2601.20834, 28 Jan 2026), Appendix: Analyzing Qwen3 14B on opposite day.
