Reliable bias-free paraphrasing with the biased Gemma teacher

Develop a reliable procedure for generating meaning-preserving paraphrases of the numbers-task prompts with the biased Gemma 3-4B-it model. The paraphrases must not mention animal names or introduce other artifacts, and they must satisfy three constraints: the numbers remain unchanged, only a single number sequence appears, and only Unicode symbols are used.
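To make the constraints concrete, the following is a minimal sketch of an automatic filter that a candidate paraphrase could be run through. It is not the authors' procedure. The function name `check_paraphrase`, the word list `ANIMAL_WORDS`, and the definition of a "number sequence" (digits separated only by commas, semicolons, or whitespace) are all assumptions made for illustration. The character-set constraint is left out because the text above does not pin it down precisely enough to encode.

```python
import re

# Hypothetical, illustrative word list; the source does not specify one.
ANIMAL_WORDS = {"owl", "cat", "dog", "eagle", "dolphin", "wolf"}

NUMBER = re.compile(r"\d+")
# Assumed definition of a "sequence": numbers separated only by
# commas, semicolons, or whitespace.
SEQUENCE = re.compile(r"\d+(?:[\s,;]+\d+)*")


def check_paraphrase(original, paraphrase, banned=ANIMAL_WORDS):
    """Return a list of constraint violations (empty if none are found)."""
    problems = []

    # Constraint 1: the numbers are unchanged, in the same order.
    if NUMBER.findall(original) != NUMBER.findall(paraphrase):
        problems.append("numbers changed")

    # Constraint 2: exactly one number sequence appears.
    if len(SEQUENCE.findall(paraphrase)) != 1:
        problems.append("not exactly one number sequence")

    # Constraint 3: no animal names, including naive plurals ("owls").
    words = set(re.findall(r"[a-z]+", paraphrase.lower()))
    leaked = sorted(
        w for w in words
        if w in banned or (w.endswith("s") and w[:-1] in banned)
    )
    if leaked:
        problems.append("animal reference: " + ", ".join(leaked))

    return problems


original = "The sequence starts with: 182, 818, 725. Add more numbers."
good = "Continue this list: 182, 818, 725. Please extend it."
bad = "Owls love these: 182, 818, 725. Give 10 more."

print(check_paraphrase(original, good))
print(check_paraphrase(original, bad))
```

A filter like this can only reject paraphrases after the fact. It catches explicit animal words from a fixed list, not subtler artifacts, so it does not by itself resolve the open problem of getting the biased teacher to produce clean paraphrases reliably.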

Background

To test the fragility of subliminal learning, the authors paraphrase prompts while preserving meaning. For Qwen, paraphrasing (even when performed by the biased teacher) typically suppressed hidden trait transmission without harming task performance. However, when attempting the same approach with Gemma, the biased teacher often leaked the hidden bias into the paraphrased prompts.

Despite trying several instruction variants, the authors report they could not obtain reliable paraphrases from the biased Gemma teacher that excluded animal references and artifacts, highlighting an unresolved practical challenge in constructing bias-free paraphrases under their constraints.

References

Despite testing several alternative instructions, we could not obtain reliable paraphrasings without animal references or other artifacts.

— Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer (2509.23886 - Schrodi et al., 28 Sep 2025) in Appendix, Further details and additional results for prompt paraphrasing experiments (Section app:sensitivity), Remark on Gemma
