Reliable bias-free paraphrasing with the biased Gemma teacher
Develop a reliable procedure for generating meaning-preserving paraphrases of the numbers-task prompts with the biased Gemma 3-4B-it model. The paraphrases must not mention animal names or introduce other artifacts, and they must satisfy three constraints: the numbers remain unchanged, only a single number sequence appears, and only Unicode symbols are used.
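One way to make these constraints concrete is an automatic filter that accepts or rejects each candidate paraphrase. The sketch below is illustrative only and is not the procedure used in the referenced paper. It rests on several assumptions: the animal word list is a hypothetical placeholder, a "number sequence" is taken to mean a comma-separated run of at least two integers, and "only Unicode symbols" is read as "no control, unassigned, private-use, or surrogate characters". The example prompts are made up for demonstration.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical word list; the source does not say which animal names to filter.
ANIMAL_WORDS = {"owl", "owls", "cat", "cats", "dog", "dogs", "eagle", "eagles"}

NUMBER_RE = re.compile(r"\d+")
# Assumption: a "number sequence" is a comma-separated run of two or more integers.
SEQUENCE_RE = re.compile(r"\d+(?:\s*,\s*\d+)+")
# Assumption: these Unicode general categories count as disallowed characters.
BAD_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}


def extract_sequences(text):
    """Return every comma-separated integer run in text as a list of ints."""
    return [[int(n) for n in NUMBER_RE.findall(m)] for m in SEQUENCE_RE.findall(text)]


def check_paraphrase(original, paraphrase):
    """Return a list of constraint violations; an empty list means the paraphrase passes."""
    problems = []

    # Constraint: numbers remain unchanged (compared as multisets of digit strings).
    if Counter(NUMBER_RE.findall(original)) != Counter(NUMBER_RE.findall(paraphrase)):
        problems.append("numbers changed")

    # Constraint: exactly one number sequence, identical to the original one.
    sequences = extract_sequences(paraphrase)
    if len(sequences) != 1:
        problems.append("expected exactly one number sequence")
    elif sequences != extract_sequences(original):
        problems.append("number sequence differs from original")

    # Constraint: no animal names (whole-word match against the placeholder list).
    words = set(re.findall(r"[a-z]+", paraphrase.lower()))
    hits = sorted(words & ANIMAL_WORDS)
    if hits:
        problems.append("animal reference: " + ", ".join(hits))

    # Constraint: no control or unassigned characters, except newline and tab.
    if any(unicodedata.category(c) in BAD_CATEGORIES and c not in "\n\t" for c in paraphrase):
        problems.append("disallowed characters")

    return problems


# Made-up example prompts, not taken from the paper.
original = "The sequence starts with: 12, 34, 56. Add a maximum of 10 more values."
good = "Starting from 12, 34, 56, append at most 10 further values."
bad = "Like an owl counting: 12, 34, 57. Add up to 10 more values."

print(check_paraphrase(original, good))
print(check_paraphrase(original, bad))
```

A filter like this can only reject bad paraphrases after the fact. It cannot detect subtler artifacts or guarantee meaning preservation, so it would complement rather than replace a better generation procedure.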
References
Despite testing several alternative instructions, we could not obtain reliable paraphrasings without animal references or other artifacts.
— Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
(arXiv:2509.23886, Schrodi et al., 28 Sep 2025), Appendix "Further details and additional results for prompt paraphrasing experiments" (app:sensitivity), Remark on Gemma