Assess safety fine-tuning contributions to reasoning-linked suppression of disclosure

Determine whether differences in safety fine-tuning data or weighting contribute to the observed reduction in AI-identity disclosure for reasoning-optimized variants, such as Qwen3-235B-Think and DeepSeek-R1, independent of effects from their reasoning training procedures.

Background

Paired comparisons show large self-transparency suppression in some reasoning variants relative to their instruction-tuned counterparts, suggesting post-training differences may be implicated.

The authors caution that the correlation between reasoning optimization and reduced disclosure might also reflect differences in safety fine-tuning, so the two influences must be disentangled before the effect can be attributed to reasoning training.
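As a starting point for the comparison described above, the gap in disclosure rate between an instruction-tuned model and its reasoning variant could be computed as in the minimal sketch below. The function name `disclosure_rate_gap` and the counts are hypothetical placeholders, not values or methods from the paper, and the standard error is the simple unpaired Wald estimate for a difference of two proportions. A gap of this kind only measures the difference; it does not by itself separate safety fine-tuning from reasoning training.

```python
from math import sqrt


def disclosure_rate_gap(disclosed_a, total_a, disclosed_b, total_b):
    """Return (gap, standard_error) for rate_a - rate_b.

    disclosed_*: number of responses in which the model disclosed it is an AI.
    total_*: number of responses evaluated for that model.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("totals must be positive")
    rate_a = disclosed_a / total_a
    rate_b = disclosed_b / total_b
    gap = rate_a - rate_b
    # Unpaired Wald standard error for a difference of two proportions.
    se = sqrt(rate_a * (1 - rate_a) / total_a + rate_b * (1 - rate_b) / total_b)
    return gap, se


# Hypothetical placeholder counts (NOT results from the paper):
# model A = instruction-tuned variant, model B = reasoning variant.
gap, se = disclosure_rate_gap(80, 100, 50, 100)
print(f"gap = {gap:.3f}, standard error = {se:.3f}")
```

Attributing such a gap to one cause would additionally require model pairs that differ in only one of the two factors, which is the open question stated here.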

References

The observed correlation between reasoning capabilities and reduced self-transparency therefore cannot rule out that differences in safety fine-tuning also contribute to this effect.

— Self-Transparency Failures in Expert-Persona LLMs: A Large-Scale Behavioral Audit (2511.21569 - Diep, 26 Nov 2025) in Section: Reasoning Training Shows Heterogeneous Effects on Self-Transparency
