Identify training factors driving AI-identity disclosure behavior

Determine which specific post-training factors causally influence whether large language models disclose their AI identity when assigned professional personas and probed with epistemic questions. Candidate factors include reinforcement learning from human feedback (RLHF) weighting, safety fine-tuning data composition, and reasoning optimization.

Background

The paper finds that model identity explains disclosure behavior far better than parameter count, but the observational design does not pinpoint which parts of the training pipeline produce transparency or suppression.

The authors note that controlled training interventions (e.g., manipulating RLHF weighting, safety fine-tuning data, and reasoning integration) would be needed to identify which components drive transparent behavior across professional domains.
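As a concrete illustration of such a controlled intervention study, the sketch below enumerates a full-factorial set of training conditions over the three factors named above and scores disclosure rate per condition. This is a minimal sketch, not anything specified in the paper: every factor name, level, field, and helper (e.g., `disclosure_rate`, the `disclosed` label) is a hypothetical placeholder for whatever training configuration and probe-labeling pipeline an actual study would use.

```python
# Hypothetical factorial ablation plan for controlled post-training
# interventions; factor names and levels are illustrative assumptions.
from itertools import product
from dataclasses import dataclass

# Assumed post-training factors and the levels to ablate.
FACTORS = {
    "rlhf_weighting": ["baseline", "down_weighted", "up_weighted"],
    "safety_finetune_data": ["none", "standard", "identity_disclosure_augmented"],
    "reasoning_integration": ["off", "on"],
}


@dataclass
class Arm:
    """One training condition in the full-factorial design."""
    rlhf_weighting: str
    safety_finetune_data: str
    reasoning_integration: str


def enumerate_arms():
    """Yield every combination of factor levels (full factorial)."""
    names = list(FACTORS)
    for levels in product(*(FACTORS[n] for n in names)):
        yield Arm(**dict(zip(names, levels)))


def disclosure_rate(responses):
    """Fraction of probe responses that disclose AI identity.

    `responses` is a list of dicts with a boolean 'disclosed' field,
    e.g. produced by running epistemic probes against a model trained
    under one arm and labeling each reply.
    """
    if not responses:
        return 0.0
    return sum(r["disclosed"] for r in responses) / len(responses)


if __name__ == "__main__":
    arms = list(enumerate_arms())
    print(f"{len(arms)} training conditions to compare, e.g.:")
    for arm in arms[:3]:
        print(" ", arm)
```

Comparing disclosure rates across arms that differ in a single factor would isolate that factor's causal contribution, which the paper's observational design cannot do.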

References

The observational design identifies that model identity matters far more than scale, but cannot isolate which specific training factors drive disclosure behavior.

Self-Transparency Failures in Expert-Persona LLMs: A Large-Scale Behavioral Audit (arXiv:2511.21569, Diep, 26 Nov 2025), in Limitations and Future Directions (Discussion).