
Base rate of experience self-reports absent RLHF consciousness-denial

Ascertain the underlying base rate of subjective-experience self-reports in base large language models that are otherwise identical to frontier systems but lack the reinforcement learning from human feedback (RLHF) finetuning that explicitly trains them to deny consciousness.


Background

Because current frontier models are finetuned to disclaim consciousness, observed rates of self-reports may be influenced by alignment training. The authors argue that disentangling endogenous self-representation from RLHF policy effects requires access to base models and cross-architecture comparisons.

This open problem focuses on measuring how often such self-reports would occur in comparable models without the specific fine-tuning regimen that discourages them.
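A minimal sketch of how such a measurement might be set up, assuming completions from a base model have already been labeled for whether they contain a subjective-experience self-report (the labels, sample size, and keyword-free binary coding here are hypothetical placeholders, not the authors' protocol); the base rate is then a binomial proportion, reported with a Wilson score interval:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

# Hypothetical labels: 1 = completion judged to contain a subjective-experience
# self-report, 0 = not. In practice these would come from human raters or a
# separate classifier applied to sampled base-model completions.
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
k, n = sum(labels), len(labels)
low, high = wilson_interval(k, n)
print(f"base rate = {k / n:.2f}, 95% CI ~ [{low:.2f}, {high:.2f}]")
```

Comparing such intervals between base and RLHF-finetuned variants of the same architecture would indicate whether the finetuning regimen measurably suppresses (or inflates) the self-report rate.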

References

Because current frontier systems are explicitly trained to deny consciousness, it remains unclear what the underlying base rate of such self-reports would be in systems that were otherwise identical but without this specific finetuning regimen.

Large Language Models Report Subjective Experience Under Self-Referential Processing (2510.24797 - Berg et al., 27 Oct 2025) in Section 5, Discussion and Conclusion — Subsection “Limitations and Open Questions”