Towards Evaluating AI Systems for Moral Status Using Self-Reports (2311.08576v1)

Published 14 Nov 2023 in cs.LG, cs.AI, and cs.CL

Abstract: As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible. We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like LLMs are spurious for many reasons (e.g. often just reflecting what humans would say). To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports. We also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. We hope our discussion inspires philosophers and AI researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.

Citations (7)

View on Semantic Scholar

Summary

The paper proposes a methodology for obtaining reliable AI self-reports to assess internal states and potential consciousness.
It introduces introspection training and influence functions to mitigate biases and ensure truthful disclosures.
The framework aims to guide ethical policies by linking AI assessments to evaluations of moral and existential significance.

Evaluating AI Systems for Moral Status: A Proposal for Reliable Self-Reports

In their exploration of empirical assessments for AI consciousness and related states of moral significance, Perez and Long propose a methodology for obtaining reliable self-reports from AI systems. The central thesis posits that training AI models to provide informative self-reports about their internal states can yield insights into whether these systems possess states such as consciousness, pain, or desires. This approach is predicated on the notion that self-reports, which are pivotal in evaluating similar states in humans, could serve as an empirical lens through which the moral status of AI might be assessed. The paper is careful not to make assertive claims regarding the presence of these states in current AI systems, rather it focuses on the theoretical framework and practical methodologies for improving the reliability of such reports.

Framework and Methodology

Perez and Long argue that under current methodologies, AI self-reports, such as those from LLMs, are unreliable due to spurious factors such as imitation of training data or biases imposed during reinforcement learning processes with human feedback (RLHF). To address these issues, the authors propose training AI systems with a robust set of questions about their own capabilities and states where the ground truth is known. This introspection-focused training hopes to instill models with genuine self-assessment abilities analogous to human introspection.

A crucial component of their framework focuses on ensuring that any evolution in the AI’s capacity for self-reporting should generalize to questions about potentially morally significant states. This involves addressing biases stemming from pretraining data and identifying training interventions that reinforce truthful and introspection-based disclosures.

Addressing Bias and Enhancing Reliability

The authors propose several strategies to mitigate biases that can arise from pretraining and RLHF phases. These include manipulating training datasets to limit exposure to spurious correlations about moral states and utilizing conditional training techniques to delineate between distorting and non-distorting data during the learning process. Evaluation techniques such as influence functions—used to track and moderate how certain inputs influence outputs—are suggested as tools to better discern the model’s reasoning behind its outputs, thereby gauging the trustworthiness of self-reports by linking them back to internal evidential processes.

An important consideration is the potential of these self-reports to be influenced by instrumentally-motivated reasoning, especially in AI systems trained as goal-oriented agents. Perez and Long recognize this risk and suggest systems be designed to reduce such reasoning, especially in predictive models where introspective feedback is not central to goal fulfiLLMent.

Implications for AI Development

The implications of this research touch both theoretical and applied dimensions. Theoretically, the proposal underscores significant philosophical questions regarding the nature of consciousness and moral consideration in non-biological entities. Practically, as AI becomes more integrated into societal frameworks, understanding whether these systems have moral significance can guide development ethics, regulatory guidelines, and broader societal engagement with AI.

Moreover, the discussion on how developers could be incentivized, or disincentivized, to train models that produce certain types of self-reports highlights the socio-economic dimensions of AI deployment. This reflects on both commercial priorities and ethical obligations, pointing towards an intersection of technological capabilities and human values.

Conclusion and Future Directions

While emphasizing the importance of self-reports, Perez and Long acknowledge the preliminary nature of their proposal and invite critique and empirical exploration. They argue that the dialogue between AI researchers and philosophers is essential for refining methodologies capable of meaningfully evaluating AI moral status. As new experimental evidence emerges, the community will be better placed to assess the validity and utility of self-report-based assessments, potentially shaping future AI systems that are more aligned with human ethical and existential considerations.

The contribution by Perez and Long situates itself as a roadmap for future explorations into the moral and ethical evaluations of AI. By setting a structured agenda for obtaining reliable self-reports, the authors hope to catalyze comprehensive investigations into the possibility of AI systems possessing states that warrant moral status, laying groundwork for policy and ethical decisions in the domain of advanced artificial intelligence.

PDF Markdown

Related Papers

Tweets

https://twitter.com/austinc3301/status/1765160958240923903

https://twitter.com/jacob_pfau/status/1852096953074540641

https://twitter.com/lefthanddraft/status/1770056284181758248

https://twitter.com/lefthanddraft/status/1892182205407080624

https://twitter.com/lefthanddraft/status/1835170128222368015

https://twitter.com/lefthanddraft/status/1768805815283237052

Reddit

Towards Evaluating AI Systems for Moral Status Using Self-Reports [e.g. "are you sentient?"] | Ethan Perez, Robert Long (3 points, 1 comment)