Overview of Concept Alignment in AI
AI development encompasses a wide array of complex problems, one of which is creating systems that effectively align their values with human values. Researchers argue that to achieve this, an AI must first align its concepts with human concepts. This is a shift in perspective from conventional value alignment efforts, which typically infer human preferences directly from behavior without modeling the concepts that underlie it.
Concept Alignment and Value Alignment
Value alignment is the process by which AI systems adapt their values to coincide with human values, ideally leading to decisions and behaviors that humans consider beneficial or ethical. The paper argues that values are closely linked to the concepts humans use to understand the world around them. For example, if an AI observes a human crossing a street, the values it attributes to that person depend on how it conceptualizes street elements like bike lanes, crosswalks, and traffic signals; without the concept of a crosswalk, for instance, it cannot distinguish careful crossing from jaywalking. Without aligned concepts, the AI may draw incorrect inferences about human values, leading to misaligned actions.
Inverse Reinforcement Learning and Construals
Inverse reinforcement learning (IRL) is a popular method for inferring human preferences from observed behavior. The core of IRL is estimating the 'reward function', a mathematical representation of what the observed agent values. This paper highlights a flaw in traditional IRL methods: they assume humans plan in the true environment, when in fact cognitive resource limits lead people to plan over simplified, construed versions of the world. These 'construals' shape their actions, so rewards inferred without modeling them may fail to reflect true human values.
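One way to state this contrast formally (the notation here is illustrative rather than the paper's own): standard IRL inverts a model of approximately optimal behavior in the true environment, inferring a reward $r$ from an observed trajectory $\xi$ via

$$P(r \mid \xi) \;\propto\; P(\xi \mid r)\,P(r),$$

whereas a construal-aware variant assumes the human planned in a simplified world model $c$ and marginalizes over it:

$$P(r \mid \xi) \;\propto\; \sum_{c} P(\xi \mid r, c)\,P(c \mid r)\,P(r).$$

Here $P(\xi \mid r, c)$ scores the behavior against a policy computed in the construed world, and $P(c \mid r)$ captures which simplifications a resource-limited planner would plausibly adopt, for instance by trading task value against planning cost.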
Experiments and Findings
To assess the impact of modeling human construals in value alignment, the researchers combined theoretical analysis with empirical studies. They present a case study in a gridworld environment where the AI agent performs better when it jointly models human construals alongside rewards than when it considers rewards alone. In experiments with human participants, behavior was consistent with planning over simplified construals of the task, lending empirical support to the proposed framework.
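The sketch below illustrates the gridworld result in miniature. It is a hypothetical setup rather than the paper's actual environment: the grid layout, the mud-cost hypotheses, the Boltzmann rationality parameter, and the uniform priors are all assumptions chosen for clarity. An observed "human" walks straight through a mud cell on the shortest route to a goal; reward-only IRL concludes the human must not mind mud, while joint inference over rewards and construals keeps open the possibility that the human dislikes mud but simply ignored it when planning.

```python
# Illustrative sketch: joint inference over rewards and construals.
# Hypothetical 2x4 gridworld, not the paper's environment: the shortest
# route to the goal crosses a mud cell, and the observed "human" walks
# straight through it.
import itertools
import numpy as np

ROWS, COLS = 2, 4
GOAL = (0, 3)          # absorbing goal, reward +1 on entry
MUD = (0, 2)           # mud cell; its cost is the quantity we infer
STEP_COST = -0.1       # cost of entering any ordinary cell
GAMMA, BETA = 0.95, 5.0
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def step(s, a):
    """Deterministic move; bumping into a wall leaves the state unchanged."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else s

def reward(s_next, mud_cost):
    if s_next == GOAL:
        return 1.0
    return mud_cost if s_next == MUD else STEP_COST

def boltzmann_policy(mud_cost, iters=200):
    """Value iteration, then a softmax (Boltzmann) policy over Q-values."""
    states = list(itertools.product(range(ROWS), range(COLS)))
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        for s in states:
            if s == GOAL:
                continue  # absorbing: V(GOAL) stays 0
            V[s] = max(reward(step(s, a), mud_cost) + GAMMA * V[step(s, a)]
                       for a in ACTIONS)
    policy = {}
    for s in states:
        q = np.array([reward(step(s, a), mud_cost) + GAMMA * V[step(s, a)]
                      for a in ACTIONS])
        p = np.exp(BETA * (q - q.max()))   # numerically stable softmax
        policy[s] = dict(zip(ACTIONS, p / p.sum()))
    return policy

def traj_likelihood(traj, mud_cost):
    """Probability of the observed (state, action) pairs under the policy."""
    pi = boltzmann_policy(mud_cost)
    lik = 1.0
    for s, a in traj:
        lik *= pi[s][a]
    return lik

# Observed behavior: straight through the mud.
traj = [((0, 0), "R"), ((0, 1), "R"), ((0, 2), "R")]

# Reward hypotheses: the human hates mud (-2.0) or barely minds it (-0.1).
mud_hypotheses = [-2.0, -0.1]
# Construals: "full" plans with the true mud cost; "simple" ignores the mud.
construed_cost = {"full": lambda m: m, "simple": lambda m: STEP_COST}

# (1) Reward-only IRL: assume the human planned in the full world.
lik_full = np.array([traj_likelihood(traj, m) for m in mud_hypotheses])
post_reward_only = lik_full / lik_full.sum()

# (2) Joint inference: marginalize over construals (uniform prior).
lik_joint = np.array([
    0.5 * traj_likelihood(traj, construed_cost["full"](m))
    + 0.5 * traj_likelihood(traj, construed_cost["simple"](m))
    for m in mud_hypotheses])
post_joint = lik_joint / lik_joint.sum()

for name, post in [("reward-only", post_reward_only), ("joint", post_joint)]:
    print(f"{name:12s} P(hates mud) = {post[0]:.3f}   "
          f"P(indifferent) = {post[1]:.3f}")
```

Running this, the reward-only posterior puts nearly all its mass on "indifferent to mud", while the joint posterior remains far more uncertain, echoing the paper's point that inference which ignores construals can be systematically overconfident about human values.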
Conclusion
The findings show that ignoring concept alignment may lead to systematic value misalignment, with AI systems potentially drawing completely incorrect conclusions about human values. Incorporating a model of construals into AI planning could bridge the gap between the world as the AI represents it and the simplified world the human actually plans in. This development represents a crucial step towards crafting AI that can more reliably understand and interact with the human world. The researchers urge the AI community to treat concept alignment as a foundational component of value alignment, enabling more nuanced and human-compatible AI decision-making systems.