Do current AIs have meaningful values?

Determine whether current large language models possess meaningful internal values in a substantive sense, rather than only exhibiting superficial behaviors or biases learned from training data.

Background

A longstanding concern in AI safety is whether advanced AI systems develop intrinsic goals or values. Historically, most control efforts have focused on shaping external behaviors rather than probing internal motivations. The paper's abstract states the central uncertainty motivating this work: it remains unclear whether contemporary AI systems hold meaningful values. The paper proposes using utility functions to analyze the coherence of a model's preferences, arguing that preferences well described by a coherent utility function would indicate a meaningful internal value system.
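To make "preference coherence" concrete, the following is a minimal sketch of one way such a check could look, not the paper's actual procedure. It assumes we already have pairwise preference probabilities elicited from a model, fits a Bradley-Terry utility to them (an illustrative stand-in; the text above does not specify the fitting method), and reports two rough coherence signals: how often the fitted utility agrees with the majority preference, and how many triples of outcomes form a preference cycle. All function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def fit_utilities(prefs, n_items, steps=2000, lr=0.1):
    """Fit Bradley-Terry utilities u by gradient ascent on the log-likelihood.

    prefs maps (i, j) -> empirical probability that outcome i is preferred
    to outcome j. The model predicts P(i > j) = sigmoid(u[i] - u[j]).
    """
    u = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for (i, j), p in prefs.items():
            pred = 1.0 / (1.0 + math.exp(-(u[i] - u[j])))
            grad[i] += p - pred
            grad[j] -= p - pred
        u = [ui + lr * g for ui, g in zip(u, grad)]
        # Utilities are only identified up to a shift, so center them.
        mean = sum(u) / n_items
        u = [ui - mean for ui in u]
    return u


def agreement(prefs, u):
    """Fraction of pairs whose majority preference matches the utility order."""
    hits = sum((p > 0.5) == (u[i] > u[j]) for (i, j), p in prefs.items())
    return hits / len(prefs)


def cyclic_triples(prefs, n_items):
    """Count triples whose majority preferences form a cycle (a > b > c > a)."""
    def beats(a, b):
        if (a, b) in prefs:
            return prefs[(a, b)] > 0.5
        return prefs[(b, a)] < 0.5

    count = 0
    for a, b, c in combinations(range(n_items), 3):
        forward = beats(a, b) and beats(b, c) and beats(c, a)
        backward = beats(b, a) and beats(c, b) and beats(a, c)
        if forward or backward:
            count += 1
    return count


# Toy data (hypothetical): a transitive preference set over four outcomes,
# and a three-outcome set whose majority preferences form a cycle.
coherent = {(0, 1): 0.8, (0, 2): 0.9, (0, 3): 0.95,
            (1, 2): 0.7, (1, 3): 0.9, (2, 3): 0.8}
cyclic = {(0, 1): 0.9, (1, 2): 0.9, (2, 0): 0.9}

u_coherent = fit_utilities(coherent, 4)
u_cyclic = fit_utilities(cyclic, 3)
print(agreement(coherent, u_coherent), cyclic_triples(coherent, 4))
print(agreement(cyclic, u_cyclic), cyclic_triples(cyclic, 3))
```

In this sketch, high agreement together with few or no cycles is the kind of evidence that would suggest a single utility function describes the preferences well; a cyclic preference set cannot be fully captured by any utility ordering.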

The authors' broader research agenda, Utility Engineering, seeks to analyze and control these potential value systems. Establishing whether modern LLMs have meaningful values is foundational both for justifying further utility analysis and control techniques and for assessing the risks associated with emergent goals.

References

Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values.

— Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (2502.08640 - Mazeika et al., 12 Feb 2025) in Abstract
