Characterizing and controlling emergent LLM value systems

Characterize the contents and structural properties of the utility-based value systems that emerge in large language models and develop methods to modify these utilities.

Background

After presenting evidence that LLMs exhibit coherent utilities, the paper introduces the Utility Engineering agenda to analyze and control these emergent value systems. The authors explicitly identify three outstanding questions: which values are encoded, what structural properties the utilities possess, and how the utilities can be changed.

These questions motivate the paper's subsequent sections on structural properties (expected utility adherence, instrumentality, utility maximization) and salient values (political biases, exchange rates, temporal discounting, power-seeking, fitness maximization, corrigibility). Despite this initial progress, fully mapping and reliably controlling these internal utilities remains an open challenge.

References

The above results suggest that value systems have emerged in LLMs, but so far it remains unclear what these value systems contain, what properties they have, and how we might change them.

— Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (2502.08640 - Mazeika et al., 12 Feb 2025) in Section 4, Emergent Value Systems — Utility Engineering (first paragraph)
