
Identifying Human Values for AI Alignment

Ascertain which specific human values large language models and AI Agents should be aligned to, resolving the ambiguity about normative alignment targets that lie beyond minimal principles such as helpfulness, honesty, accuracy, and harmlessness.


Background

The authors describe contemporary alignment practices, including reinforcement learning from human or AI feedback, that tune models toward helpfulness, honesty, and harmlessness. They note that while these minimal alignment objectives are commonly accepted, the broader question of which values AI systems should reflect remains unsettled.
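To make concrete what these practices optimize, the following is a standard KL-regularized objective used in RLHF/RLAIF-style fine-tuning. It is a generic sketch for illustration, not a formulation taken from the paper; the reward model $r_\phi$, reference policy $\pi_{\mathrm{ref}}$, and coefficient $\beta$ are assumed notation.

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \,\big]
$$

Here $r_\phi$ is trained on human or AI preference judgments intended to encode helpfulness, honesty, and harmlessness, and the KL term keeps the tuned policy $\pi_\theta$ close to the pretrained reference model. The open question above concerns which values, beyond these minimal ones, the preference data and reward signal should encode.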

This uncertainty matters for responsibly deploying AI Agents in complex, dynamic environments, where alignment objectives need clear normative grounding to avoid harmful or biased behaviors while preserving usefulness.

References

Although it is unclear as to what 'values' should be aligned to, it is generally accepted that a minimal alignment should involve following instructions, being helpful, honest and accurate, and harmless, where harmless means avoiding providing users the means to harm others (e.g., do not provide instructions on how to make bombs, conduct illegal activities, etc.).

Responsible AI Agents (2502.18359 - Desai et al., 25 Feb 2025) in Section II.C (Value Alignment Limits Undesired LLM Outputs and AI Agent Actions)