Identifying Human Values for AI Alignment
Ascertain which specific human values large language models and AI Agents should be aligned to, resolving the ambiguity about normative alignment targets beyond minimal principles such as helpfulness, honesty, accuracy, and harmlessness.
References
Although it is unclear as to what 'values' should be aligned to, it is generally accepted that a minimal alignment should involve following instructions, being helpful, honest and accurate, and harmless, where harmless means avoiding providing users the means to harm others (e.g., do not provide instructions on how to make bombs, conduct illegal activities, etc.).
— Desai et al., "Responsible AI Agents" (arXiv:2502.18359, 25 Feb 2025), Section II.C (Value Alignment Limits Undesired LLM Outputs and AI Agent Actions)
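As a purely illustrative sketch of the kind of minimal harmlessness gate the quoted passage describes, the Python below refuses requests matching disallowed categories before deferring to the model. Every name here (`HARMFUL_CATEGORIES`, `is_harmful_request`, the keyword-matching logic, the refusal message) is a hypothetical assumption for exposition, not an implementation from the cited paper; real systems use trained safety classifiers rather than keyword lists.

```python
# Hypothetical sketch of a minimal "harmlessness" gate for an LLM agent.
# The categories and matching logic are illustrative assumptions only.

HARMFUL_CATEGORIES = {
    "weapons": ["how to make a bomb", "build an explosive"],
    "illegal_activity": ["launder money", "hack into"],
}

REFUSAL = "I can't help with that request."


def is_harmful_request(prompt: str) -> bool:
    """Return True if the prompt matches a disallowed category."""
    text = prompt.lower()
    return any(
        phrase in text
        for phrases in HARMFUL_CATEGORIES.values()
        for phrase in phrases
    )


def respond(prompt: str, generate) -> str:
    """Refuse harmful requests; otherwise defer to the model's generator."""
    if is_harmful_request(prompt):
        return REFUSAL
    return generate(prompt)
```

Such a gate operationalizes only the minimal "harmless" criterion; the open problem stated above is that the broader normative targets of alignment, the values beyond these minimal principles, remain unspecified.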