Measuring Value Alignment (2312.15241v1)
Abstract: As AI systems become increasingly integrated into various domains, ensuring that they align with human values is critical. This paper introduces a novel formalism for quantifying the alignment between AI systems and human values, using Markov Decision Processes (MDPs) as the foundational model. We treat values as desirable goals tied to actions and norms as behavioral guidelines, and examine how both can be used to guide AI decisions. The framework evaluates the degree of alignment between a norm and a value by assessing preference changes across state transitions in a normative world. Using this formalism, AI developers and ethicists can better design and evaluate AI systems to ensure they operate in harmony with human values. The proposed methodology holds potential for a wide range of applications, from recommendation systems that emphasize well-being to autonomous vehicles that prioritize safety.
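To make the abstract's mechanism concrete, here is a minimal sketch of that idea: a toy MDP, a preference function standing in for a value, a norm that restricts actions, and an alignment score computed as the average preference change across transitions taken under norm-compliant behavior. The abstract does not give the paper's exact measure, so this is an assumed reading, not the authors' implementation; all identifiers (`preference`, `norm_allows`, `alignment`) and the toy world are illustrative.

```python
import random

# Hedged sketch: measuring how well a norm aligns with a value in a toy
# MDP-like "normative world". Assumption (not from the paper): alignment is
# the mean preference change per transition under norm-compliant behavior.

STATES = range(5)       # toy state space: 0 (worst) .. 4 (best)
ACTIONS = [-1, 0, +1]   # move down, stay, move up

def step(state, action):
    """Deterministic toy transition: clamp the result to the state space."""
    return max(0, min(4, state + action))

def preference(state):
    """Preference function for a value (e.g., well-being): higher is better."""
    return state / 4.0

def norm_allows(state, action):
    """A norm as a behavioral guideline: forbid actions that lower the state."""
    return step(state, action) >= state

def alignment(n_paths=1000, path_len=10, seed=0):
    """Average preference change per transition over norm-compliant paths.

    Positive scores mean the norm tends to move the world toward states the
    value prefers; scores near zero or below indicate weak or no alignment.
    """
    rng = random.Random(seed)
    total, transitions = 0.0, 0
    for _ in range(n_paths):
        s = rng.choice(list(STATES))
        for _ in range(path_len):
            allowed = [a for a in ACTIONS if norm_allows(s, a)]
            a = rng.choice(allowed)          # behave randomly within the norm
            s_next = step(s, a)
            total += preference(s_next) - preference(s)
            transitions += 1
            s = s_next
    return total / transitions

if __name__ == "__main__":
    print(f"alignment of norm with value: {alignment():+.3f}")
```

In this toy world the score comes out positive, since the norm blocks every preference-decreasing move; replacing `norm_allows` with a permissive norm (always `True`) drives the score toward zero, illustrating how the measure separates value-aligned norms from neutral ones.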