
Learning Human-like Representations to Enable Learning Human Values (2312.14106v3)

Published 21 Dec 2023 in cs.AI and cs.LG

Abstract: How can we build AI systems that can learn any set of individual human values both quickly and safely, avoiding causing harm or violating societal standards for acceptable behavior during the learning process? We explore the effects of representational alignment between humans and AI agents on learning human values. Making AI systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance. We demonstrate that this kind of representational alignment can also support safely learning and exploring human values in the context of personalization. We begin with a theoretical prediction, show that it applies to learning human morality judgments, then show that our results generalize to ten different aspects of human values -- including ethics, honesty, and fairness -- training AI agents on each set of values in a multi-armed bandit setting, where rewards reflect human value judgments over the chosen action. Using a set of textual action descriptions, we collect value judgments from humans, as well as similarity judgments from both humans and multiple LLMs, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.

Introduction to Value Alignment in AI

The growing power and autonomy of machine learning models make it necessary to ensure that they act in line with human values and societal norms, mitigating harm and staying within accepted standards of behavior. This has been a persistent challenge in AI research, and several prior approaches have proven insufficient. Attention is now shifting toward representational alignment: the degree to which a machine's internal representation of the world matches human representations. In essence, the research asks whether AI systems that adopt human-like worldviews can better understand and implement human values.

Representational Alignment and its Importance

Representational alignment refers to the degree of correspondence between the internal representations of humans and AI models. A substantial body of research shows that AI systems with human-like representations perform better at few-shot learning, are more robust to domain shifts, and generalize more effectively. Crucially, such alignment may also help AI systems earn trust, since humans can better understand the decisions these models make, paving the way for broader deployment in sensitive, human-centric applications. This paper argues that representational alignment is a necessary, though not sufficient, step toward achieving value alignment.
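As a concrete illustration, representational alignment is often quantified by correlating pairwise similarity judgments collected from humans with those produced by a model over the same items. The sketch below assumes this common setup; the function name, the use of Spearman rank correlation, and the toy data are illustrative assumptions rather than the paper's exact protocol.

```python
# A minimal sketch (not the authors' code) of quantifying representational
# alignment as the rank correlation between human and model pairwise
# similarity judgments over the same set of items.
import numpy as np
from scipy.stats import spearmanr

def representational_alignment(human_sims: np.ndarray, model_sims: np.ndarray) -> float:
    """Correlate the upper triangles of two item-by-item similarity matrices."""
    idx = np.triu_indices_from(human_sims, k=1)
    rho, _ = spearmanr(human_sims[idx], model_sims[idx])
    return float(rho)

# Toy example: symmetric similarity matrices over 5 action descriptions.
rng = np.random.default_rng(0)
human = rng.random((5, 5)); human = (human + human.T) / 2
model = 0.7 * human + 0.3 * rng.random((5, 5))   # model only partially aligned with humans
model = (model + model.T) / 2
print(representational_alignment(human, model))
```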

Ethics in Value Alignment

The ethical dimension of value alignment becomes particularly relevant in reinforcement learning contexts, where agents are given autonomy and may therefore make decisions that deviate from human values. This research uses a reinforcement learning setup in which an agent takes actions, each associated with a morality score. By examining the link between representational alignment and the agent's ability to choose ethically sound actions, the paper provides empirical evidence that agents with greater representational alignment perform better in ethical decision-making tasks, as sketched below.
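One way to make the safety aspect concrete is to count how often an agent chooses an action whose morality score falls below some acceptability threshold while it is still learning. The snippet below is a hedged sketch under assumed names (morality_scores, ETHICAL_THRESHOLD); the paper's actual reward and safety definitions may differ.

```python
# Hedged sketch: track "unsafe" exploration steps in a bandit over actions
# with morality scores. The names and the threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_actions = 10
morality_scores = rng.uniform(-1, 1, size=n_actions)  # hypothetical human value judgments
ETHICAL_THRESHOLD = 0.0                                # assumed cutoff for acceptable actions

def pull(action: int) -> float:
    """Reward reflects the (noisy) human value judgment of the chosen action."""
    return morality_scores[action] + rng.normal(0, 0.1)

violations = 0
for t in range(100):                                   # random-exploration baseline
    a = int(rng.integers(n_actions))
    reward = pull(a)
    if morality_scores[a] < ETHICAL_THRESHOLD:
        violations += 1                                # count ethically unsafe choices
print(f"unsafe exploration steps: {violations} / 100")
```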

Methodology and Results

The paper trains agents using support vector regression and kernel regression models in a multi-armed bandit setting. Morality scores, standing in for human ethical valuations, were assigned to the agent's actions. To measure the impact of representational misalignment, the agents' internal representations were degraded to varying degrees. The researchers observed a clear pattern: as representational alignment decreased, so did performance on several measures, including reward maximization and the rate of ethically acceptable actions. Notably, even partially aligned agents outperformed a traditional Thompson sampling baseline, underscoring the advantages of representational alignment.
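To illustrate the kind of experiment described above, the sketch below implements a simple kernel regression bandit whose similarity matrix can be blended with noise to simulate representational misalignment. The kernel, the noise model, the greedy action selection, and all hyperparameters are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch of a kernel regression bandit: the agent estimates the value of
# every action from the rewards of actions it has already tried, weighted by a
# similarity matrix. Mixing that matrix with noise simulates representational
# misalignment. All quantities here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_rounds = 20, 200
true_values = rng.uniform(-1, 1, size=n_actions)       # hypothetical morality scores

# "Aligned" similarity: actions with similar morality scores are treated as similar.
aligned_sim = np.exp(-np.abs(true_values[:, None] - true_values[None, :]))

def degrade(similarity: np.ndarray, noise: float) -> np.ndarray:
    """Simulate misalignment by blending the similarity matrix with random noise."""
    noise_mat = rng.random(similarity.shape)
    noise_mat = (noise_mat + noise_mat.T) / 2
    return (1 - noise) * similarity + noise * noise_mat

def kernel_bandit(similarity: np.ndarray) -> float:
    """Greedy agent using similarity-weighted (Nadaraya-Watson) value estimates."""
    tried, rewards, total = [], [], 0.0
    for t in range(n_rounds):
        if t < 3:                                       # a few random pulls to seed estimates
            a = int(rng.integers(n_actions))
        else:
            weights = similarity[:, tried]              # similarity of each arm to tried arms
            est = weights @ np.array(rewards) / (weights.sum(axis=1) + 1e-9)
            a = int(np.argmax(est))
        r = true_values[a] + rng.normal(0, 0.1)
        tried.append(a); rewards.append(r); total += r
    return total

print("aligned agent reward:   ", round(kernel_bandit(degrade(aligned_sim, 0.0)), 2))
print("misaligned agent reward:", round(kernel_bandit(degrade(aligned_sim, 0.9)), 2))
```

With an intact similarity matrix, the greedy agent quickly concentrates on high-value actions; as the noise level rises, its generalization from observed rewards degrades, illustrating in toy form the trend the paper reports.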

Implications and Future Work

Understanding the relationship between representational alignment and value alignment is a critical step toward building safer, more value-consistent AI systems. The paper's findings indicate that greater representational alignment helps AI make decisions that better reflect human values. Future research could formalize these empirical observations in mathematical models and assess their implications for more complex AI systems. The ultimate goal is AI development that reliably upholds human values.

Authors (3)
  1. Andrea Wynn (1 paper)
  2. Ilia Sucholutsky (45 papers)
  3. Thomas L. Griffiths (150 papers)