Incentive Compatibility for AI Alignment in Sociotechnical Systems: Positions and Prospects (2402.12907v2)
Abstract: The growing integration of AI into human society has significant implications for societal governance and safety. Although considerable progress has been made on AI alignment, existing methodologies focus primarily on technical facets and often neglect the sociotechnical nature of AI systems, which can lead to misalignment between development and deployment contexts. To this end, we pose a new problem worth exploring: the Incentive Compatibility Sociotechnical Alignment Problem (ICSAP). We hope this calls on more researchers to explore how the principles of Incentive Compatibility (IC) from game theory can bridge the gap between technical and societal components and maintain consensus between AI and human societies across different contexts. We further discuss three classical game-theoretic problems for achieving IC: mechanism design, contract theory, and Bayesian persuasion, examining their perspectives, potential, and challenges for solving ICSAP, and we provide preliminary conceptions of how each could be implemented.
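The abstract invokes incentive compatibility without defining it formally. For readers unfamiliar with the term, the sketch below illustrates the idea with the textbook example of a dominant-strategy IC mechanism, the sealed-bid second-price (Vickrey) auction, in which reporting one's true valuation is always a best response. This is only an illustrative sketch of the general concept, not a mechanism proposed in the paper; all function names and parameters here are hypothetical.

```python
import random

def second_price_auction(bids):
    """Highest bidder wins and pays the second-highest bid (ties broken by index)."""
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    return order[0], bids[order[1]]

def utility(values, bids, agent):
    """Quasi-linear utility: value minus price if `agent` wins, zero otherwise."""
    winner, price = second_price_auction(bids)
    return values[agent] - price if winner == agent else 0.0

# Empirical check of dominant-strategy IC: for random valuations, no unilateral
# misreport by agent 0 ever yields higher utility than bidding its true value.
random.seed(0)
for _ in range(1000):
    values = [round(random.uniform(0.0, 10.0), 2) for _ in range(3)]
    truthful_bids = list(values)                       # every agent reports its true value
    u_truthful = utility(values, truthful_bids, agent=0)
    for misreport in (x / 10.0 for x in range(101)):   # grid of deviations over [0, 10]
        deviated_bids = list(truthful_bids)
        deviated_bids[0] = misreport
        assert utility(values, deviated_bids, agent=0) <= u_truthful + 1e-9
print("No misreport improved on truthful bidding (incentive compatibility holds).")
```

The design choice this toy example highlights is the one the paper builds on: rather than relying on agents to behave well, the rules of the interaction are arranged so that truthful, cooperative behavior is each agent's best strategy.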
Authors: Zhaowei Zhang, Fengshuo Bai, Mingzhi Wang, Haoyang Ye, Chengdong Ma, Yaodong Yang