Principal-Agent Reward Shaping in MDPs (2401.00298v1)
Abstract: Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has studied principal-agent problems extensively, and recent work has extended them to more complex settings such as Markov Decision Processes (MDPs). In this paper, we continue this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game in which the principal and the agent have different reward functions, and the agent chooses an MDP policy that both players then follow. The principal offers the agent an additional reward, and the agent selfishly picks the policy that maximizes its own reward, namely the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and provide polynomial-time approximation algorithms for two classes of instances: stochastic trees and deterministic decision processes with a finite horizon.
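To make the interaction concrete, the sketch below walks through the game the abstract describes on a toy instance: the principal commits to a bonus schedule, the agent best-responds to the sum of its own reward and the bonus by backward induction, and the principal's utility is its own reward along the induced trajectory net of the bonuses paid. This is a minimal illustration under assumed dynamics and rewards (the states, numbers, and the treatment of the budget are all hypothetical), not the paper's algorithm.

```python
# A minimal sketch of the Stackelberg interaction on a toy deterministic,
# finite-horizon instance. All states, rewards, and the bonus schedule are
# illustrative assumptions, not the paper's construction.

# Deterministic transitions: (state, action) -> next state.
T = {("s0", "a"): "s1", ("s0", "b"): "s2",
     ("s1", "a"): "s3", ("s1", "b"): "s3",
     ("s2", "a"): "s3", ("s2", "b"): "s3"}
ACTIONS = ["a", "b"]
HORIZON = 2

# The two players value the same state-action pairs differently.
r_agent = {("s0", "a"): 1.0, ("s0", "b"): 0.9,
           ("s1", "a"): 0.0, ("s1", "b"): 0.0,
           ("s2", "a"): 0.0, ("s2", "b"): 0.0}
r_principal = {("s0", "a"): 0.0, ("s0", "b"): 5.0,
               ("s1", "a"): 0.0, ("s1", "b"): 0.0,
               ("s2", "a"): 0.0, ("s2", "b"): 0.0}

def agent_best_response(bonus):
    """Agent's optimal policy for the shaped reward r_agent + bonus,
    computed by backward induction over the finite horizon."""
    policy, memo = {}, {}
    def value(s, t):
        if t == HORIZON or (s, ACTIONS[0]) not in T:  # horizon or leaf
            return 0.0
        if (s, t) not in memo:
            scores = {a: r_agent[(s, a)] + bonus.get((s, a), 0.0)
                         + value(T[(s, a)], t + 1) for a in ACTIONS}
            policy[(s, t)] = max(scores, key=scores.get)
            memo[(s, t)] = scores[policy[(s, t)]]
        return memo[(s, t)]
    value("s0", 0)
    return policy

def principal_utility(policy, bonus):
    """Principal's reward along the agent's trajectory, net of bonuses paid."""
    s, t, total = "s0", 0, 0.0
    while t < HORIZON and (s, ACTIONS[0]) in T:
        a = policy[(s, t)]
        total += r_principal[(s, a)] - bonus.get((s, a), 0.0)
        s, t = T[(s, a)], t + 1
    return total

# Unshaped, the agent prefers ("s0", "a") (1.0 > 0.9) and the principal nets 0.
print(principal_utility(agent_best_response({}), {}))        # 0.0
# A bonus of 0.2 on ("s0", "b") (feasible for any budget B >= 0.2) flips the
# agent's choice (0.9 + 0.2 > 1.0), netting the principal 5.0 - 0.2 = 4.8.
bonus = {("s0", "b"): 0.2}
print(principal_utility(agent_best_response(bonus), bonus))  # 4.8
```

A strictly dominant bonus sidesteps tie-breaking here; how ties are resolved and how the budget constrains the bonus schedule are modeling choices of the paper that this toy example does not capture.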
Authors:
- Omer Ben-Porat
- Yishay Mansour
- Michal Moshkovitz
- Boaz Taitler