Trajectory-wise Iterative Reinforcement Learning Framework for Auto-bidding (2402.15102v2)

Published 23 Feb 2024 in cs.LG, cs.AI, cs.GT, and cs.IR

Abstract: In online advertising, advertisers participate in ad auctions to acquire ad opportunities, often by utilizing auto-bidding tools provided by demand-side platforms (DSPs). Current auto-bidding algorithms typically employ reinforcement learning (RL). However, due to safety concerns, most RL-based auto-bidding policies are trained in simulation, leading to performance degradation when deployed in online environments. To narrow this gap, we can deploy multiple auto-bidding agents in parallel to collect a large interaction dataset. Offline RL algorithms can then be utilized to train a new policy. The trained policy can subsequently be deployed for further data collection, resulting in an iterative training framework, which we refer to as iterative offline RL. In this work, we identify the performance bottleneck of this iterative offline RL framework, which originates from the ineffective exploration and exploitation caused by the inherent conservatism of offline RL algorithms. To overcome this bottleneck, we propose Trajectory-wise Exploration and Exploitation (TEE), which introduces a novel data collection and data utilization method for iterative offline RL from a trajectory perspective. Furthermore, to ensure the safety of online exploration while preserving dataset quality for TEE, we propose Safe Exploration by Adaptive Action Selection (SEAS). Both offline experiments and real-world experiments on the Alibaba display advertising platform demonstrate the effectiveness of our proposed method.
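The iterative loop the abstract describes (deploy agents in parallel, collect a dataset, train a new policy offline, redeploy) can be summarized in a short sketch. The Python below is a minimal illustration under assumed interfaces: collect_trajectory, train_offline_rl, and the toy policy are hypothetical placeholders, not the paper's implementation; a real system would plug in an actual bidding environment and a conservative offline RL algorithm such as CQL or IQL.

```python
# A minimal, self-contained sketch of the iterative offline RL loop described
# above. The environment, policy, and trainer are hypothetical stand-ins for
# illustration only, not the paper's actual components.
import random

def collect_trajectory(policy, horizon=10):
    # Stand-in for one auto-bidding episode: a list of (state, action, reward).
    trajectory = []
    for _ in range(horizon):
        state = random.random()   # e.g., remaining budget, time step
        action = policy(state)    # e.g., a bid adjustment
        reward = random.random()  # e.g., realized ad value
        trajectory.append((state, action, reward))
    return trajectory

def train_offline_rl(dataset):
    # Stand-in for an offline RL algorithm (e.g., CQL or IQL) fit on the
    # logged dataset; returns the newly trained policy.
    return lambda state: state

def iterative_offline_rl(initial_policy, iterations=3, num_agents=4):
    policy, dataset = initial_policy, []
    for _ in range(iterations):
        # Deploy multiple agents in parallel to enlarge the interaction dataset.
        for _ in range(num_agents):
            dataset.append(collect_trajectory(policy))
        # Retrain on all data collected so far, then redeploy the new policy
        # for the next round of collection.
        policy = train_offline_rl(dataset)
    return policy

final_policy = iterative_offline_rl(initial_policy=lambda s: s)
```

The paper's contributions (TEE and SEAS) target the data-collection and data-utilization steps of this loop, which is where the conservatism of the offline trainer would otherwise limit exploration.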
