Hedging using reinforcement learning: Contextual $k$-Armed Bandit versus $Q$-learning (2007.01623v2)
Abstract: The construction of replication strategies for contingent claims in the presence of risk and market friction is a key problem of financial engineering. In real markets, continuous replication, such as in the model of Black, Scholes and Merton (BSM), is not only unrealistic but also undesirable due to high transaction costs. A variety of methods have been proposed to balance effective replication against losses in the incomplete-market setting. With the rise of AI, AI-based hedgers have attracted considerable interest, with particular attention given to recurrent neural network systems and variants of the $Q$-learning algorithm. From a practical point of view, sufficient samples for training such an AI can only be obtained from a simulator of the market environment. Yet if an agent is trained solely on simulated data, its run-time performance will primarily reflect the accuracy of the simulation, which leads to the classical problem of model choice and calibration. In this article, the hedging problem is viewed as an instance of a risk-averse contextual $k$-armed bandit problem, a choice motivated by the simplicity and sample-efficiency of the architecture, which also allows for realistic online model updates from real-world data. We find that the $k$-armed bandit model naturally fits the Profit-and-Loss formulation of hedging, providing a more accurate and sample-efficient approach than $Q$-learning and reducing to the Black-Scholes model in the absence of transaction costs and risks.
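To make the bandit formulation concrete, here is a minimal sketch of one plausible reading of the setup: the context is a coarse bucketing of (spot, time-to-maturity), each arm is a discretized hedge ratio, the reward is the one-period Profit-and-Loss of the hedged position net of transaction costs, and arms are scored by a mean-variance criterion with epsilon-greedy exploration. Every name and constant below (`context_bucket`, `select_arm`, `update`, `LAMBDA`, `EPS`, the grid sizes) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of a risk-averse contextual k-armed bandit hedger
# (illustrative assumptions throughout; not the paper's implementation).
# Context: bucketed (spot, time-to-maturity). Arms: discretized hedge
# ratios. Reward: one-period P&L of the hedged position net of costs.
# Arm score: mean - LAMBDA * variance (mean-variance risk aversion),
# explored epsilon-greedily.

K = 21                                   # arms: hedge ratios 0.0, 0.05, ..., 1.0
HEDGE_RATIOS = np.linspace(0.0, 1.0, K)
LAMBDA = 1.0                             # risk-aversion weight on P&L variance
EPS = 0.1                                # exploration rate
N_SPOT, N_TAU = 10, 10                   # context grid resolution

# Running P&L statistics per (context cell, arm): count, mean, M2 (Welford).
count = np.zeros((N_SPOT, N_TAU, K))
mean = np.zeros((N_SPOT, N_TAU, K))
m2 = np.zeros((N_SPOT, N_TAU, K))

def context_bucket(spot, tau, s_lo=50.0, s_hi=150.0, tau_max=1.0):
    """Map raw (spot, time-to-maturity) to a discrete context cell."""
    i = int(np.clip((spot - s_lo) / (s_hi - s_lo) * N_SPOT, 0, N_SPOT - 1))
    j = int(np.clip(tau / tau_max * N_TAU, 0, N_TAU - 1))
    return i, j

def select_arm(ctx, rng):
    """Epsilon-greedy over the mean-variance score of each hedge ratio."""
    if rng.random() < EPS:
        return int(rng.integers(K))
    i, j = ctx
    n = count[i, j]
    var = np.where(n > 0, m2[i, j] / np.maximum(n, 1), 0.0)
    # Unvisited arms get +inf so each is tried at least once.
    score = np.where(n > 0, mean[i, j] - LAMBDA * var, np.inf)
    return int(np.argmax(score))

def update(ctx, arm, pnl):
    """Welford online update of the P&L mean/variance for (context, arm)."""
    i, j = ctx
    count[i, j, arm] += 1
    delta = pnl - mean[i, j, arm]
    mean[i, j, arm] += delta / count[i, j, arm]
    m2[i, j, arm] += delta * (pnl - mean[i, j, arm])
```

Because the learned quantities are plain running moments of realized P&L per (context, arm) cell, the same `update` call can absorb real-world fills online, which illustrates the sample-efficiency argument the abstract makes in favor of the bandit over bootstrapped $Q$-learning targets.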
- T. Anthony, Z. Tian, and D. Barber. Thinking fast and slow with deep learning and tree search. Advances in Neural Information Processing Systems, 30, 2017.
- P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637–654, 1973.
- H. Buehler, L. Gonon, J. Teichmann, and B. Wood. Deep hedging. Quantitative Finance, 19(8):1271–1291, 2019.
- H. Buehler, L. Gonon, J. Teichmann, B. Wood, B. Mohan, and J. Kochems. Deep hedging: Hedging derivatives under generic market frictions using reinforcement learning. Swiss Finance Institute Research Paper, 2019.
- J. Cao, J. Chen, J. Hull, and Z. Poulos. Deep hedging of derivatives using reinforcement learning. The Journal of Financial Data Science, 3(1):10–27, 2021.
- O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24:2249–2257, 2011.
- D. Duffie. Dynamic Asset Pricing Theory. Princeton University Press, 2001.
- X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. Advances in Neural Information Processing Systems, pages 3338–3346, 2014.
- I. Halperin. QLBS: Q-learner in the Black-Scholes(-Merton) worlds. The Journal of Derivatives, 28(1):99–122, 2020.
- M. Hessel, J. Modayil, H. van Hasselt, et al. Rainbow: Combining improvements in deep reinforcement learning. Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- S. Hodges and A. Neuberger. Option replication of contingent claims under transactions costs. Technical Report, Financial Options Research Centre, University of Warwick, 1989.
- P. N. Kolm and G. Ritter. Dynamic replication and hedging: A reinforcement learning approach. The Journal of Financial Data Science, 1(1):159–171, 2019.
- L.-J. Lin. Reinforcement learning for robots using neural networks. PhD thesis, Carnegie Mellon University, 1992.
- R. C. Merton. Theory of rational option pricing. The Bell Journal of Economics and Management Science, 4(1):141–183, 1973.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
- M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 1994.
- C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
- G. Ritter. Machine learning for trading. Available at SSRN 3015609, 2017.
- A. Sani, A. Lazaric, and R. Munos. Risk-aversion in multi-armed bandits. Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
- S. Vakili and Q. Zhao. Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 10(6):1093–1111, 2016.
- M. Schweizer. Variance-optimal hedging in discrete time. Mathematics of Operations Research, 20(1):1–32, 1995.
- D. Silver, A. Huang, C. J. Maddison, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- D. Silver, J. Schrittwieser, K. Simonyan, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
- J. Snoek, O. Rippel, K. Swersky, et al. Scalable Bayesian optimization using deep neural networks. International Conference on Machine Learning, pages 2171–2180, 2015.
- R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
- R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- O. Szehr. Hedging of financial derivative contracts via Monte Carlo tree search. arXiv preprint arXiv:2102.06274, 2021.
- W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems, pages 1075–1081, 1997.
- H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- O. Vinyals, I. Babuschkin, W. M. Czarnecki, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.