Risk-Aware Linear Bandits: Theory and Applications in Smart Order Routing
Abstract: Motivated by practical considerations in machine learning for financial decision-making, such as risk aversion and large action spaces, we study risk-aware bandit optimization with applications in smart order routing (SOR). Specifically, based on preliminary observations of linear price impact in the Nasdaq ITCH dataset, we initiate the study of risk-aware linear bandits. In this setting, we aim to minimize regret, which measures our performance deficit relative to the optimum, under the mean-variance metric when facing a set of actions whose rewards are linear functions of (initially) unknown parameters. Building on the variance-minimizing globally optimal (G-optimal) design, we propose the novel instance-independent Risk-Aware Explore-then-Commit (RISE) algorithm and the instance-dependent Risk-Aware Successive Elimination (RISE++) algorithm. We then rigorously establish near-optimal regret upper bounds showing that, by leveraging the linear structure, our algorithms dramatically reduce regret compared to existing methods. Finally, we demonstrate the performance of both algorithms through extensive numerical experiments in the SOR setup, using both synthetic data and the Nasdaq ITCH dataset. Our results reveal that 1) the linear structure assumption is indeed well supported by the Nasdaq data, and, more importantly, 2) both RISE and RISE++ significantly outperform competing methods in terms of regret, especially in complex decision-making scenarios.
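To make the setting concrete, here is a minimal, self-contained sketch of a risk-aware explore-then-commit strategy on a linear bandit, in the spirit of the approach the abstract describes. All instance parameters (`d`, `K`, `T`, `rho`, the arm features, the per-arm noise levels) are hypothetical toy values, the exploration phase is plain round-robin rather than a G-optimal design, and the mean-variance score uses one common convention (maximize mean minus `rho` times variance); it is an illustration of the idea, not the paper's RISE algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: K arms with feature vectors in R^d.
# The reward of arm a is <x_a, theta*> plus Gaussian noise whose
# standard deviation differs per arm, so risk matters, not just the mean.
d, K, rho = 3, 5, 1.0                      # rho: assumed risk-aversion weight
X = rng.normal(size=(K, d))                # arm feature vectors
theta_star = rng.normal(size=d)            # unknown parameter
sigma = rng.uniform(0.1, 1.0, size=K)      # per-arm noise std (unknown to learner)

def pull(a):
    """Sample one noisy reward from arm a."""
    return X[a] @ theta_star + sigma[a] * rng.normal()

# Explore phase: round-robin pulls. (A G-optimal design would instead
# concentrate pulls on a few informative arms; uniform keeps the sketch short.)
n_explore = 50 * K
counts = np.zeros(K, dtype=int)
sums = np.zeros(K)
sumsq = np.zeros(K)
A = np.zeros((d, d))
b = np.zeros(d)
for t in range(n_explore):
    a = t % K
    r = pull(a)
    counts[a] += 1
    sums[a] += r
    sumsq[a] += r * r
    A += np.outer(X[a], X[a])
    b += X[a] * r

# Least-squares estimate of theta* (small ridge term for numerical safety),
# plus empirical per-arm reward variances.
theta_hat = np.linalg.solve(A + 1e-6 * np.eye(d), b)
mean_hat = X @ theta_hat
var_hat = sumsq / counts - (sums / counts) ** 2

# Commit phase: play the arm with the best empirical mean-variance score
# for the remainder of the horizon.
mv_hat = mean_hat - rho * var_hat
a_commit = int(np.argmax(mv_hat))
```

The key point the linear structure buys is that `theta_hat` is shared across all arms, so even arms pulled rarely get accurate mean estimates; a purely arm-by-arm risk-aware bandit method must estimate each arm's mean from its own pulls alone.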