Neural Inventory Control in Networks via Hindsight Differentiable Policy Optimization (2306.11246v2)
Abstract: We argue that inventory management presents unique opportunities for reliably applying and evaluating deep reinforcement learning (DRL). Toward reliable application, we emphasize and test two techniques. The first is Hindsight Differentiable Policy Optimization (HDPO), which performs stochastic gradient descent to optimize policy performance while avoiding the need to repeatedly deploy randomized policies in the environment, as is common with generic policy gradient methods. Our second technique involves aligning policy (neural) network architectures with the structure of the inventory network. Specifically, we focus on a network with a single warehouse that consolidates inventory from external suppliers, holds it, and then distributes it to many stores as needed. In this setting, we introduce the symmetry-aware policy network architecture. We motivate this architecture by establishing an asymptotic performance guarantee and empirically demonstrate its ability to reduce the amount of data needed to uncover strong policies. Both techniques exploit structures inherent in inventory management problems, moving beyond generic DRL algorithms. Toward rigorous evaluation, we create and share new benchmark problems, divided into two categories. One type focuses on problems with hidden structure that allows us to compute or bound the cost of the true optimal policy. Across four problems of this type, we find that HDPO consistently attains near-optimal performance, handling up to 60-dimensional raw state vectors effectively. The other type of evaluation involves constructing a test problem using real time series data from a large retailer, where the optimum is poorly understood. Here, we find that HDPO methods meaningfully outperform a variety of generalized newsvendor heuristics. Our code can be found at github.com/MatiasAlvo/Neural_inventory_control.
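To make the two techniques concrete, below is a minimal PyTorch-style sketch of HDPO with a symmetry-aware policy for the one-warehouse, many-store setting. It is an illustration under simplifying assumptions rather than the authors' implementation (their code is at the linked repository): lead times are set to zero, stores are treated as homogeneous, and the cost parameters, state features, and network sizes are placeholders. The key point it shows is that, once the demand sample paths are fixed, the cumulative cost is a differentiable function of the policy parameters, so ordinary backpropagation replaces high-variance score-function policy-gradient estimation; the symmetry-aware piece is that a single store network is applied with shared weights to every store.

```python
import torch
import torch.nn as nn

# Placeholder problem sizes and costs (assumptions, not the paper's settings).
N_STORES, HORIZON, BATCH = 10, 50, 256
HOLDING_COST, UNDERAGE_COST = 1.0, 9.0

class SymmetryAwarePolicy(nn.Module):
    """A warehouse ('context') net plus one store net whose weights are shared
    across all stores -- a rough sketch of the symmetry-aware idea; the exact
    features and output heads here are assumptions."""
    def __init__(self, hidden=32):
        super().__init__()
        self.warehouse_net = nn.Sequential(
            nn.Linear(1 + N_STORES, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Softplus()
        )
        self.store_net = nn.Sequential(  # applied identically to every store
            nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Softplus()
        )

    def forward(self, wh_inv, store_inv):
        # wh_inv: (B, 1) warehouse inventory; store_inv: (B, N_STORES)
        wh_order = self.warehouse_net(torch.cat([wh_inv, store_inv], dim=-1))
        per_store = torch.stack([wh_inv.expand_as(store_inv), store_inv], dim=-1)
        store_req = self.store_net(per_store).squeeze(-1)  # shipment requests per store
        return wh_order, store_req

def hindsight_rollout(policy, demand):
    """Differentiable rollout on fixed demand sample paths (the HDPO idea):
    the average cost returned here can be backpropagated directly."""
    B = demand.shape[0]
    wh_inv = torch.ones(B, 1) * 10.0
    store_inv = torch.ones(B, N_STORES) * 5.0
    total_cost = torch.zeros(B)
    for t in range(HORIZON):
        wh_order, store_req = policy(wh_inv, store_inv)
        # Ration proportionally if requests exceed warehouse stock.
        ration = torch.clamp(wh_inv / (store_req.sum(-1, keepdim=True) + 1e-6), max=1.0)
        ship = store_req * ration
        wh_inv = wh_inv - ship.sum(-1, keepdim=True) + wh_order  # zero lead time assumed
        store_inv = store_inv + ship
        sales = torch.minimum(store_inv, demand[:, t])
        lost = demand[:, t] - sales                              # lost-sales dynamics
        store_inv = store_inv - sales
        total_cost = total_cost + HOLDING_COST * (store_inv.sum(-1) + wh_inv.squeeze(-1)) \
                                + UNDERAGE_COST * lost.sum(-1)
    return total_cost.mean()

policy = SymmetryAwarePolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(1000):
    # Sampled demand scenarios; Poisson rates are an arbitrary placeholder.
    demand = torch.poisson(torch.full((BATCH, HORIZON, N_STORES), 4.0))
    loss = hindsight_rollout(policy, demand)
    opt.zero_grad()
    loss.backward()  # plain gradient descent on average cost, no REINFORCE-style estimator
    opt.step()
```

Because the store network's weights are reused across all stores, the number of parameters does not grow with the number of stores, which is one hedged reading of why the symmetry-aware architecture reduces the data needed to learn strong policies.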
- Matias Alvo
- Daniel Russo
- Yash Kanoria