Self-Improvement for Neural Combinatorial Optimization: Sample without Replacement, but Improvement (2403.15180v2)
Abstract: Current methods for end-to-end constructive neural combinatorial optimization usually train a policy using behavior cloning from expert solutions or policy gradient methods from reinforcement learning. While behavior cloning is straightforward, it requires expensive expert solutions, and policy gradient methods are often computationally demanding and complex to fine-tune. In this work, we bridge the two and simplify the training process by sampling multiple solutions for random instances using the current model in each epoch and then selecting the best solution as an expert trajectory for supervised imitation learning. To achieve progressively improving solutions with minimal sampling, we introduce a method that combines round-wise Stochastic Beam Search with an update strategy derived from a provable policy improvement. This strategy refines the policy between rounds by utilizing the advantage of the sampled sequences with almost no computational overhead. We evaluate our approach on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. The models trained with our method achieve comparable performance and generalization to those trained with expert data. Additionally, we apply our method to the Job Shop Scheduling Problem using a transformer-based architecture and outperform existing state-of-the-art methods by a wide margin.
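The self-improvement loop in the abstract can be sketched in miniature: sample several tours per random instance, keep the shortest as a pseudo-expert trajectory, and take a maximum-likelihood (imitation) step toward it. This is only an illustrative toy, not the paper's method: a one-parameter distance-based policy stands in for the transformer, and plain multinomial sampling replaces the round-wise Stochastic Beam Search and advantage-based update; all function names here are hypothetical.

```python
import math, random

def tour_length(coords, tour):
    # length of the closed tour over 2-D city coordinates
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_tour(coords, theta, rng):
    # autoregressive policy: P(next city) ∝ exp(-theta * distance to it)
    tour, remaining = [0], list(range(1, len(coords)))
    while remaining:
        dists = [math.dist(coords[tour[-1]], coords[c]) for c in remaining]
        probs = softmax([-theta * d for d in dists])
        idx = rng.choices(range(len(remaining)), weights=probs)[0]
        tour.append(remaining.pop(idx))
    return tour

def logp_grad(coords, theta, tour):
    # d/dtheta of the log-likelihood of `tour` under the policy:
    # at each step, E[distance under policy] minus distance actually taken
    grad, remaining = 0.0, list(range(1, len(coords)))
    for t in range(len(tour) - 1):
        dists = [math.dist(coords[tour[t]], coords[c]) for c in remaining]
        probs = softmax([-theta * d for d in dists])
        j = remaining.index(tour[t + 1])
        grad += sum(p * d for p, d in zip(probs, dists)) - dists[j]
        remaining.pop(j)
    return grad

rng = random.Random(0)
theta, lr, K = 0.0, 0.05, 16          # policy parameter, step size, samples/instance
for epoch in range(20):
    coords = [(rng.random(), rng.random()) for _ in range(10)]  # fresh random instance
    tours = [sample_tour(coords, theta, rng) for _ in range(K)]
    best = min(tours, key=lambda t: tour_length(coords, t))     # pseudo-expert tour
    theta += lr * logp_grad(coords, theta, best)                # imitation step
```

Because the imitated tour is the best of K samples, its chosen edges tend to be shorter than the policy's expectation, so the gradient step makes the policy greedier over rounds; the paper replaces the naive best-of-K sampling with sampling without replacement and a provable policy-improvement update.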
Authors: Jonathan Pirnay, Dominik G. Grimm