Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training (2306.08055v1)

Published 13 Jun 2023 in cs.LG and cs.AI

Abstract: Hyperparameter tuning of deep learning models can lead to order-of-magnitude performance gains for the same amount of compute. Despite this, systematic tuning is uncommon, particularly for large models, which are expensive to evaluate and tend to have many hyperparameters, necessitating difficult judgment calls about tradeoffs, budgets, and search bounds. To address these issues and propose a practical method for robustly tuning large models, we present Cost-Aware Pareto Region Bayesian Search (CARBS), a Bayesian optimization algorithm that performs local search around the performance-cost Pareto frontier. CARBS does well even in unbounded search spaces with many hyperparameters, learns scaling relationships so that it can tune models even as they are scaled up, and automates much of the "black magic" of tuning. Among our results, we effectively solve the entire ProcGen benchmark just by tuning a simple baseline (PPO, as provided in the original ProcGen paper). We also reproduce the model size vs. training tokens scaling result from the Chinchilla project (Hoffmann et al. 2022), while simultaneously discovering scaling laws for every other hyperparameter, via an easy automated process that uses significantly less compute and is applicable to any deep learning problem (not just LLMs).
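
To make the search pattern described above concrete, the sketch below keeps a history of (hyperparameter, performance, cost) observations, computes the performance-cost Pareto frontier, and proposes new points by perturbing frontier members in log-space. It is a minimal illustrative toy, not the authors' implementation: the nearest-neighbour scoring is an assumed stand-in for the surrogate models and cost-aware acquisition function a Bayesian optimization method such as CARBS would use, and the function names and synthetic objective are invented for the example.

```python
# Toy sketch of "local search around the performance-cost Pareto frontier".
# Assumptions: hyperparameters are positive scalars (so log-space perturbations
# make sense), higher performance is better, lower cost is better.
import numpy as np

def pareto_front(perf, cost):
    """Indices of observations not dominated in (maximize perf, minimize cost)."""
    keep = []
    for i in range(len(perf)):
        at_least_as_good = (perf >= perf[i]) & (cost <= cost[i])
        strictly_better = (perf > perf[i]) | (cost < cost[i])
        if not np.any(at_least_as_good & strictly_better):
            keep.append(i)
    return np.array(keep)

def suggest(params, perf, cost, rng, sigma=0.3, n_candidates=64):
    """Propose one new hyperparameter point near the current Pareto frontier."""
    front = pareto_front(perf, cost)
    centers = params[rng.choice(front, size=n_candidates)]
    # Local search: log-space Gaussian perturbations around frontier members.
    candidates = centers * np.exp(sigma * rng.standard_normal(centers.shape))
    # Crude stand-in for a surrogate model: score each candidate by the
    # performance of its nearest observed neighbour in log-space.
    dists = np.linalg.norm(
        np.log(candidates)[:, None, :] - np.log(params)[None, :, :], axis=-1)
    scores = perf[np.argmin(dists, axis=1)]
    return candidates[np.argmax(scores)]

# Synthetic objective for demonstration only: performance peaks at one learning
# rate and improves slightly with width; cost grows with width.
def evaluate(x):
    perf = -np.log(x[..., 0] / 3e-3) ** 2 + 1e-3 * x[..., 1]
    cost = x[..., 1] ** 1.5
    return perf, cost

rng = np.random.default_rng(0)
params = rng.uniform([1e-4, 64.0], [1e-1, 1024.0], size=(8, 2))  # (lr, width)
perf, cost = evaluate(params)
for _ in range(20):
    x = suggest(params, perf, cost, rng)
    p, c = evaluate(x)
    params = np.vstack([params, x])
    perf, cost = np.append(perf, p), np.append(cost, c)
```

Searching in log-space reflects the fact that hyperparameters such as learning rate and model size vary over orders of magnitude; a full cost-aware tuner would additionally model evaluation cost so that cheap configurations are explored before expensive ones, letting the frontier extend toward larger models over time.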

References (54)
  1. Cost-aware Multi-objective Bayesian optimisation. arXiv preprint arXiv:1909.03600, 2019.
  2. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
  3. Multi-Step Budgeted Bayesian Optimization with Unknown Evaluation Costs. arXiv preprint arXiv:2111.06537, 2021.
  4. Revisiting resnets: Improved training and scaling strategies. CoRR, abs/2103.07579, 2021. URL https://arxiv.org/abs/2103.07579.
  5. Openai gym, 2016.
  6. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165, 2020.
  7. Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143, 2022.
  8. Leveraging Procedural Generation to Benchmark Reinforcement Learning. arXiv preprint arXiv:1912.01588, 2019.
  9. HEBO Pushing The Limits of Sample-Efficient Hyperparameter Optimisation. arXiv preprint arXiv:2012.03826, 2020.
  10. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp.  702–703, 2020.
  11. Scaling Laws for Acoustic Models. arXiv preprint arXiv:2106.09488, 2021.
  12. Scalable global optimization via local bayesian optimization. CoRR, abs/1910.01739, 2019. URL http://arxiv.org/abs/1910.01739.
  13. FastAI. imagenette. https://github.com/fastai/imagenette, 2022.
  14. Cautious bayesian optimization for efficient and scalable policy search. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144 of Proceedings of Machine Learning Research, pp.  227–240. PMLR, 07 – 08 June 2021.
  15. Pareto-efficient Acquisition Functions for Cost-Aware Bayesian Optimization. arXiv preprint arXiv:2011.11456, 2020.
  16. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9:159–195, 06 2001. doi: 10.1162/106365601750190398.
  17. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  18. Scaling laws for autoregressive generative modeling. CoRR, abs/2010.14701, 2020. URL https://arxiv.org/abs/2010.14701.
  19. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017. URL http://arxiv.org/abs/1712.00409.
  20. Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556, 2022.
  21. The 37 implementation details of proximal policy optimization. In ICLR Blog Track, 2022. URL https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/.
  22. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361, 2020.
  23. Karpathy, A. mingpt. https://github.com/karpathy/minGPT, 2022.
  24. Cost-aware Bayesian Optimization. arXiv preprint arXiv:2003.10870, 2020.
  25. A Nonmyopic Approach to Cost-Constrained Bayesian Optimization. arXiv preprint arXiv:2106.06079, 2021.
  26. A System for Massively Parallel Hyperparameter Tuning. arXiv preprint arXiv:1810.05934, 2018.
  27. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. arXiv preprint arXiv:2002.11794, 2020.
  28. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118, 2018.
  29. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
  30. MosaicML. composer. https://github.com/mosaicml/composer/, 2021.
  31. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  32. Multi-Information Source Optimization. arXiv preprint arXiv:1603.00389, 2016.
  33. PyTorch. Language modeling with nn.Transformer and torchtext. https://pytorch.org/tutorials/beginner/transformer_tutorial.html, 2022a.
  34. PyTorch. ImageNet training in PyTorch. https://github.com/pytorch/examples/tree/main/imagenet, 2022b.
  35. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv preprint arXiv:1910.10683, 2019.
  36. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  37. A Constructive Prediction of the Generalization Error Across Scales. arXiv preprint arXiv:1909.12673, 2019.
  38. HITY workshop poll, NeurIPS 2022. https://github.com/fsschneider/HITYWorkshopPoll, 2022.
  39. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
  40. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016. doi: 10.1109/JPROC.2015.2494218.
  41. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
  42. Practical Bayesian Optimization of Machine Learning Algorithms. arXiv preprint arXiv:1206.2944, 2012.
  43. Multi-task bayesian optimization. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 26, 2013. URL https://proceedings.neurips.cc/paper/2013/file/f33ba15effa5c10e873bf3842afb46a6-Paper.pdf.
  44. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2818–2826, 2016.
  45. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946, 2019.
  46. Scale efficiently: Insights from pre-training and fine-tuning transformers. CoRR, abs/2109.10686, 2021. URL https://arxiv.org/abs/2109.10686.
  47. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. CoRR, abs/2104.10201, 2021. URL https://arxiv.org/abs/2104.10201.
  48. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  49. Economical hyperparameter optimization with blended search strategy. In The Ninth International Conference on Learning Representations (ICLR 2021), May 2021.
  50. Bayesian Optimization in a Billion Dimensions via Random Embeddings. arXiv preprint arXiv:1301.1942, 2013.
  51. EnvPool: A highly parallel reinforcement learning environment execution engine. arXiv preprint arXiv:2206.10558, 2022.
  52. Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning. arXiv preprint arXiv:1903.04703, 2019.
  53. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. arXiv preprint arXiv:2203.03466, 2022.
  54. Scaling vision transformers. CoRR, abs/2106.04560, 2021. URL https://arxiv.org/abs/2106.04560.