Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking (2203.13151v2)

Published 24 Mar 2022 in cs.CL, cs.LG, and stat.ML

Abstract: We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based LLMs (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices, such as selecting its pre-training hyperparameters. We propose a multi-armed bandit framework for the sequential selection of TLM pre-training hyperparameters, aimed at optimizing LLM performance, in a resource efficient manner. We design a Thompson sampling algorithm, with a surrogate Gaussian process reward model of the Masked LLM (MLM) pre-training objective, for its sequential minimization. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance. We empirically demonstrate how GP-TS pre-trains LLMs efficiently, i.e., it achieves lower MLM loss in fewer epochs, across a variety of settings. In addition, GP-TS pre-trained TLMs attain competitive downstream performance, while avoiding expensive hyperparameter grid search. GP-TS provides an interactive framework for efficient and optimized TLM pre-training that, by circumventing costly hyperparameter selection, enables substantial computational savings.
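The abstract's core loop — a GP surrogate over the MLM loss, Thompson sampling over candidate masking probabilities, pick the arm minimizing a posterior sample, observe the loss, update — can be sketched in a few dozen lines. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the RBF kernel settings, the discretized arm grid, and the toy `pretrain_epoch` stand-in for one real pre-training epoch are all hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.1, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of arm values."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-2):
    """GP posterior mean and covariance at X_query given noisy observations."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = rbf_kernel(X_obs, X_query)
    K_ss = rbf_kernel(X_query, X_query)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_obs
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, cov

def gp_thompson_sampling(pretrain_epoch, arms, n_rounds, rng):
    """Each round: sample a loss curve from the GP posterior, pick the
    masking probability minimizing the sample, then observe the MLM loss."""
    X_obs, y_obs = [], []
    for _ in range(n_rounds):
        if not X_obs:
            p = rng.choice(arms)  # no observations yet: explore uniformly
        else:
            mean, cov = gp_posterior(np.array(X_obs), np.array(y_obs), arms)
            sample = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(arms)))
            p = arms[np.argmin(sample)]  # minimize the sampled MLM loss
        loss = pretrain_epoch(p)  # one pre-training epoch at masking prob p
        X_obs.append(p)
        y_obs.append(loss)
    return X_obs, y_obs

# Toy stand-in for one pre-training epoch: a noisy loss whose minimum
# sits near masking probability 0.15 (purely illustrative).
rng = np.random.default_rng(0)
toy_loss = lambda p: (p - 0.15) ** 2 + 0.01 * rng.standard_normal()
arms = np.linspace(0.05, 0.5, 10)
probs, losses = gp_thompson_sampling(toy_loss, arms, n_rounds=20, rng=rng)
```

Because each posterior sample trades off the mean loss against uncertainty, the loop explores under-sampled masking probabilities early and concentrates on low-loss ones later, which is what lets GP-TS avoid an exhaustive grid search over the masking hyperparameter.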

Authors (4)
  1. Iñigo Urteaga (13 papers)
  2. Moulay-Zaïdane Draïdia (1 paper)
  3. Tomer Lancewicki (8 papers)
  4. Shahram Khadivi (29 papers)
Citations (1)