Optimal allocation of prompts versus generations per prompt at very large batch sizes
Determine whether, at very large global batch sizes (e.g., 2k+ completions per step), allocating the batch to more prompts or to more generations per prompt yields better asymptotic performance and compute efficiency. Construct a principled rule for this allocation under a fixed total batch size.
References
For a fixed total batch, is it better to allocate more prompts or more generations per prompt? Sweeping generations per prompt over {8,16,24,32} and adjusting the number of prompts to keep the total batch fixed leaves the fitted scaling curves essentially unchanged (Appendix~\ref{appendix:large_scale}), suggesting that, at moderate batch sizes, this allocation is a second-order choice for both A and B. Clearer differences may emerge at much larger batches (e.g., 2k+), which we leave for future work.
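The constraint behind the sweep is simple arithmetic: total batch = prompts × generations per prompt, so each choice of generations per prompt fixes the number of prompts. The following is a minimal sketch of that bookkeeping, not the setup used in the excerpt. The function names and the total batch of 768 are hypothetical, and 768 was picked only because it is divisible by every value in {8,16,24,32}.

```python
from typing import Dict, Iterable


def prompts_for_fixed_batch(total_batch: int, generations_per_prompt: int) -> int:
    """Return the prompt count P such that P * generations_per_prompt == total_batch."""
    if total_batch <= 0 or generations_per_prompt <= 0:
        raise ValueError("total_batch and generations_per_prompt must be positive")
    if total_batch % generations_per_prompt != 0:
        # Refuse uneven splits so that every configuration uses exactly the same batch.
        raise ValueError(
            f"{total_batch} is not divisible by {generations_per_prompt}"
        )
    return total_batch // generations_per_prompt


def allocation_sweep(
    total_batch: int, generations_options: Iterable[int]
) -> Dict[int, int]:
    """Map each generations-per-prompt setting to the matching number of prompts."""
    return {g: prompts_for_fixed_batch(total_batch, g) for g in generations_options}


if __name__ == "__main__":
    TOTAL_BATCH = 768  # hypothetical value, not taken from the excerpt
    for gens, prompts in allocation_sweep(TOTAL_BATCH, [8, 16, 24, 32]).items():
        print(f"generations/prompt={gens:2d}  prompts={prompts:3d}  total={gens * prompts}")
```

Note that a power-of-two batch such as 2048 is not divisible by 24, so a sweep at that scale would need either a different set of generation counts or a total batch chosen as a common multiple.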