The Effect of Network Width on the Performance of Large-batch Training (1806.03791v1)

Published 11 Jun 2018 in stat.ML, cs.DC, cs.LG, math.OC, and stat.CO

Abstract: Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that--for a fixed number of parameters--wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.
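To make the width-versus-depth comparison concrete, below is a minimal PyTorch sketch (not the authors' code) that builds a wide-shallow and a narrow-deep fully-connected network with roughly equal parameter counts and trains both with large-batch SGD. The layer widths, depths, batch size of 4096, learning rate, and synthetic data are illustrative assumptions, not values taken from the paper.

```python
# Sketch: compare a wide-shallow vs. a narrow-deep MLP under large-batch SGD.
# All hyperparameters here are placeholders chosen only to roughly match
# parameter counts; they are not from the paper's experiments.
import torch
import torch.nn as nn

def mlp(in_dim, hidden, depth, out_dim):
    """Fully-connected network with `depth` hidden layers of width `hidden`."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

def num_params(model):
    return sum(p.numel() for p in model.parameters())

wide = mlp(in_dim=784, hidden=1024, depth=2,  out_dim=10)  # wide, shallow (~1.86M params)
deep = mlp(in_dim=784, hidden=256,  depth=26, out_dim=10)  # narrow, deep  (~1.85M params)
print("wide:", num_params(wide), "deep:", num_params(deep))

# Large-batch SGD on synthetic data (stand-in for a real dataset).
batch_size = 4096
x = torch.randn(batch_size, 784)
y = torch.randint(0, 10, (batch_size,))
loss_fn = nn.CrossEntropyLoss()

for name, model in [("wide", wide), ("deep", deep)]:
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for step in range(10):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    print(name, "final loss:", loss.item())
```

With a fixed parameter budget, the paper's claim can be probed by sweeping the batch size for each architecture and tracking how many steps each needs to reach a target loss; the wider network is predicted to tolerate the larger batches without a convergence slow-down.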

Authors (5)
  1. Lingjiao Chen (27 papers)
  2. Hongyi Wang (62 papers)
  3. Jinman Zhao (20 papers)
  4. Dimitris Papailiopoulos (59 papers)
  5. Paraschos Koutris (41 papers)
Citations (22)