The Effect of Network Width on the Performance of Large-batch Training (1806.03791v1)

Published 11 Jun 2018 in stat.ML, cs.DC, cs.LG, math.OC, and stat.CO

Abstract: Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that--for a fixed number of parameters--wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.
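As an illustrative aside, the comparison the abstract describes can be sketched in a few lines: a wide and a deep fully-connected network with roughly matched parameter counts, each trained with plain SGD at a large batch size. This is a minimal sketch, not the authors' code; the dataset (MNIST), layer widths, batch size, and learning rate are assumptions chosen for illustration only.

```python
# Minimal sketch (not the paper's experimental code): compare a "wide" and a
# "deep" fully-connected network with roughly matched parameter counts, each
# trained with plain SGD at a large batch size. All hyperparameters here are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


def mlp(sizes):
    """Fully-connected network with ReLU activations between layers."""
    layers = []
    for i in range(len(sizes) - 2):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
    layers.append(nn.Linear(sizes[-2], sizes[-1]))
    return nn.Sequential(*layers)


def train(model, loader, epochs=5, lr=0.1, device="cpu"):
    """Plain SGD training loop; returns the final epoch's average loss."""
    model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        total, n = 0.0, 0
        for x, y in loader:
            x, y = x.view(x.size(0), -1).to(device), y.to(device)
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item() * x.size(0)
            n += x.size(0)
    return total / n


if __name__ == "__main__":
    data = datasets.MNIST(
        "data", train=True, download=True, transform=transforms.ToTensor()
    )
    # Large-batch regime: far fewer gradient updates per epoch than small batches,
    # which is what reduces communication overhead in distributed training.
    loader = DataLoader(data, batch_size=4096, shuffle=True)

    wide = mlp([784, 512, 10])            # one wide hidden layer
    deep = mlp([784, 294, 294, 294, 10])  # three narrower hidden layers

    for name, model in [("wide", wide), ("deep", deep)]:
        params = sum(p.numel() for p in model.parameters())
        loss = train(model, loader)
        print(f"{name}: {params} parameters, final train loss {loss:.4f}")
```

The script prints parameter counts so the match (roughly 0.4M for both configurations) can be verified; holding parameters fixed is what makes the width-versus-depth comparison meaningful, and the paper's claim is that the wide variant tolerates large batches with less convergence slow-down.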

Authors (5)
  1. Lingjiao Chen (27 papers)
  2. Hongyi Wang (62 papers)
  3. Jinman Zhao (20 papers)
  4. Dimitris Papailiopoulos (59 papers)
  5. Paraschos Koutris (41 papers)
Citations (22)
