The Effect of Network Width on the Performance of Large-batch Training (1806.03791v1)

Published 11 Jun 2018 in stat.ML, cs.DC, cs.LG, math.OC, and stat.CO

Abstract: Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that--for a fixed number of parameters--wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.
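To make the width-versus-depth comparison concrete, below is a minimal PyTorch sketch (not the authors' code) that builds a wide-shallow and a narrow-deep fully-connected network with roughly equal parameter counts and trains both with large-batch SGD. The layer widths, depths, batch size of 4096, learning rate, and synthetic data are illustrative assumptions, not values taken from the paper.

```python
# Sketch: compare a wide-shallow vs. a narrow-deep MLP under large-batch SGD.
# All hyperparameters here are placeholders chosen only to roughly match
# parameter counts; they are not from the paper's experiments.
import torch
import torch.nn as nn

def mlp(in_dim, hidden, depth, out_dim):
    """Fully-connected network with `depth` hidden layers of width `hidden`."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

def num_params(model):
    return sum(p.numel() for p in model.parameters())

wide = mlp(in_dim=784, hidden=1024, depth=2,  out_dim=10)  # wide, shallow (~1.86M params)
deep = mlp(in_dim=784, hidden=256,  depth=26, out_dim=10)  # narrow, deep  (~1.85M params)
print("wide:", num_params(wide), "deep:", num_params(deep))

# Large-batch SGD on synthetic data (stand-in for a real dataset).
batch_size = 4096
x = torch.randn(batch_size, 784)
y = torch.randint(0, 10, (batch_size,))
loss_fn = nn.CrossEntropyLoss()

for name, model in [("wide", wide), ("deep", deep)]:
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for step in range(10):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    print(name, "final loss:", loss.item())
```

With a fixed parameter budget, the paper's claim can be probed by sweeping the batch size for each architecture and tracking how many steps each needs to reach a target loss; the wider network is predicted to tolerate the larger batches without a convergence slow-down.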

Authors (5)
  1. Lingjiao Chen (27 papers)
  2. Hongyi Wang (62 papers)
  3. Jinman Zhao (20 papers)
  4. Dimitris Papailiopoulos (59 papers)
  5. Paraschos Koutris (41 papers)
Citations (22)