A Neural Scaling Law from Lottery Ticket Ensembling (2310.02258v2)
Abstract: Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predicted that MSE losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we surprisingly find that a simple 1D problem $y=x^2$ manifests a different scaling law ($\alpha=1$) from their prediction ($\alpha=4$). We opened the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as by studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for LLMs and statistical physics-type theories of learning.
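To make the kind of width-scan described in the abstract concrete, below is a minimal sketch (not the authors' code; widths, activation, optimizer, training budget, and seed averaging are illustrative assumptions): train one-hidden-layer MLPs of increasing width on $y=x^2$, then estimate the exponent $\alpha$ in MSE $\propto N^{-\alpha}$ from a log-log fit.

```python
# Illustrative sketch of a width-scan on y = x^2 (assumed hyperparameters).
import numpy as np
import torch
import torch.nn as nn

def train_mlp(width: int, seed: int, steps: int = 5000) -> float:
    """Train a one-hidden-layer MLP on y = x^2 and return its test MSE."""
    torch.manual_seed(seed)
    x_train = torch.linspace(-1, 1, 256).unsqueeze(1)
    y_train = x_train ** 2
    x_test = torch.rand(1024, 1) * 2 - 1
    y_test = x_test ** 2

    model = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(model(x_test), y_test).item()

widths = [8, 16, 32, 64, 128]
num_params, mses = [], []
for w in widths:
    losses = [train_mlp(w, seed) for seed in range(3)]  # average over random seeds
    num_params.append(3 * w + 1)  # parameter count of a 1-hidden-layer scalar MLP
    mses.append(np.mean(losses))

# Slope of log(MSE) vs log(N) estimates -alpha; the paper reports alpha ≈ 1
# for this 1D target, versus alpha = 4/d = 4 from the approximation-theory prediction.
alpha = -np.polyfit(np.log(num_params), np.log(mses), 1)[0]
print(f"fitted scaling exponent alpha ≈ {alpha:.2f}")
```

The fitted slope is only a rough proxy for the paper's measurements, since the exponent depends on training reaching low enough loss at every width.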
- Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
- A constructive prediction of the generalization error across scales. arXiv preprint arXiv:1909.12673, 2019.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
- Data and parameter scaling laws for neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5915–5922, 2021.
- Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- The quantization model of neural scaling. arXiv preprint arXiv:2303.13506, 2023a.
- A neural scaling law from the dimension of the data manifold. arXiv preprint arXiv:2004.10802, 2020.
- Precision machine learning. Entropy, 25(1):175, 2023b.
- The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
- The depth-to-width interplay in self-attention. arXiv preprint arXiv:2006.12467, 2020.
- Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
- Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.
- On power laws in deep ensembles. Advances in Neural Information Processing Systems, 33:2375–2385, 2020.
- Redundant representations help generalization in wide neural networks. arXiv preprint arXiv:2106.03485, 2021.
- The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771, 2023.
- The clock and the pizza: Two stories in mechanistic explanation of neural networks. arXiv preprint arXiv:2306.17844, 2023.
- Seeing is believing: Brain-inspired modular training for mechanistic interpretability. arXiv preprint arXiv:2305.08746, 2023.
- Neural networks and quantum field theory. Machine Learning: Science and Technology, 2(3):035002, 2021.
- The principles of deep learning theory. Cambridge University Press, 2022.
- Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. Advances in Neural Information Processing Systems, 31, 2018.
- Ziming Liu
- Max Tegmark