LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters (2405.16287v1)

Published 25 May 2024 in cs.LG

Abstract: A good initialization of deep learning models is essential, since it helps them converge better and faster. However, pretraining large models is unaffordable for many researchers, which makes predicting good initial parameters all the more desirable. Graph HyperNetworks (GHNs), one approach to predicting model parameters, have recently shown strong performance in initializing large vision models. Unfortunately, predicting the parameters of very wide networks relies on copying small chunks of parameters multiple times and requires an extremely large number of parameters to support full prediction, which greatly hinders adoption in practice. To address this limitation, we propose LoGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that expands to significantly wider networks without requiring as excessive an increase in parameters as previous attempts. LoGAH allows us to predict the parameters of 774-million-parameter neural networks in a memory-efficient manner. We show that vision and language models (i.e., ViT and GPT-2) initialized with LoGAH achieve better performance than those initialized randomly or using existing hypernetworks. Furthermore, we show promising transfer-learning results in which LoGAH is trained on small datasets and the predicted parameters are used to initialize models for larger tasks. We provide the code at https://github.com/Blackzxy/LoGAH .


Summary

  • The paper presents LoGAH, a novel graph hypernetwork that predicts the parameters of a 774-million-parameter Transformer while itself using roughly 1/100 as many parameters.
  • It employs a low-rank decoder to reduce parameter growth from O(d^3) to O(d^2), enabling scalable initialization across both vision and language models.
  • Experimental results show improved transfer learning performance on CIFAR-10, CIFAR-100, and ImageNet, with diverse parameter predictions that enhance model convergence.

Overview of LoGAH: Predicting Large-Scale Transformer Parameters Efficiently

The paper "LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters" (2405.16287) focuses on addressing the challenges associated with initializing large-scale neural networks, particularly Transformers such as Vision Transformers (ViTs) and GPT-2 models. This paper introduces LoGAH, a novel Graph HyperNetwork that efficiently predicts parameters with significantly reduced computational requirements compared to traditional methods.

Introduction and Problem Statement

Large-scale models, including ViTs and GPTs, require extensive computational resources to train because of their massive size, ranging from hundreds of millions to billions of parameters. Pretraining these models from scratch is costly and often impractical for many researchers and institutions. Existing Graph HyperNetworks (GHNs) can predict parameters for neural architectures, but to cover very wide layers they must copy small predicted parameter chunks many times or adopt a much larger hidden size, which makes the hypernetwork's own parameter count grow cubically with that hidden size.
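
To make this limitation concrete, the following minimal sketch (an illustrative simplification with assumed dimensions, not GHN-3's actual code) shows how a small predicted tile would have to be repeated to fill a wide layer's weight matrix:

```python
import torch

# Chunk-copying sketch: if the decoder can only emit a small d x d tile but the
# target layer needs a D x D weight with D > d, the tile is repeated to fill it.
def tile_to_shape(tile: torch.Tensor, out_shape: tuple) -> torch.Tensor:
    d0, d1 = tile.shape
    reps = (-(-out_shape[0] // d0), -(-out_shape[1] // d1))  # ceil division
    return tile.repeat(*reps)[: out_shape[0], : out_shape[1]]

tile = torch.randn(64, 64)                  # what a small decoder can predict
weight = tile_to_shape(tile, (1280, 1280))  # a much wider target layer
print(weight.shape)  # torch.Size([1280, 1280]); the values are repeated copies of one tile
```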

The LoGAH model addresses these problems with a low-rank parameter decoder that supports wider networks without requiring excessive parameters. This approach is especially beneficial for predicting the parameters of models such as GPT-2-Large, with 774 million parameters, using a hypernetwork roughly 1/100 of that size.

Methodology: LoGAH's Low-Rank Decoder

LoGAH employs a low-rank parameter decoder that reduces the decoder's parameter growth from O(d^3) to O(d^2), where d is the hidden size. This efficiency is achieved through a low-rank decomposition of the predicted weight tiles, in the spirit of low-rank adaptation methods such as LoRA, which allows the model to predict larger parameter tensors more effectively. As a result, LoGAH supports full parameter prediction without excessive parameter repetition and scales to much larger networks.
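
To illustrate the scaling argument, here is a minimal sketch (with hypothetical layer names and dimensions, not the authors' exact decoder) comparing a naive head that emits a full d x d weight tile with a low-rank head that emits two rank-r factors:

```python
import torch
import torch.nn as nn

class FullRankDecoderHead(nn.Module):
    """Naive head: maps a node embedding of size h directly to a d x d weight tile.
    The projection alone has h * d * d parameters, so with h ~ d it grows as O(d^3)."""
    def __init__(self, h: int, d: int):
        super().__init__()
        self.proj = nn.Linear(h, d * d)
        self.d = d

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.proj(z).view(-1, self.d, self.d)

class LowRankDecoderHead(nn.Module):
    """Low-rank head in the spirit of LoGAH: predict factors A (d x r) and B (r x d)
    and form the tile as A @ B. The projection has h * 2 * d * r parameters,
    i.e. O(d^2) for h ~ d and a fixed small rank r."""
    def __init__(self, h: int, d: int, r: int):
        super().__init__()
        self.proj = nn.Linear(h, 2 * d * r)
        self.d, self.r = d, r

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        a, b = self.proj(z).split(self.d * self.r, dim=-1)
        A = a.view(-1, self.d, self.r)
        B = b.view(-1, self.r, self.d)
        return A @ B  # (batch, d, d) predicted weight tile

h, d, r = 256, 1024, 32  # hypothetical sizes for illustration
count = lambda m: sum(p.numel() for p in m.parameters())
print(f"full-rank head: {count(FullRankDecoderHead(h, d)):,} parameters")    # ~269M
print(f"low-rank head:  {count(LowRankDecoderHead(h, d, r)):,} parameters")  # ~16.8M
```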

This significant reduction in parameters enables LoGAH to generalize hypernetwork capabilities across model scales, from tiny to large architectures, while offering promising results for both vision and language tasks.

Figure 1: Comparison of parameter counts between GHN-3 and LoGAH. GHN-3 requires a larger hidden size to support wider networks, which sharply increases its parameter count.

Experimental Results

Vision and Language Models

LoGAH demonstrates robust performance in initializing large models, outperforming random and existing hypernetwork initializations. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that models initialized with LoGAH perform better than those initialized randomly or with GHN-3.

Figure 2: ViT transfer learning experiments. LoGAH trained on CIFAR-10 (resp. CIFAR-100) predicts the ViT's parameters, and the ViT is then trained on CIFAR-100 (resp. ImageNet). T, S, B, and L denote the Tiny, Small, Base, and Large versions of LoGAH, respectively.

The experiments extend to transfer learning, showing that LoGAH can predict parameters that adapt well to harder datasets and yield improved performance when moving across data distributions, including datasets significantly larger than the one LoGAH was trained on. These results suggest that LoGAH's approach to parameter prediction enables scalable learning with reduced pretraining costs.
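
The workflow implied here is to load the predicted parameters as an initialization and then fine-tune on the target task. The sketch below (illustrative assumptions throughout; a random stand-in replaces LoGAH's real predictions, which would come from the released code) shows that pattern with a standard torchvision ViT:

```python
import torch
from torchvision.models import vit_b_16

# Build the target architecture, e.g. for a CIFAR-100-sized classification task.
model = vit_b_16(weights=None, num_classes=100)

# A hypernetwork such as LoGAH would emit a state dict of predicted parameters for
# this architecture; here a random stand-in keeps the sketch self-contained.
predicted_state = {
    k: torch.randn_like(v) * 0.02 if v.is_floating_point() else v
    for k, v in model.state_dict().items()
}
model.load_state_dict(predicted_state)

# Fine-tune as usual; the paper's claim is that such a predicted initialization
# converges better than a random one.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
```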

Parameter Diversity

The paper also highlights the diversity of the parameters predicted by LoGAH, emphasizing its ability to produce varied, rather than heavily repeated, initial weights that help models converge on downstream tasks. This diversity is particularly notable in challenging tasks where traditional random initialization may fall short.
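
One simple way to see this effect (an illustrative check with assumed sizes, not the paper's own metric) is to compare the block-wise similarity of a weight matrix built by tiling one small chunk against one formed as a low-rank product:

```python
import torch
import torch.nn.functional as F

def mean_block_similarity(W: torch.Tensor, d: int) -> float:
    """Average cosine similarity between the d x d blocks of a square weight matrix."""
    blocks = W.unfold(0, d, d).unfold(1, d, d).reshape(-1, d * d)
    sim = F.normalize(blocks, dim=1) @ F.normalize(blocks, dim=1).T
    n = sim.shape[0]
    return ((sim.sum() - n) / (n * (n - 1))).item()  # mean of off-diagonal entries

d, D, r = 64, 256, 8  # hypothetical block size, layer width, and rank
tiled = torch.randn(d, d).repeat(D // d, D // d)  # chunk-copied weight
low_rank = torch.randn(D, r) @ torch.randn(r, D)  # low-rank predicted weight
print(mean_block_similarity(tiled, d))     # ~1.0: every block is identical
print(mean_block_similarity(low_rank, d))  # near 0: blocks differ from each other
```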

Theoretical and Practical Implications

LoGAH's contributions lie in offering a scalable solution for initializing very large models efficiently, with practical implications for reducing computational expenditure and speeding up pretraining in both vision and language models. The model supports wider networks without the cubic growth in hypernetwork parameters seen in earlier GHNs, suggesting that model scale can be increased without a proportional increase in the cost of parameter prediction.

Conclusion

LoGAH presents a significant advancement in hypernetwork-based initialization, facilitating efficient and scalable pretraining of large models like ViTs and GPTs. By reducing the parameters required for predicting Transformer models, the approach opens new avenues for research into hypernetwork applications and large-scale model adaptations, expediting the deployment of sophisticated AI systems with more accessible resources.
