LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters (2405.16287v1)

Published 25 May 2024 in cs.LG

Abstract: A good initialization of deep learning models is essential, since it helps them converge better and faster. However, pretraining large models is unaffordable for many researchers, which makes predicting good initial parameters all the more desirable. Graph HyperNetworks (GHNs), one approach to predicting model parameters, have recently shown strong performance in initializing large vision models. Unfortunately, predicting the parameters of very wide networks relies on copying small chunks of parameters multiple times and requires an extremely large number of parameters to support full prediction, which greatly hinders adoption in practice. To address this limitation, we propose LoGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that expands to significantly wider networks without requiring as excessive an increase in parameters as previous attempts. LoGAH allows us to predict the parameters of 774-million-parameter neural networks in a memory-efficient manner. We show that vision and language models (i.e., ViT and GPT-2) initialized with LoGAH achieve better performance than those initialized randomly or using existing hypernetworks. Furthermore, we show promising transfer-learning results in which LoGAH is trained on small datasets and the predicted parameters are used to initialize models for larger tasks. We provide the code at https://github.com/Blackzxy/LoGAH .


Summary

  • The paper presents LoGAH, a novel graph hypernetwork that predicts the parameters of a 774-million-parameter Transformer while itself using roughly 1/100 as many parameters.
  • It employs a low-rank decoder to reduce parameter growth from O(d^3) to O(d^2), enabling scalable initialization across both vision and language models.
  • Experimental results show improved transfer learning performance on CIFAR-10, CIFAR-100, and ImageNet, with diverse parameter predictions that enhance model convergence.

Overview of LoGAH: Predicting Large-Scale Transformer Parameters Efficiently

The paper "LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters" (2405.16287) focuses on addressing the challenges associated with initializing large-scale neural networks, particularly Transformers such as Vision Transformers (ViTs) and GPT-2 models. This paper introduces LoGAH, a novel Graph HyperNetwork that efficiently predicts parameters with significantly reduced computational requirements compared to traditional methods.

Introduction and Problem Statement

Large-scale models, including ViTs and GPTs, require extensive computational resources to train because of their massive size, ranging from hundreds of millions to billions of parameters. Pretraining these models from scratch is costly and often impractical for many researchers and institutions. Existing Graph HyperNetworks (GHNs) can predict parameters for neural architectures, but to cover very wide layers they must copy small predicted parameter chunks many times or adopt a much larger hidden size, which makes the hypernetwork's own parameter count grow cubically with that hidden size.
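
To make this limitation concrete, the following minimal sketch (an illustrative simplification with assumed dimensions, not GHN-3's actual code) shows how a small predicted tile would have to be repeated to fill a wide layer's weight matrix:

```python
import torch

# Chunk-copying sketch: if the decoder can only emit a small d x d tile but the
# target layer needs a D x D weight with D > d, the tile is repeated to fill it.
def tile_to_shape(tile: torch.Tensor, out_shape: tuple) -> torch.Tensor:
    d0, d1 = tile.shape
    reps = (-(-out_shape[0] // d0), -(-out_shape[1] // d1))  # ceil division
    return tile.repeat(*reps)[: out_shape[0], : out_shape[1]]

tile = torch.randn(64, 64)                  # what a small decoder can predict
weight = tile_to_shape(tile, (1280, 1280))  # a much wider target layer
print(weight.shape)  # torch.Size([1280, 1280]); the values are repeated copies of one tile
```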

The LoGAH model addresses these problems with a low-rank parameter decoder that supports wider networks without requiring excessive parameters. This approach is especially beneficial for predicting the parameters of models such as GPT-2-Large, with 774 million parameters, using a hypernetwork roughly 1/100 of that size.

Methodology: LoGAH's Low-Rank Decoder

LoGAH employs a low-rank parameter decoder that reduces the decoder's parameter growth from O(d^3) to O(d^2), where d is the hidden size. This efficiency is achieved through a low-rank decomposition of the predicted weight tiles, in the spirit of low-rank adaptation methods such as LoRA, which allows the model to predict larger parameter tensors more effectively. As a result, LoGAH supports full parameter prediction without excessive parameter repetition and scales to much larger networks.
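
To illustrate the scaling argument, here is a minimal sketch (with hypothetical layer names and dimensions, not the authors' exact decoder) comparing a naive head that emits a full d x d weight tile with a low-rank head that emits two rank-r factors:

```python
import torch
import torch.nn as nn

class FullRankDecoderHead(nn.Module):
    """Naive head: maps a node embedding of size h directly to a d x d weight tile.
    The projection alone has h * d * d parameters, so with h ~ d it grows as O(d^3)."""
    def __init__(self, h: int, d: int):
        super().__init__()
        self.proj = nn.Linear(h, d * d)
        self.d = d

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.proj(z).view(-1, self.d, self.d)

class LowRankDecoderHead(nn.Module):
    """Low-rank head in the spirit of LoGAH: predict factors A (d x r) and B (r x d)
    and form the tile as A @ B. The projection has h * 2 * d * r parameters,
    i.e. O(d^2) for h ~ d and a fixed small rank r."""
    def __init__(self, h: int, d: int, r: int):
        super().__init__()
        self.proj = nn.Linear(h, 2 * d * r)
        self.d, self.r = d, r

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        a, b = self.proj(z).split(self.d * self.r, dim=-1)
        A = a.view(-1, self.d, self.r)
        B = b.view(-1, self.r, self.d)
        return A @ B  # (batch, d, d) predicted weight tile

h, d, r = 256, 1024, 32  # hypothetical sizes for illustration
count = lambda m: sum(p.numel() for p in m.parameters())
print(f"full-rank head: {count(FullRankDecoderHead(h, d)):,} parameters")    # ~269M
print(f"low-rank head:  {count(LowRankDecoderHead(h, d, r)):,} parameters")  # ~16.8M
```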

This significant reduction in parameters enables LoGAH to generalize hypernetwork capabilities across model scales, from tiny to large architectures, while offering promising results for both vision and language tasks.

Figure 1: Comparison of parameter counts between GHN-3 and LoGAH. GHN-3 requires a larger hidden size to support wider networks, which sharply increases its parameter count.

Experimental Results

Vision and Language Models

LoGAH demonstrates robust performance in initializing large models, outperforming random and existing hypernetwork initializations. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that models initialized with LoGAH perform better than those initialized randomly or with GHN-3.

Figure 2: ViT transfer learning experiments. LoGAH trained on CIFAR-10 (resp. CIFAR-100) predicts the ViT's parameters, and the ViT is then trained on CIFAR-100 (resp. ImageNet). T, S, B, and L denote the Tiny, Small, Base, and Large versions of LoGAH, respectively.

The experiments extend to transfer learning, showing that LoGAH can predict parameters that adapt well to harder datasets and yield improved performance when moving across data distributions, including datasets significantly larger than the one LoGAH was trained on. These results suggest that LoGAH's approach to parameter prediction enables scalable learning with reduced pretraining costs.
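
The workflow implied here is to load the predicted parameters as an initialization and then fine-tune on the target task. The sketch below (illustrative assumptions throughout; a random stand-in replaces LoGAH's real predictions, which would come from the released code) shows that pattern with a standard torchvision ViT:

```python
import torch
from torchvision.models import vit_b_16

# Build the target architecture, e.g. for a CIFAR-100-sized classification task.
model = vit_b_16(weights=None, num_classes=100)

# A hypernetwork such as LoGAH would emit a state dict of predicted parameters for
# this architecture; here a random stand-in keeps the sketch self-contained.
predicted_state = {
    k: torch.randn_like(v) * 0.02 if v.is_floating_point() else v
    for k, v in model.state_dict().items()
}
model.load_state_dict(predicted_state)

# Fine-tune as usual; the paper's claim is that such a predicted initialization
# converges better than a random one.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
```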

Parameter Diversity

The paper also highlights the diversity of the parameters predicted by LoGAH, emphasizing its ability to produce varied, rather than heavily repeated, initial weights that help models converge on downstream tasks. This diversity is particularly notable in challenging tasks where traditional random initialization may fall short.
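
One simple way to see this effect (an illustrative check with assumed sizes, not the paper's own metric) is to compare the block-wise similarity of a weight matrix built by tiling one small chunk against one formed as a low-rank product:

```python
import torch
import torch.nn.functional as F

def mean_block_similarity(W: torch.Tensor, d: int) -> float:
    """Average cosine similarity between the d x d blocks of a square weight matrix."""
    blocks = W.unfold(0, d, d).unfold(1, d, d).reshape(-1, d * d)
    sim = F.normalize(blocks, dim=1) @ F.normalize(blocks, dim=1).T
    n = sim.shape[0]
    return ((sim.sum() - n) / (n * (n - 1))).item()  # mean of off-diagonal entries

d, D, r = 64, 256, 8  # hypothetical block size, layer width, and rank
tiled = torch.randn(d, d).repeat(D // d, D // d)  # chunk-copied weight
low_rank = torch.randn(D, r) @ torch.randn(r, D)  # low-rank predicted weight
print(mean_block_similarity(tiled, d))     # ~1.0: every block is identical
print(mean_block_similarity(low_rank, d))  # near 0: blocks differ from each other
```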

Theoretical and Practical Implications

LoGAH's contributions lie in offering a scalable solution for initializing very large models efficiently, with practical implications for reducing computational expenditure and speeding up pretraining in both vision and language models. The model supports wider networks without the cubic growth in hypernetwork parameters seen in earlier GHNs, suggesting that model scale can be increased without a proportional increase in the cost of parameter prediction.

Conclusion

LoGAH presents a significant advancement in hypernetwork-based initialization, facilitating efficient and scalable pretraining of large models like ViTs and GPTs. By reducing the parameters required for predicting Transformer models, the approach opens new avenues for research into hypernetwork applications and large-scale model adaptations, expediting the deployment of sophisticated AI systems with more accessible resources.
