Scavenging Hyena: Distilling Transformers into Long Convolution Models (2401.17574v1)
Abstract: The rapid evolution of large language models (LLMs), epitomized by architectures like GPT-4, has reshaped the landscape of natural language processing. This paper introduces a pioneering approach to the efficiency concerns of LLM pre-training, proposing knowledge distillation for cross-architecture transfer. Leveraging insights from the efficient Hyena mechanism, our method replaces attention heads in transformer models with Hyena operators, offering a cost-effective alternative to traditional pre-training while addressing the difficulty that quadratic attention mechanisms have with long contextual information. Unlike conventional compression-focused methods, our technique not only improves inference speed but also surpasses pre-training in both accuracy and efficiency. In the era of evolving LLMs, our work contributes to the pursuit of sustainable AI solutions, striking a balance between computational power and environmental impact.
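The abstract describes two ideas working together: swapping quadratic self-attention for a Hyena-style long-convolution operator, and transferring knowledge from the original attention-based teacher into the modified model via distillation. The sketch below illustrates that combination under stated assumptions: a PyTorch-style setup, a simplified gated FFT long convolution (`LongConvMixer`) standing in for a full Hyena operator, and the standard soft-target distillation loss of Hinton et al. (2015). The module name, hyperparameters, and loss weighting are illustrative, not taken from the paper.

```python
# Minimal sketch (assumption: PyTorch-style setup). LongConvMixer is a simplified
# gated long-convolution stand-in for Hyena; distillation_loss is the standard
# soft-target KD objective. Names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongConvMixer(nn.Module):
    """Drop-in token mixer: a gated long convolution computed via FFT,
    replacing a quadratic self-attention head with an O(L log L) operator."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)          # value and gate paths
        self.kernel = nn.Parameter(torch.randn(d_model, max_len) * 0.02)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        L = x.size(1)
        v, gate = self.in_proj(x).chunk(2, dim=-1)
        # Causal convolution with a learned length-L kernel, evaluated with FFTs.
        n = 2 * L
        k_f = torch.fft.rfft(self.kernel[:, :L], n=n)             # (d_model, n//2+1)
        v_f = torch.fft.rfft(v.transpose(1, 2), n=n)              # (batch, d_model, n//2+1)
        y = torch.fft.irfft(v_f * k_f, n=n)[..., :L].transpose(1, 2)
        return self.out_proj(torch.sigmoid(gate) * y)


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KD loss: KL divergence to the frozen attention teacher
    plus the usual cross-entropy on the ground-truth next tokens."""
    V = student_logits.size(-1)
    soft = F.kl_div(
        F.log_softmax(student_logits.view(-1, V) / T, dim=-1),
        F.softmax(teacher_logits.view(-1, V) / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits.view(-1, V), labels.view(-1))
    return alpha * soft + (1.0 - alpha) * hard
```

In this reading of the method, each attention module in the pretrained transformer would be replaced by such an operator and the resulting student trained against the frozen teacher's logits; the initialization scheme and loss weighting actually used in the paper may differ.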
Authors:
- Tokiniaina Raharison Ralambomihanta
- Shahrad Mohammadzadeh
- Mohammad Sami Nur Islam
- Wassim Jabbour
- Laurence Liang