
Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers (2404.02684v1)

Published 3 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Recently, multiple architectures have been proposed to improve the efficiency of Transformer LLMs by changing the design of the self-attention block to achieve linear-cost inference (LCI). A notable approach in this realm is the State-Space Model (SSM) architecture, which has shown on-par performance with self-attention Transformers on language modeling tasks. However, such an architectural change requires a full pretraining of the weights from scratch, which incurs a huge cost for researchers and practitioners who want to use the new architectures. In the more traditional linear-attention literature, it has been proposed to approximate full attention with linear attention via a swap-and-finetune framework. Motivated by this approach, we propose Cross-Architecture Transfer Learning (XATL), in which the weights of the components shared between LCI and self-attention-based Transformers, such as layer norms, MLPs, and input/output embeddings, are transferred directly to the new architecture from already pre-trained model parameters. We evaluated the efficacy of the method on varying model sizes and alternative attention architectures and show that XATL reduces training time by up to 2.5x and converges to a better minimum, yielding up to 2.6% stronger models on LM benchmarks within the same compute budget.
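
The core idea described in the abstract, initializing the architecture-agnostic components (embeddings, layer norms, MLPs) from a pretrained self-attention Transformer while the new token-mixing block starts from scratch, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the parameter-name filter ("attn", "attention", "mixer") is a hypothetical naming convention, and the sketch assumes the two models use matching names and shapes for their shared components.

```python
import torch.nn as nn

def transfer_shared_weights(pretrained: nn.Module, lci_model: nn.Module) -> nn.Module:
    """Copy weights for components shared between a self-attention Transformer
    and a linear-cost-inference (LCI) model: embeddings, layer norms, and MLPs.
    The replaced attention / token-mixing block keeps its fresh initialization
    and is trained from scratch."""
    src = pretrained.state_dict()
    dst = lci_model.state_dict()

    # Keep only parameters whose names and shapes match in both models and
    # that do not belong to the swapped-out attention block.
    skip_keywords = ("attn", "attention", "mixer")  # hypothetical naming convention
    transferable = {
        name: tensor
        for name, tensor in src.items()
        if name in dst
        and tensor.shape == dst[name].shape
        and not any(kw in name for kw in skip_keywords)
    }

    dst.update(transferable)
    lci_model.load_state_dict(dst)
    print(f"Transferred {len(transferable)}/{len(dst)} parameter tensors.")
    return lci_model
```

Under this sketch, the LCI model is then trained as usual; only the transferred components start from the pretrained solution, which is what lets training converge faster within the same compute budget.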
