
Dynamic Layer Tying for Parameter-Efficient Transformers

Published 23 Jan 2024 in cs.LG and cs.AI | (2401.12819v1)

Abstract: In the pursuit of reducing the number of trainable parameters in deep transformer networks, we employ Reinforcement Learning to dynamically select layers during training and tie them together. Every few iterations, the RL agent is asked whether to train each layer $i$ independently or to copy the weights of a previous layer $j<i$. This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique. Experimental evaluations validate that our model modestly outperforms the baseline transformer model with regard to perplexity and drastically reduces the number of trainable parameters. In particular, the memory consumption during training is up to one order of magnitude lower than that of the conventional training method.
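The tying decision described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the RL agent is replaced by a hard-coded assignment, and the names `resolve_ties`, `count_trainable`, and `params_per_layer` are hypothetical. It also assumes that copying from a layer that is itself tied resolves to that layer's original source, which the abstract does not specify.

```python
from typing import List


def resolve_ties(ties: List[int]) -> List[int]:
    """Return, for each layer i, the index of the layer whose weights it uses.

    ties[i] == i means layer i is trained independently.
    ties[i] == j with j < i means layer i copies the weights of layer j.
    Assumption (not from the abstract): ties are followed transitively,
    so copying a tied layer shares that layer's source weights.
    """
    roots: List[int] = []
    for i, j in enumerate(ties):
        if not 0 <= j <= i:
            raise ValueError(f"layer {i} may only tie to itself or an earlier layer, got {j}")
        roots.append(i if j == i else roots[j])
    return roots


def count_trainable(ties: List[int], params_per_layer: int) -> int:
    """Count trainable parameters when tied layers share one weight set."""
    return len(set(resolve_ties(ties))) * params_per_layer


# Hard-coded stand-in for the agent's decisions on a 6-layer network.
ties = [0, 0, 2, 0, 2, 1]
roots = resolve_ties(ties)
tied_params = count_trainable(ties, params_per_layer=1000)
untied_params = count_trainable(list(range(len(ties))), params_per_layer=1000)
print(roots, tied_params, untied_params)
```

In this toy assignment only layers 0 and 2 hold independent weights, so the trainable parameter count drops to a third of the untied network; the actual savings reported in the paper depend on the decisions the agent learns.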

Authors (2)