Sparse Universal Transformer (2310.07096v1)
Abstract: The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers. Empirical evidence shows that UTs generalize compositionally better than Vanilla Transformers (VTs) on formal-language tasks, and parameter sharing also gives them better parameter efficiency. Despite these advantages, scaling up UT parameters is much more compute- and memory-intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism to reduce the UT's computational cost while retaining its parameter efficiency and generalization ability. Experiments show that the SUT matches strong baseline models on WMT'14 while using only half the computation and parameters, and achieves strong generalization results on formal-language tasks (logical inference and CFQ). The new halting mechanism also enables roughly a 50% reduction in computation during inference with very little loss in performance on formal-language tasks.
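The abstract names two mechanisms: an SMoE feed-forward block and a stick-breaking dynamic halting scheme over a shared layer. The sketch below is a minimal PyTorch illustration of how such pieces could fit together in a Universal-Transformer-style block, assuming standard top-k SMoE routing and an expectation-over-halting-steps readout; it is not the paper's implementation, and all module names, dimensions, and hyperparameters (`MoEFeedForward`, `StickBreakingHaltingUT`, `d_model=64`, etc.) are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of the two ideas named in the abstract:
# a sparsely gated mixture-of-experts FFN and stick-breaking dynamic halting
# over a single shared (Universal-Transformer-style) layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Top-k gated mixture of expert FFNs (standard SMoE routing, assumed here)."""

    def __init__(self, d_model=64, d_ff=128, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (batch, seq, d_model)
        scores = self.gate(x)                      # (batch, seq, n_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)          # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Dense loop for clarity; a real SMoE dispatches only routed tokens to each expert.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = (topi[..., slot] == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out


class StickBreakingHaltingUT(nn.Module):
    """Applies one shared layer up to `max_steps` times. At each step a sigmoid
    'breaks off' a fraction of the remaining halting probability (stick-breaking),
    and the output is the expectation of per-step states under that distribution."""

    def __init__(self, d_model=64, max_steps=6):
        super().__init__()
        self.max_steps = max_steps
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.ffn = MoEFeedForward(d_model)
        self.halt = nn.Linear(d_model, 1)

    def forward(self, x):                          # x: (batch, seq, d_model)
        remaining = torch.ones(*x.shape[:2], 1, device=x.device)  # unbroken stick
        expected = torch.zeros_like(x)
        for step in range(self.max_steps):
            x = x + self.attn(x, x, x)[0]          # shared self-attention sublayer
            x = x + self.ffn(x)                    # shared SMoE feed-forward sublayer
            alpha = torch.sigmoid(self.halt(x))    # fraction of the stick broken off now
            # Force the last step to absorb all leftover probability mass.
            p_halt = alpha * remaining if step < self.max_steps - 1 else remaining
            expected = expected + p_halt * x
            remaining = remaining * (1.0 - alpha)
        return expected
```

For example, `StickBreakingHaltingUT()(torch.randn(2, 10, 64))` returns a `(2, 10, 64)` tensor whose per-token states are averaged over recurrence steps according to the stick-breaking halting weights; at inference, one could instead stop updating a token once its remaining mass falls below a threshold, which is the kind of compute saving the abstract describes.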
Authors: Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan