TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing (2312.05605v1)
Abstract: MEGA is a recent transformer-based architecture that utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(L \log L)$, with $L$ being the sequence length. We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits a larger receptive field with shallower networks and reduces the computational complexity to $O(L)$. The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, and a synthetic reasoning benchmark, associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with a $1.37\times$/$1.24\times$ faster forward/backward pass during training. The dilated convolutions used in TCNCA are consistently and significantly faster than the FFT-based parallelized recurrence on GPUs, making them a scalable candidate for handling very long sequences: they are up to $7.07\times$/$2.86\times$ faster in the forward/backward pass for sequences up to 131k. On LRA, TCNCA achieves, on average, a $1.28\times$ speed-up during inference with accuracy similar to MEGA's. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior or competitive to MEGA across a range of sequence lengths and vocabulary sizes.
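To make the complexity claim concrete, the sketch below shows a generic stack of dilated causal 1D convolutions with dilation doubling per layer, the building block underlying temporal convolutional networks. It is a minimal illustration, not the authors' released TCNCA implementation (which additionally uses chunked attention and gating); all layer sizes and hyperparameters here are illustrative assumptions. Each layer costs $O(L)$ in the sequence length $L$, while the receptive field grows as $1 + (k-1)(2^{\text{depth}} - 1)$ for kernel size $k$, so a shallow network already covers a long context.

```python
# Minimal sketch of a dilated causal convolution stack (illustrative, not the
# authors' TCNCA block). Dilation doubles per layer, so the receptive field
# grows exponentially with depth while each layer stays O(L) in sequence length.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedCausalConvStack(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, depth: int = 6):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(depth):
            dilation = 2 ** i  # 1, 2, 4, ... : dilation doubles with depth
            # Left-padding of (k - 1) * dilation keeps the convolution causal.
            self.pads.append((kernel_size - 1) * dilation)
            self.layers.append(
                nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length)
        for pad, conv in zip(self.pads, self.layers):
            x = conv(F.pad(x, (pad, 0)))  # pad only on the left (causal)
        return x


if __name__ == "__main__":
    stack = DilatedCausalConvStack(channels=64, kernel_size=3, depth=6)
    y = stack(torch.randn(2, 64, 1024))   # O(L) work per layer
    print(y.shape)                        # torch.Size([2, 64, 1024])
    print(1 + (3 - 1) * (2 ** 6 - 1))     # receptive field: 127 time steps
```

With kernel size 3 and depth 6, the stack sees 127 past time steps; doubling the depth would extend this to roughly 8k steps at only twice the per-token cost, which is the scaling behavior the abstract contrasts with the $O(L \log L)$ FFT-based recurrence.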
- Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
- An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR), 2021.
- Transformers for modeling physical systems. Neural Networks, 146:272–289, 2022.
- Daria Grechishnikova. Transformer neural network for protein-specific de novo drug generation as a machine translation problem. Scientific reports, 11(1):321, 2021.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020.
- Efficient transformers: A survey. ACM Computing Surveys, 55:1–28, 2020.
- Long range arena: A benchmark for efficient transformers. In International Conference on Learning Representations (ICLR), 2021.
- Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations (ICLR), 2022.
- Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning (ICML), 2023.
- Simplified state space layers for sequence modeling. In International Conference on Learning Representations (ICLR), 2023.
- Hungry Hungry Hippos: Towards language modeling with state space models. In International Conference on Learning Representations (ICLR), 2023.
- Mega: Moving average equipped gated attention. In International Conference on Learning Representations (ICLR), 2023.
- Damped trend exponential smoothing: A modelling viewpoint. International Journal of Forecasting, 26(4):661–665, 2010.
- Marcus Hutter. The human knowledge compression contest, 2006.
- Parallelizing Legendre memory unit training. In International Conference on Machine Learning (ICML), pages 1898–1907, 2021.
- An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
- Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.
- WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- EEG-TCNet: An accurate temporal convolutional network for embedded motor-imagery brain–machine interfaces. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2958–2965, 2020.
- Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 2978–2988, 2019.
- On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems (NeurIPS), 35:35971–35983, 2022.
- What makes convolutional models great on long sequence modeling? In International Conference on Learning Representations (ICLR), 2023.
- Using fast weights to attend to the recent past. Advances in Neural Information Processing Systems (NeurIPS), 29, 2016.
- HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems (NeurIPS), 33:1474–1487, 2020.
- Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947, 2022.
- In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
- Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning (ICML), 2023.
- Simple hardware-efficient long convolutions for sequence modeling. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023.
- Classification of long sequential data using circular dilated convolutional neural networks. Neurocomputing, 2022.
- Temporal convolutional attention-based network for sequence modeling. arXiv preprint arXiv:2002.12530, 2020.
- Video2Mesh: 3D human pose and shape recovery by a temporal convolutional transformer network. IET Computer Vision, 17, 2023.
- Parallelizing linear recurrent neural nets over sequence length. International Conference on Learning Representations (ICLR), 2018.
- Training very deep networks. Advances in Neural Information Processing Systems (NeurIPS), 28, 2015.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.