
TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing (2312.05605v1)

Published 9 Dec 2023 in cs.LG and cs.CV

Abstract: MEGA is a recent transformer-based architecture which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(L \log L)$, with $L$ being the sequence length. We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits a larger receptive field with shallower networks, and reduces the computational complexity to $O(L)$. The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, Long Range Arena (LRA) sequence classification, as well as a synthetic reasoning benchmark, associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with a $1.37\times$/$1.24\times$ faster forward/backward pass during training. The dilated convolutions used in TCNCA are consistently and significantly faster than the FFT-based parallelized recurrence on GPUs, making them a scalable candidate for handling very long sequences: they are up to $7.07\times$/$2.86\times$ faster in the forward/backward pass for sequences of up to 131k elements. Further, on LRA, TCNCA achieves, on average, a $1.28\times$ speed-up during inference with accuracy similar to MEGA's. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior or competitive to MEGA across a range of sequence lengths and vocabulary sizes.
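
As a rough illustration of the two ingredients named in the abstract, the sketch below pairs a causal dilated temporal convolution stack (linear-time per layer, with a receptive field that grows exponentially with depth) with attention restricted to fixed-size chunks. This is a minimal sketch, not the authors' TCNCA implementation: the class names, layer sizes, residual wiring, and the use of plain multi-head attention inside each chunk are illustrative assumptions.

```python
# Illustrative sketch of (1) a causal dilated temporal convolution stack and
# (2) attention applied within fixed-size chunks so its cost stays linear in
# the sequence length. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class CausalDilatedConv(nn.Module):
    """One dilated Conv1d, left-padded so the output at step t sees only steps <= t."""

    def __init__(self, dim: int, kernel_size: int, dilation: int):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(dim, dim, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, length); pad on the left only, then convolve.
        return self.conv(nn.functional.pad(x, (self.pad, 0)))


class ChunkedSelfAttention(nn.Module):
    """Standard multi-head attention restricted to non-overlapping chunks."""

    def __init__(self, dim: int, num_heads: int, chunk_size: int):
        super().__init__()
        self.chunk_size = chunk_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim); length assumed divisible by chunk_size here.
        b, l, d = x.shape
        x = x.reshape(b * l // self.chunk_size, self.chunk_size, d)
        out, _ = self.attn(x, x, x, need_weights=False)
        return out.reshape(b, l, d)


class TCNCABlockSketch(nn.Module):
    """Dilated-conv front end followed by chunked attention (illustrative only)."""

    def __init__(self, dim=64, kernel_size=3, depth=4, heads=4, chunk_size=128):
        super().__init__()
        self.convs = nn.ModuleList(
            [CausalDilatedConv(dim, kernel_size, dilation=2**i) for i in range(depth)]
        )
        self.attn = ChunkedSelfAttention(dim, heads, chunk_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim)
        h = x.transpose(1, 2)             # -> (batch, dim, length) for Conv1d
        for conv in self.convs:
            h = torch.relu(conv(h))       # receptive field roughly doubles per layer
        h = h.transpose(1, 2)             # back to (batch, length, dim)
        return x + self.attn(h)           # residual connection (assumption)
```

With kernel size $k$ and dilations $1, 2, 4, \dots$, each convolution layer costs $O(Lk)$ while the receptive field grows exponentially with depth, which is the abstract's point about reaching large receptive fields with shallow, $O(L)$ networks; the chunked attention confines the quadratic attention cost to chunk-sized windows.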
