Continual Transformers: Redundancy-Free Attention for Online Inference (2201.06268v3)

Published 17 Jan 2022 in cs.AI and cs.CV

Abstract: Transformers in their common form are inherently limited to operating on whole token sequences rather than on one token at a time. Consequently, their use during online inference on time-series data entails considerable redundancy due to the overlap in successive token sequences. In this work, we propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference on a continual input stream. Importantly, our modifications are purely to the order of computations, while the outputs and learned weights are identical to those of the original Transformer Encoder. We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable results: our Continual one- and two-block architectures reduce the floating point operations per prediction by up to 63x and 2.6x, respectively, while retaining predictive performance.
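
To make the core idea concrete, below is a minimal sketch of single-output attention over a sliding window of cached keys and values: each arriving token is projected once, appended to the cache, and attended against the window, so the per-step cost is linear in the window length rather than quadratic. The class and parameter names are assumptions for illustration only; the paper's actual redundancy-free formulations go further, e.g. by reusing partial softmax sums between steps, and this sketch should not be read as the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class ContinualSingleOutputAttention:
    """Illustrative sliding-window, single-output attention.
    Hypothetical structure; not the paper's exact algorithm."""

    def __init__(self, d_model, window):
        self.d = d_model
        self.n = window
        rng = np.random.default_rng(0)
        # Hypothetical projection weights; a trained model would load these.
        self.Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.K = np.zeros((window, d_model))  # circular key cache
        self.V = np.zeros((window, d_model))  # circular value cache
        self.idx = 0
        self.count = 0

    def step(self, x):
        # Project only the newest token; older keys/values are reused
        # from the cache instead of being recomputed each step.
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        self.K[self.idx] = k
        self.V[self.idx] = v
        self.idx = (self.idx + 1) % self.n
        self.count = min(self.count + 1, self.n)
        K, V = self.K[:self.count], self.V[:self.count]
        # One query row against the cached window: O(n*d) per step,
        # versus O(n^2 * d) for recomputing attention over the window.
        attn = softmax(q @ K.T / np.sqrt(self.d))
        return attn @ V

att = ContinualSingleOutputAttention(d_model=64, window=8)
for t in range(10):
    out = att.step(np.random.default_rng(t).standard_normal(64))
print(out.shape)  # (64,)
```

Because attention over the cached window with the newest query reproduces the last row of full attention on the same window, caching keys and values changes only the order of computation, not the output, which mirrors the abstract's claim that outputs and learned weights remain identical.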
