LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory (2404.11163v2)

Published 17 Apr 2024 in cs.LG

Abstract: Transformer models have been successful in various sequence processing tasks, but the computational cost of self-attention limits their practicality for long sequences. Existing attention variants improve computational efficiency, yet their hand-crafted mixing strategies limit how effectively they can abstract global information. State-space models (SSMs), on the other hand, are tailored for long sequences but cannot capture complicated local information. Combining the two as a unified token mixer has therefore become a trend in recent long-sequence models. However, linearized attention degrades performance significantly even when equipped with SSMs. To address this issue, we propose LongVQ, which uses vector quantization (VQ) to compress the global abstraction into a length-fixed codebook, enabling linear-time computation of the attention matrix. This technique maintains dynamic global and local patterns, which helps compensate for the lack of long-range dependencies. Experiments on the Long Range Arena benchmark, autoregressive language modeling, and image and speech classification demonstrate the effectiveness of LongVQ, which achieves significant improvements over other sequence models, including variants of Transformers, convolutions, and recent state-space models.
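To make the linear-time idea concrete, below is a minimal, non-authoritative sketch (in PyTorch) of attention computed against a vector-quantized key memory. The function name vq_linear_attention, the single-head non-causal setting, and the count-weighted softmax are illustrative assumptions, not the authors' LongVQ implementation; the sketch only shows how snapping keys to a length-fixed codebook of C entries shrinks the attention score matrix from L x L to L x C.

```python
import torch
import torch.nn.functional as F

def vq_linear_attention(q, k, v, codebook):
    """Sketch of attention against a vector-quantized key memory.

    q, k, v:  (L, d) single-head queries, keys, values.
    codebook: (C, d) length-fixed codebook with C << L (assumed given here;
              in practice it would be learned, e.g. with a VQ-VAE-style loss).

    Cost is O(L * C * d) instead of the O(L^2 * d) of full attention.
    """
    L, d = k.shape
    C = codebook.shape[0]

    # 1. Quantize each key to its nearest codebook vector.
    assign = torch.cdist(k, codebook).argmin(dim=-1)                  # (L,)

    # 2. Build a fixed-size memory: per-code key counts and mean values.
    ones = torch.ones(L, dtype=v.dtype, device=v.device)
    counts = torch.zeros(C, dtype=v.dtype, device=v.device)
    counts.index_add_(0, assign, ones)                                # (C,)
    v_sum = torch.zeros(C, d, dtype=v.dtype, device=v.device)
    v_sum.index_add_(0, assign, v)                                    # (C, d)
    v_mem = v_sum / counts.clamp(min=1).unsqueeze(-1)                 # (C, d) mean value per code

    # 3. Attend over the C codes. Adding log(count) inside the softmax makes
    #    this equal to full softmax attention against the quantized keys;
    #    codes with no assigned keys are masked out with -inf.
    scores = (q @ codebook.t()) / d ** 0.5                            # (L, C)
    log_counts = torch.where(counts > 0,
                             counts.clamp(min=1).log(),
                             torch.full_like(counts, float("-inf")))
    weights = F.softmax(scores + log_counts, dim=-1)                  # (L, C)
    return weights @ v_mem                                            # (L, d)

# Tiny usage example: sequence length 1024, head dim 32, 64 codes.
q = torch.randn(1024, 32); k = torch.randn(1024, 32); v = torch.randn(1024, 32)
codebook = torch.randn(64, 32)
out = vq_linear_attention(q, k, v, codebook)  # -> shape (1024, 32)
```

With L = 4096 and C = 256, for instance, the score matrix is 4096 x 256 rather than 4096 x 4096. Weighting each code by the log of how many keys it represents keeps the result equivalent to full softmax attention over the quantized keys; the causal variant needed for autoregressive language modeling is not shown in this sketch.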

Authors (6)
  1. Zicheng Liu (153 papers)
  2. Li Wang (470 papers)
  3. Siyuan Li (140 papers)
  4. Zedong Wang (15 papers)
  5. Haitao Lin (63 papers)
  6. Stan Z. Li (222 papers)
Citations (1)