Zoology: Measuring and Improving Recall in Efficient Language Models (2312.04927v1)

Published 8 Dec 2023 in cs.CL and cs.LG

Abstract: Attention-free LLMs that combine gating and convolutions are growing in popularity due to their efficiency and increasingly competitive performance. To better understand these architectures, we pretrain a suite of 17 attention and "gated-convolution" LLMs, finding that SoTA gated-convolution architectures still underperform attention by up to 2.1 perplexity points on the Pile. In fine-grained analysis, we find 82% of the gap is explained by each model's ability to recall information that is previously mentioned in-context, e.g. "Hakuna Matata means no worries Hakuna Matata it means no" $\rightarrow$ "??". On this task, termed "associative recall", we find that attention outperforms gated-convolutions by a large margin: a 70M parameter attention model outperforms a 1.4 billion parameter gated-convolution model on associative recall. This is surprising because prior work shows gated convolutions can perfectly solve synthetic tests for AR capability. To close the gap between synthetics and real language, we develop a new formalization of the task called multi-query associative recall (MQAR) that better reflects actual language. We perform an empirical and theoretical study of MQAR that elucidates differences in the parameter-efficiency of attention and gated-convolution recall. Informed by our analysis, we evaluate simple convolution-attention hybrids and show that hybrids with input-dependent sparse attention patterns can close 97.4% of the gap to attention, while maintaining sub-quadratic scaling. Our code is accessible at: https://github.com/HazyResearch/zoology.

Efficient LLMs: Exploring Gated Convolutions and Associative Recall

The paper, titled "Zoology: Measuring and Improving Recall in Efficient Language Models," presents an in-depth analysis of attention-free LLMs that combine gating and convolutions. It aims to understand their performance relative to traditional attention-based models, especially with respect to associative recall (AR).

Overview

The authors pretrain a suite of 17 attention and gated-convolution LLMs across various scales and architectures and compare their performance. Key findings show that state-of-the-art gated-convolution architectures underperform attention-based models by up to 2.1 perplexity points on the Pile dataset. Notably, 82% of this gap is attributed to differences in each model's ability to perform associative recall.

Associative recall, pivotal in language modeling, involves recalling information previously mentioned within the context. The paper highlights that a 70M-parameter attention model surpasses a 1.4B-parameter gated-convolution model in AR capability.

Associative Recall and Multi-Query Tasks

The paper introduces a novel task, multi-query associative recall (MQAR), which better reflects the challenges gated convolutions face in real language. MQAR requires models to perform multiple recalls at varying positions within a sequence, highlighting the gap between input-dependent and input-independent sequence mixing.
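
To make the task concrete, below is a minimal sketch of how an MQAR-style example could be generated. The helper name, toy vocabulary, and formatting are illustrative assumptions rather than the paper's exact data pipeline; the official generators live in the linked repository.

```python
# Minimal, assumed sketch of an MQAR-style example: a context of key-value
# pairs followed by several queries at varying positions, each of which must
# be answered with the value that followed that key earlier in the sequence.
import random

def make_mqar_example(num_pairs=4, num_queries=3, seed=0):
    rng = random.Random(seed)
    keys = [f"k{i}" for i in range(num_pairs)]
    values = [f"v{rng.randrange(100)}" for _ in range(num_pairs)]
    kv = dict(zip(keys, values))

    # Context: interleaved key-value pairs, e.g. "k0 v17 k1 v3 ..."
    sequence = []
    for key in keys:
        sequence += [key, kv[key]]

    # Queries: repeated keys placed later in the sequence; the target at each
    # query position is the value originally paired with that key.
    targets = []
    for key in rng.sample(keys, num_queries):
        sequence.append(key)
        targets.append(kv[key])
    return sequence, targets

seq, tgt = make_mqar_example()
print(" ".join(seq), "->", tgt)
```

Unlike the single-query synthetic tests used in prior work, the queries here land at multiple, varying positions, which is the property that separates attention from gated convolutions in the paper's analysis.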

Implications of the Study

The empirical and theoretical analyses indicate that gated-convolution models require a model dimension that scales with sequence length to solve MQAR, whereas attention handles it with a dimension independent of sequence length, showcasing its superior parameter efficiency on this task. To bridge this gap, the authors explore hybrid models that blend convolutional and attention mechanisms; these hybrids close 97.4% of the gap to attention models while maintaining sub-quadratic scaling.
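
For intuition about why the filters matter, a stripped-down gated-convolution mixer of the kind the paper analyzes might look like the sketch below. `GatedConvMixer` is a simplified stand-in assumed here for illustration, not a faithful implementation of any specific published architecture such as H3 or Hyena.

```python
# Simplified, assumed sketch of a gated-convolution sequence mixer: gating plus
# a depthwise causal convolution whose filter is learned but input-independent,
# the property the paper ties to weaker associative recall.
import torch
import torch.nn as nn

class GatedConvMixer(nn.Module):
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              groups=d_model, padding=kernel_size - 1)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        # Causal depthwise convolution over the sequence dimension.
        u = self.conv(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        return self.out_proj(u * torch.sigmoid(gate))  # multiplicative gating
```

The key point is that the convolution filter does not depend on the current input, whereas attention scores do; the paper's theory translates this difference into the dimension-versus-sequence-length trade-off described above.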

Practical and Theoretical Contributions

From a practical standpoint, the research suggests architectural modifications to existing gated-convolution models. By incorporating input-dependent sparse attention patterns, such modifications can achieve near-parity with attention-based models, significantly improving AR performance while remaining computationally efficient.
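
As a hedged illustration of what an input-dependent sparse attention pattern could look like (not the paper's exact mechanism), each query might attend only to its top-k highest-scoring earlier positions:

```python
# Illustrative, assumed input-dependent sparse attention: keep only the top-k
# scores per query before the softmax. This dense reference version still
# materializes the full score matrix; an efficient variant would avoid that.
import torch
import torch.nn.functional as F

def topk_causal_attention(q, k, v, top_k=8):
    # q, k, v: (batch, seq_len, d_head)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))

    # Input-dependent sparsity: the kept positions depend on the content of q
    # and k, unlike a fixed sliding-window or convolution pattern.
    kth = scores.topk(min(top_k, scores.shape[-1]), dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Because the retained positions are chosen from the scores themselves, the pattern varies with the input, which is what lets a small number of attended tokens still support recall of earlier key-value bindings.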

Theoretically, the paper extends our understanding of LLM architectures, challenging the notion that attention alone is the superior approach. It presents a compelling argument for the integration of input-dependent computations to enhance associative recall capabilities in LLMs.

Future Directions

The exploration of gated-convolution models signals promising pathways for future research in AI. As the paper suggests, incorporating input-dependent sequence mixing could spur innovations in model architectures that balance efficiency and performance. Future work might extend to exploring other architecture classes and their interactions with associative recall tasks, potentially leading to groundbreaking advancements in efficient AI systems.

In conclusion, this paper makes significant strides in dissecting language modeling architectures, particularly focusing on the crucial task of associative recall. Its insights offer practical implications for model design and theoretical contributions to our understanding of sequence processing in AI systems.

Authors (8)
  1. Simran Arora (64 papers)
  2. Sabri Eyuboglu (13 papers)
  3. Aman Timalsina (6 papers)
  4. Isys Johnson (4 papers)
  5. Michael Poli (33 papers)
  6. James Zou (232 papers)
  7. Atri Rudra (55 papers)
  8. Christopher Ré (194 papers)
Citations (44)