The Buffer Mechanism for Multi-Step Information Reasoning in Language Models (2405.15302v2)

Published 24 May 2024 in cs.AI, cs.CL, and cs.LG

Abstract: LLMs have consistently struggled with complex reasoning tasks, such as mathematical problem-solving. Investigating the internal reasoning mechanisms of these models can help us design better model architectures and training strategies, ultimately enhancing their reasoning capability. In this study, we constructed a symbolic dataset to investigate how Transformer models achieve multi-step reasoning through a vertical thinking strategy, based on their inherent structure, and a horizontal thinking strategy, based on Chain of Thought. We introduced the concept of a buffer mechanism: the model stores distinct pieces of information in separate buffers and selectively extracts them through the query-key matrix. We proposed a random-matrix-based algorithm to enhance the model's reasoning ability, reducing by 75% the training time required for GPT-2 to achieve generalization on the PrOntoQA dataset. These findings provide new insights into the mechanisms of LLMs.
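The abstract's central claim is that stored information sits in distinct buffers and that the query-key matrix of an attention head selects which buffer to read. The toy computation below is a minimal sketch of that idea, not the paper's construction: the buffer matrices, dimensions, variable names, and the use of random orthogonal maps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256        # residual-stream width (illustrative)
n_keys = 8     # number of candidate key positions (illustrative)

def random_orthogonal(d, rng):
    # A random orthogonal map serves as a toy "buffer": writing through it keeps
    # the stored vector recoverable, and contents written through two independent
    # buffers interfere only at the O(1/sqrt(d)) level for random inputs.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

B_query, B_item, B_other = (random_orthogonal(d, rng) for _ in range(3))

# The item we are looking for, stored at the query position via B_query.
u = rng.standard_normal(d)
query_state = B_query @ u

# Each key position stores an item via B_item plus unrelated content via B_other;
# the position `target` stores the matching item u.
target = 3
items = rng.standard_normal((n_keys, d))
items[target] = u
key_states = items @ B_item.T + rng.standard_normal((n_keys, d)) @ B_other.T

# A query-key matrix aligned with the (query-buffer, item-buffer) pair reads the
# item buffer and ignores the content held in B_other.
W_qk = B_query @ B_item.T

scores = query_state @ W_qk @ key_states.T / np.sqrt(d)
attn = np.exp(scores - scores.max())
attn /= attn.sum()
print(np.round(attn, 3))   # nearly all attention mass falls on position `target`
```

With the aligned query-key matrix, essentially all attention mass lands on the position holding the matching item, while a matrix aligned to B_other would instead read out the unrelated content; this is the selective-extraction behavior the abstract describes. The paper's random-matrix training algorithm and the reported 75% reduction in training time on PrOntoQA are not reproduced by this sketch.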

Authors (10)
  1. Zhiwei Wang (223 papers)
  2. Yunji Wang (3 papers)
  3. Zhongwang Zhang (17 papers)
  4. Zhangchen Zhou (8 papers)
  5. Hui Jin (9 papers)
  6. Tianyang Hu (40 papers)
  7. Jiacheng Sun (49 papers)
  8. Zhenguo Li (195 papers)
  9. Yaoyu Zhang (43 papers)
  10. Zhi-Qin John Xu (66 papers)
Citations (4)

