On the Expressive Power of a Variant of the Looped Transformer (2402.13572v1)

Published 21 Feb 2024 in cs.LG, cs.AI, cs.NA, and math.NA

Abstract: Beyond natural language processing, transformers exhibit extraordinary performance in broader applications, including scientific computing and computer vision. Previous works attempt to explain this from the perspective of expressive power, showing that standard transformers are capable of performing some algorithms. To empower transformers with algorithmic capabilities, and motivated by the recently proposed looped transformer (Yang et al., 2024; Giannou et al., 2023), we design a novel transformer block, dubbed the Algorithm Transformer (abbreviated as AlgoFormer). Compared with the standard transformer and the vanilla looped transformer, the proposed AlgoFormer achieves significantly higher expressiveness in algorithm representation with the same number of parameters. In particular, inspired by the structure of human-designed learning algorithms, our transformer block consists of a pre-transformer responsible for task pre-processing, a looped transformer that carries out iterative optimization algorithms, and a post-transformer that produces the desired results after post-processing. We provide theoretical evidence of the expressive power of the AlgoFormer in solving some challenging problems, mirroring human-designed algorithms. Furthermore, we present theoretical and empirical results showing that the designed transformer has the potential to be smarter than human-designed algorithms. Experimental results demonstrate the empirical superiority of the proposed transformer: it outperforms the standard transformer and the vanilla looped transformer on some challenging tasks.
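To make the three-stage architecture concrete, here is a minimal PyTorch sketch of the block structure described in the abstract. The class name, layer widths, head count, and loop count are illustrative assumptions for exposition, not hyperparameters taken from the paper; the key structural point is that the middle block's weights are shared across loop iterations.

```python
import torch
import torch.nn as nn


class AlgoFormerSketch(nn.Module):
    """Sketch of the three-stage AlgoFormer block from the abstract:
    pre-transformer -> looped transformer (weights shared across
    iterations) -> post-transformer."""

    def __init__(self, d_model: int = 64, nhead: int = 4, num_loops: int = 10):
        super().__init__()

        def block() -> nn.TransformerEncoderLayer:
            return nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, batch_first=True
            )

        self.pre = block()    # task pre-processing
        self.loop = block()   # a single block, reused at every iteration
        self.post = block()   # post-processing into the final output
        self.num_loops = num_loops  # illustrative choice, not from the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        x = self.pre(x)
        # Re-applying the same block mimics the repeated update step of an
        # iterative optimization algorithm, as in the looped transformer.
        for _ in range(self.num_loops):
            x = self.loop(x)
        return self.post(x)


# Usage: a batch of 8 sequences of length 16.
y = AlgoFormerSketch()(torch.randn(8, 16, 64))
print(y.shape)  # torch.Size([8, 16, 64])
```

Because the looped block reuses one set of weights, the effective depth (number of iterations) can grow without growing the parameter count, which is consistent with the abstract's claim of higher expressiveness at the same number of parameters.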

References (43)
  1. Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=LziniAXEI9.
  2. What learning algorithm is in-context learning? Investigations with linear models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=0g0X4H8yN4I.
  3. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=liMSqUuVg9.
  4. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  5. Choose a transformer: Fourier or Galerkin. Advances in Neural Information Processing Systems, 34:24924–24940, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/d0921d442ee91b896ad95059d13df618-Abstract.html.
  6. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1:4171–4186, 2019. URL https://aclanthology.org/N19-1423.
  7. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
  8. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pp. 5793–5831. PMLR, 2022. URL https://proceedings.mlr.press/v162/edelman22a.html.
  9. Towards revealing the mystery behind chain of thought: A theoretical perspective. Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=qHrADgAdYu.
  10. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. PMLR, 2017. URL https://proceedings.mlr.press/v70/finn17a.html.
  11. Transformers learn higher-order optimization methods for in-context learning: A study with linear models. arXiv preprint arXiv:2310.17086, 2023.
  12. What can transformers learn in-context? A case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022. URL https://openreview.net/forum?id=flNZJ2eOet.
  13. Looped transformers as programmable computers. In International Conference on Machine Learning, volume 202, pp. 11398–11442. PMLR, 2023. URL https://proceedings.mlr.press/v202/giannou23a.html.
  14. Deep neural networks for solving large linear systems arising from high-dimensional problems. SIAM Journal on Scientific Computing, 45(5):A2356–A2381, 2023.
  15. How do transformers learn in-context beyond simple functions? A case study on learning with representations. International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ikwEDva1JZ.
  16. On the rate of convergence of a classifier based on a transformer encoder. IEEE Transactions on Information Theory, 68(12):8139–8155, 2022.
  17. In-context convergence of transformers. arXiv preprint arXiv:2310.05249, 2023.
  18. MathPrompter: Mathematical reasoning using large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pp. 37–42. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.acl-industry.4.
  19. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Drrl2gcjzl.
  20. Diffusion models for black-box optimization. In International Conference on Machine Learning, volume 202, pp. 17842–17857. PMLR, 2023. URL https://proceedings.mlr.press/v202/krishnamoorthy23a.html.
  21. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems, 32, 2019. URL https://papers.neurips.cc/paper_files/paper/2019/hash/6775a0635c302542da2c32aa19d86be0-Abstract.html.
  22. Dissecting chain-of-thought: Compositionality through in-context filtering and learning. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=xEhKwsqxMa.
  23. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
  24. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=8p3fu56lKc.
  25. Making transformers solve compositional tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3591–3607, 2022. URL https://aclanthology.org/2022.acl-long.251.
  26. An improved Newton iteration for the generalized inverse of a matrix, with applications. SIAM Journal on Scientific and Statistical Computing, 12(5):1109–1130, 1991.
  27. Learning to learn with generative models of neural network checkpoints. arXiv preprint arXiv:2209.12892, 2022.
  28. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0.
  29. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  30. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1):5485–5551, 2020. URL https://jmlr.org/papers/volume21/20-074/20-074.pdf.
  31. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
  32. On the numerical properties of an iterative method for computing the Moore–Penrose generalized inverse. SIAM Journal on Numerical Analysis, 11(1):61–74, 1974.
  33. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015. URL https://proceedings.mlr.press/v37/sohl-dickstein15.html.
  34. A survey of reasoning with foundation models. arXiv preprint arXiv:2312.11562, 2023.
  35. Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input. In International Conference on Machine Learning, volume 202, pp. 33416–33447. PMLR, 2023. URL https://proceedings.mlr.press/v202/takakura23a.html.
  36. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
  37. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp.  35151–35174. PMLR, 2023. URL https://proceedings.mlr.press/v202/von-oswald23a.html.
  38. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J.
  39. Looped transformers are better at learning learning algorithms. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=HHbRxoDTxE.
  40. MetaMath: Bootstrap your own mathematical questions for large language models. International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=N8N0hgNDRt.
  41. Trained transformers learn linear models in-context. arXiv preprint arXiv:2306.09927, 2023a.
  42. Applications of transformer-based language models in bioinformatics: a survey. Bioinformatics Advances, 3(1):vbad001, 2023b.
  43. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797, 2023.
Authors (9)
  1. Yihang Gao
  2. Chuanyang Zheng
  3. Enze Xie
  4. Han Shi
  5. Tianyang Hu
  6. Yu Li
  7. Michael K. Ng
  8. Zhenguo Li
  9. Zhaoqiang Liu