TART: A plug-and-play Transformer module for task-agnostic reasoning (2306.07536v1)
Abstract: LLMs exhibit in-context learning abilities that enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying model for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our analysis actually reveals that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner? We answer this in the affirmative and propose TART, which generically improves an LLM's reasoning abilities using a synthetically trained Transformer-based reasoning module. TART trains this reasoning module in a task-agnostic manner using only synthetic logistic regression tasks and composes it with an arbitrary real-world pre-trained model without any additional training. With a single inference module, TART improves performance across different model families (GPT-Neo, Pythia, BLOOM), model sizes (100M–6B), tasks (14 NLP binary classification tasks), and even across different modalities (audio and vision). Additionally, on the RAFT Benchmark, TART improves GPT-Neo (125M)'s performance such that it outperforms BLOOM (176B), and is within 4% of GPT-3 (175B). Our code and models are available at https://github.com/HazyResearch/TART.
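The abstract describes composing a synthetically trained reasoning module with a frozen pre-trained model at inference time: the frozen model embeds the in-context examples and the query, and a small Transformer trained only on synthetic logistic regression tasks makes the prediction. Below is a minimal PyTorch sketch of that composition. The names (`ReasoningHead`, `tart_predict`, `reduce_fn`, `llm_embed`) are hypothetical stand-ins rather than the released API, and the widths, depth, and label encoding are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a TART-style composition (illustrative assumptions only;
# see https://github.com/HazyResearch/TART for the actual implementation).
import torch
import torch.nn as nn

class ReasoningHead(nn.Module):
    """Small Transformer trained ONLY on synthetic logistic-regression tasks:
    it sees a sequence of (features, label) pairs and predicts the label of a
    final query point. It never sees natural-language data."""
    def __init__(self, dim=16, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(dim + 1, 64)   # embed (features, label) jointly
        layer = nn.TransformerEncoderLayer(64, n_heads, 128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(64, 1)             # binary prediction head

    def forward(self, xs, ys):
        # xs: (B, T, dim) reduced LLM embeddings; ys: (B, T) labels, with the
        # label slot of the final (query) position masked out.
        seq = torch.cat([xs, ys.unsqueeze(-1)], dim=-1)
        h = self.encoder(self.in_proj(seq))
        return torch.sigmoid(self.out(h[:, -1]))   # prediction for the query

def tart_predict(head, reduce_fn, llm_embed, demos, test_text):
    """Inference-time composition with a frozen LLM (no task-specific training).
    demos: list of (text, label) with labels in {-1, +1}; the query's label
    slot is set to 0 (an illustrative encoding choice).
    llm_embed: frozen LLM text -> embedding vector.
    reduce_fn: dimensionality reduction to the module's input width."""
    xs = torch.stack([reduce_fn(llm_embed(t)) for t, _ in demos] +
                     [reduce_fn(llm_embed(test_text))]).unsqueeze(0)
    ys = torch.tensor([[float(y) for _, y in demos] + [0.0]])
    return head(xs, ys).item()   # estimated probability of the positive class
```

Because the reasoning module only ever consumes fixed-width feature vectors plus labels, the same trained head can, in principle, be reused across model families, sizes, and modalities, which is the plug-and-play property the abstract claims.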
- “Large language models are zero-shot clinical information extractors” In arXiv preprint arXiv:2205.12689, 2022
- “RAFT: A Real-World Few-Shot Text Classification Benchmark” In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021
- Tiago A. Almeida, José María Gómez Hidalgo and Akebo Yamakami “Contributions to the Study of SMS Spam Filtering: New Collection and Results” In Proceedings of the 2011 ACM Symposium on Document Engineering (DocEng ’11), 2011
- “Ask Me Anything: A simple strategy for prompting language models” In ICLR 2023, 2023
- “Pythia: A suite for analyzing large language models across training and scaling” In arXiv preprint arXiv:2304.01373, 2023
- “GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow” Zenodo, 2021
- “On the opportunities and risks of foundation models” In arXiv preprint arXiv:2108.07258, 2021
- “Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification” In CoRR abs/1903.04561, 2019
- “Language models are few-shot learners” In Advances in neural information processing systems 33, 2020, pp. 1877–1901
- “Active Prompting with Chain-of-Thought for Large Language Models” In arXiv preprint arXiv:2302.12246, 2023
- “What can transformers learn in-context? a case study of simple function classes” In Advances in Neural Information Processing Systems 35, 2022, pp. 30583–30598
- “Parameter-efficient transfer learning for NLP” In International Conference on Machine Learning, 2019, pp. 2790–2799, PMLR
- “LoRA: Low-Rank Adaptation of Large Language Models” In International Conference on Learning Representations, 2022
- Chip Huyen “Prompting vs. Finetuning vs. Alternatives”, 2023
- “ChatGPT: Jack of all trades, master of none” In arXiv preprint arXiv:2302.10724, 2023
- “Large Language Models are Zero-Shot Reasoners” In ICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022
- Alex Krizhevsky “Learning multiple layers of features from tiny images”, 2009
- “Fine-tuning can distort pretrained features and underperform out-of-distribution” In arXiv preprint arXiv:2202.10054, 2022
- Yann LeCun, Corinna Cortes and CJ Burges “MNIST handwritten digit database” In AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010
- Brian Lester, Rami Al-Rfou and Noah Constant “The Power of Scale for Parameter-Efficient Prompt Tuning” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3045–3059
- Xiang Lisa Li and Percy Liang “Prefix-Tuning: Optimizing Continuous Prompts for Generation” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597
- “Holistic evaluation of language models” In arXiv preprint arXiv:2211.09110, 2022
- “Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning” In Advances in Neural Information Processing Systems 35, 2022, pp. 1950–1965
- “What Makes Good In-Context Examples for GPT-3?” In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 2022, pp. 100–114
- “P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland: Association for Computational Linguistics, 2022
- “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8086–8098
- “Learning Word Vectors for Sentiment Analysis” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA: Association for Computational Linguistics, 2011, pp. 142–150
- “Can Foundation Models Wrangle Your Data?” In Proc. VLDB Endow. 16.4, VLDB Endowment, 2022
- “Transformers learn in-context by gradient descent” In arXiv preprint arXiv:2212.07677, 2022
- Bo Pang, Lillian Lee and Shivakumar Vaithyanathan “Thumbs Up? Sentiment Classification Using Machine Learning Techniques” In Proceedings of EMNLP, 2002, pp. 79–86
- “Hyena hierarchy: Towards larger convolutional language models” In arXiv preprint arXiv:2302.10866, 2023
- “Probability theory: The logic of science” Cambridge University Press, 2003
- “Robust Speech Recognition via Large-Scale Weak Supervision” In arXiv preprint, 2022
- “Improving language understanding by generative pre-training” In arXiv preprint, 2018
- “BLOOM: A 176B-parameter open-access multilingual language model” In arXiv preprint arXiv:2211.05100, 2022
- “Understanding machine learning: From theory to algorithms” Cambridge University Press, 2014
- “Recursive deep models for semantic compositionality over a sentiment treebank” In Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642
- “GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model”, https://github.com/kingoflolz/mesh-transformer-jax, 2021
- “Rationale-augmented ensembles in language models” In arXiv preprint arXiv:2207.00747, 2022
- “Self-consistency improves chain of thought reasoning in language models” In arXiv preprint arXiv:2203.11171, 2022
- P. Warden “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition” In ArXiv e-prints, 2018
- “Emergent Abilities of Large Language Models” In Transactions on Machine Learning Research, 2022 (Survey Certification)
- “Chain of thought prompting elicits reasoning in large language models” In arXiv preprint arXiv:2201.11903, 2022
- “Larger language models do in-context learning differently” In arXiv preprint arXiv:2303.03846, 2023
- “Visual Transformers: Token-based Image Representation and Processing for Computer Vision”, 2020
- “An explanation of in-context learning as implicit bayesian inference” In arXiv preprint arXiv:2111.02080, 2021
- “STaR: Bootstrapping Reasoning With Reasoning” In Advances in Neural Information Processing Systems, 2022
- “WRENCH: A Comprehensive Benchmark for Weak Supervision” In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021
- “LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention” In arXiv preprint arXiv:2303.16199, 2023
- Xiang Zhang, Junbo Zhao and Yann LeCun “Character-level convolutional networks for text classification” In Advances in neural information processing systems 28, 2015