
TART: A plug-and-play Transformer module for task-agnostic reasoning (2306.07536v1)

Published 13 Jun 2023 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our analysis actually reveals that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner? We answer this in the affirmative and propose TART, which generically improves an LLM's reasoning abilities using a synthetically trained Transformer-based reasoning module. TART trains this reasoning module in a task-agnostic manner using only synthetic logistic regression tasks and composes it with an arbitrary real-world pre-trained model without any additional training. With a single inference module, TART improves performance across different model families (GPT-Neo, Pythia, BLOOM), model sizes (100M - 6B), tasks (14 NLP binary classification tasks), and even across different modalities (audio and vision). Additionally, on the RAFT Benchmark, TART improves GPT-Neo (125M)'s performance such that it outperforms BLOOM (176B), and is within 4% of GPT-3 (175B). Our code and models are available at https://github.com/HazyResearch/TART.


Summary

  • The paper introduces Tart, a synthetically trained reasoning module designed to perform probabilistic inference across diverse tasks.
  • It demonstrates broad compatibility and scalability by enhancing models from 100M to 6B parameters and extending its use to vision and audio modalities.
  • Tart narrows the in-context learning gap by effectively reducing reasoning deficiencies, achieving near task-specific tuning performance on multiple benchmarks.

Overview of Tart: A Plug-and-Play Transformer Module for Task-Agnostic Reasoning

The paper introduces Tart, an innovative modular approach aimed at enhancing the reasoning capabilities of LLMs in a task-agnostic manner. Traditional methods such as fine-tuning and prompt engineering have focused on task-specific adaptations, often lacking the flexibility and scalability required for broader applications. Tart addresses the persistent performance gap between in-context learning and task-specific tuning by emphasizing reasoning improvements rather than merely optimizing representations.
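
To make the plug-and-play composition concrete, the sketch below shows how frozen LLM embeddings for a few labeled demonstrations and a query could be handed to a trained reasoning module at inference time. This is a minimal illustration rather than the authors' implementation: the `reasoning_module` interface, the mean-pooled embeddings, and the choice of GPT-Neo (125M) as the base model are assumptions for this sketch, and any dimensionality reduction applied to the embeddings before they reach the module is omitted.

```python
# Minimal sketch (PyTorch + Hugging Face) of composing a frozen base LLM with a
# separately trained reasoning module at inference time. `reasoning_module` is
# an assumed callable, not the paper's API; dimensionality reduction is omitted.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token          # GPT-Neo ships without a pad token
base_lm = AutoModel.from_pretrained(MODEL).eval()  # representations stay frozen


@torch.no_grad()
def embed(texts):
    """Mean-pooled last-hidden-state embeddings from the frozen base model."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = base_lm(**batch).last_hidden_state                # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)        # (B, d)


@torch.no_grad()
def tart_predict(reasoning_module, demo_texts, demo_labels, query_text):
    """Pass (embedding, label) demonstrations plus a query embedding to the
    task-agnostic reasoning module, which is reused unchanged for every task."""
    demos = embed(demo_texts)                                   # (k, d)
    labels = torch.tensor(demo_labels, dtype=torch.float32)     # (k,)
    query = embed([query_text])                                 # (1, d)
    return reasoning_module(demos, labels, query)               # P(y = 1 | query)
```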

Core Contributions

  1. Reasoning Module Training: The authors present a synthetically trained Transformer-based reasoning module designed to perform probabilistic inference across a diverse set of tasks. The module is trained exclusively on synthetic logistic regression tasks (a minimal sketch of such a task follows this list), which lets it generalize without ever being fine-tuned for a specific application.
  2. Model Compatibility and Scalability: Tart is shown to improve models from varied families (GPT-Neo, Pythia, BLOOM), across a wide size range (100M to 6B parameters), and even extends to other modalities, including vision and audio.
  3. In-Context Learning Limitations: The analysis shows that LLMs, while proficient at producing rich representations, frequently falter at the simple probabilistic reasoning needed to use those representations in context. Training task-specific adapters without altering core model parameters closes up to 90% of the gap between in-context learning and task-specific tuning, suggesting that the reasoning step, rather than the representations, is the key bottleneck.
  4. Performance and Benchmarking: Extensive evaluations over 14 binary NLP classification tasks show that Tart, despite its task-agnostic design, achieves near-parity with task-specific fine-tuning approaches. On the RAFT benchmark, Tart lifts GPT-Neo (125M) above BLOOM (176B) and to within 4% of GPT-3 (175B).
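
To make the first contribution concrete, the sketch below (referenced in item 1) illustrates the kind of synthetic logistic-regression sequence on which such a reasoning module could be trained. The dimensionality, sequence length, and noise scale are illustrative placeholders rather than the paper's exact training configuration.

```python
# Minimal sketch of sampling one synthetic logistic-regression task of the kind
# used to train the reasoning module. All hyperparameters here are placeholders.
import torch

def sample_logistic_task(k=64, d=16, noise_scale=1.0, generator=None):
    """Draw a random weight vector w and k labeled points with x_i ~ N(0, I_d)
    and y_i ~ Bernoulli(sigmoid(w . x_i / noise_scale)). The reasoning module is
    trained to predict each y_t from the prefix (x_1, y_1, ..., x_t), i.e. in context."""
    w = torch.randn(d, generator=generator)
    x = torch.randn(k, d, generator=generator)
    probs = torch.sigmoid(x @ w / noise_scale)
    y = torch.bernoulli(probs, generator=generator)
    return x, y

x, y = sample_logistic_task(generator=torch.Generator().manual_seed(0))
print(x.shape, y.shape)   # torch.Size([64, 16]) torch.Size([64])
```

Because every training sequence is generated this way, the module only ever learns a generic in-context inference rule, which is what allows it to be composed with arbitrary pre-trained models and reused across tasks and modalities without additional training.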

Implications and Future Work

The insights and methodologies introduced in this paper open various avenues for the future of AI development:

  • Theoretical Generalization Analysis: The authors provide theoretical insights suggesting that the effectiveness of Tart hinges on minimizing distributional shifts between synthetic and real data representations. This theoretical groundwork offers a robust foundation for the development and assessment of future reasoning modules.
  • Modality Expansion and Multi-Task Applications: Tart’s demonstrated efficacy across NLP, vision, and audio indicates the potential for adaptive models in multi-modal AI applications. Expanding this framework to handle complex multi-task environments will be pivotal.
  • Broader Integration and Deployment: Practically, Tart could simplify deploying LLMs in edge and other resource-constrained environments, since a single reasoning module can be reused across applications without bespoke fine-tuning.

In conclusion, Tart represents a meaningful advance toward task-agnostic reasoning in LLMs. By shifting the focus from representations to reasoning deficiencies, the authors provide a scalable and efficient alternative to per-task adaptation, one that avoids both the performance penalty of plain in-context learning and the computational cost of fine-tuning every task. Future research could extend the approach to additional reasoning paradigms and more sophisticated cognitive tasks.
