
Teaching Smaller Language Models To Generalise To Unseen Compositional Questions (2308.00946v2)

Published 2 Aug 2023 in cs.CL and cs.AI

Abstract: We equip a smaller LLM to generalise to answering challenging compositional questions that have not been seen in training. To do so we propose a combination of multitask supervised pretraining on up to 93 tasks designed to instill diverse reasoning abilities, and a dense retrieval system that aims to retrieve a set of evidential paragraph fragments. Recent progress in question-answering has been achieved either through prompting methods against very large pretrained LLMs in zero or few-shot fashion, or by fine-tuning smaller models, sometimes in conjunction with information retrieval. We focus on the less explored question of the extent to which zero-shot generalisation can be enabled in smaller models with retrieval against a corpus within which sufficient information to answer a particular question may not exist. We establish strong baselines in this setting for diverse evaluation datasets (StrategyQA, CommonsenseQA, IIRC, DROP, Musique and ARC-DA), and show that performance can be significantly improved by adding retrieval-augmented training datasets which are designed to expose our models to a variety of heuristic reasoning strategies such as weighing partial evidence or ignoring an irrelevant context.

Summary

  • The paper demonstrates that combining multitask pretraining with dense retrieval enhances small language model generalization on unseen compositional questions.
  • It introduces a query transformation strategy that reframes complex question answering as a reading comprehension task beyond conventional two-hop retrieval.
  • Experimental results, including on StrategyQA, reveal that retrieval-augmented training significantly narrows the performance gap with larger models.

Enhancing Smaller LLMs with Multitask Pretraining and Dense Retrieval for Compositional Question Answering

Introduction

Recent advances have demonstrated that large pretrained LLMs are effective at question answering, handling compositional questions never seen during training. However, practical considerations such as latency, cost, and energy efficiency can limit these models' applicability. This paper explores the extent to which smaller LLMs, enhanced by multitask pretraining and dense retrieval, can generalize to complex, unseen questions. Specifically, it investigates a model pretrained on 93 diverse reasoning tasks and augmented with a dense retrieval system, focusing on compositional questions whose answers may not be directly inferable from a given corpus.

The paper builds on and extends previous methodologies in retrieval-augmented question answering. It broadens the multitask pretraining approach to cover a wider range of tasks designed to instill versatile reasoning strategies. Rather than relying solely on a model's parameters to encode knowledge, this work uses a query transformation strategy that recasts question answering as reading comprehension over information retrieved from an external corpus. It also moves beyond the two-hop retrieval limit of earlier pipelines, enabling the capture of longer reasoning paths. Comparative analyses show that the iteratively enhanced retrieval, reranking, and scoring system aligns promisingly with human reasoning patterns, especially in multi-hop question answering.
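To make the iterative, beyond-two-hop retrieval idea concrete, here is a minimal sketch of multi-hop dense retrieval with query transformation. The encoder is a toy bag-of-words stand-in (the paper's system uses a trained dense retriever), and the function names, corpus, and scoring details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words encoder over a fixed vocabulary. A real system
    would use a trained dense bi-encoder; this stand-in only captures
    token overlap."""
    v = np.zeros(len(vocab))
    for tok in text.lower().replace("?", "").replace(".", "").split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def iterative_retrieve(question, corpus, hops=3):
    """Multi-hop dense retrieval with query transformation: after each
    hop, the best-scoring fragment is appended to the query so that
    later hops can follow reasoning chains longer than two steps."""
    vocab = {t: i for i, t in enumerate(
        sorted({w for d in corpus for w in d.lower().replace(".", "").split()}))}
    doc_vecs = np.stack([embed(d, vocab) for d in corpus])
    query, picked = question, []
    for _ in range(hops):
        scores = doc_vecs @ embed(query, vocab)   # cosine similarity
        for i in picked:                          # skip fragments already used
            scores[i] = -np.inf
        best = int(np.argmax(scores))
        picked.append(best)
        query = question + " " + corpus[best]     # query transformation step
    return [corpus[i] for i in picked]

corpus = [
    "The Eiffel Tower is in Paris.",
    "Paris is the capital of France.",
    "France is in Europe.",
]
evidence = iterative_retrieve("Which continent is the Eiffel Tower on?", corpus)
```

On this toy corpus the three hops chain from the Eiffel Tower to Paris to Europe, a path a single-shot retriever keyed only on the question's terms would be unlikely to complete.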

Experiments and Results

The authors evaluate their hypothesis on six diverse datasets that test textual and numerical reasoning. Notably, models trained with retrieval-augmented training datasets (RATD) significantly outperformed the baselines, demonstrating that smaller models can generalize effectively from observed compositional reasoning to unseen problems. The paper also uncovers challenges in the models' numerical literacy and in handling unanswerable questions, particularly where retrieval introduces plausible but misleading information.

In detailed experiments comparing baseline models trained without RATD datasets to models augmented with them, the findings consistently show improved performance across datasets when models acquire heuristic reasoning strategies through RATD. On the StrategyQA dataset, for instance, the augmented models demonstrated superior generalization, in some settings approaching the performance of much larger LLMs.
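A retrieval-augmented training example of the kind described above can be sketched as follows. The field names and text formatting are illustrative assumptions, not the paper's actual data format; the key idea is pairing a question with retrieved fragments that may be partial evidence or pure distraction.

```python
def build_ratd_example(question, answer, retrieved_fragments, max_context_chars=1200):
    """Pack a question together with retrieved paragraph fragments into
    a single reading-comprehension training sample. Because fragments
    may be only partially evidential or entirely irrelevant, training
    on such samples pushes the model to learn heuristics like weighing
    partial evidence or ignoring a distracting context."""
    context = " ".join(
        f"[{i + 1}] {frag}" for i, frag in enumerate(retrieved_fragments)
    )
    return {
        "input": f"question: {question} context: {context[:max_context_chars]}",
        "target": answer,
    }

example = build_ratd_example(
    "Did Aristotle use a laptop?",
    "no",
    [
        "Aristotle was a Greek philosopher who died in 322 BC.",  # partial evidence
        "The first laptop computers appeared in the 1980s.",      # partial evidence
        "Athens is the capital of Greece.",                       # distractor
    ],
)
```

No single fragment answers the question; the model must combine the two dated facts while discarding the distractor, which is exactly the heuristic behaviour RATD is designed to instill.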

Discussion and Future Directions

The paper's findings underline the potential of combining multitask pretraining with sophisticated retrieval mechanisms to enhance smaller models' performance on complex question-answering tasks. This approach not only contributes to the development of more accessible and versatile AI tools but also offers insights into the mechanics of knowledge application and reasoning in AI systems. The extension of multitask pretraining to incorporate a more extensive array of reasoning strategies and the refinement of retrieval systems for better context relevance and evidence scoring are outlined as promising areas for future research. Additionally, the paper hints at exploring the balance between encoded knowledge in model parameters and dynamically retrieved information for efficient and accurate problem-solving in AI.

Conclusion

This paper makes significant strides in advancing the question-answering capabilities of smaller LLMs, combining the advantages of dense retrieval systems with the generalizability afforded by multitask pretraining. Its contributions deepen our understanding of how AI systems can mimic complex human reasoning patterns and set a benchmark for future work on making AI both more efficient and more effective across a broader range of applications.