- The paper introduces Macaw, a T5-based QA system that performs strongly in zero-shot settings despite being far smaller than models such as GPT-3.
- Macaw uses a multi-angle training approach with flexible text-to-text transformations to handle diverse tasks, such as generating explanations and reverse-engineering questions from answers.
- Empirically, the 11B-parameter Macaw outperforms the 175B-parameter GPT-3 by over 10% on the Challenge300 dataset.
Macaw: A T5-Based Multi-Angle Question-Answering System
The paper presents "Macaw," a general-purpose question-answering (QA) system intended to fill a gap: few high-quality, broadly capable QA systems are freely available. Built on top of the UnifiedQA model, which is itself a fine-tuned T5, Macaw performs strongly in zero-shot settings and competes with far larger models such as GPT-3 despite having an order of magnitude fewer parameters.
System Architecture and Capabilities
Macaw is built around flexible text-to-text transformations that permit multiple permutations, or "angles," of its input and output slots (question, answer, multiple-choice options, context, and explanation). This covers the traditional question-to-answer direction as well as less common tasks such as generating explanations, reverse-engineering a question from an answer, and formulating multiple-choice options.
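As a concrete illustration, the sketch below queries the released allenai/macaw-large checkpoint via Hugging Face transformers, using the slot format documented in the project repository: desired output slots are listed first, then input slots follow as "slot = value" pairs. The generated text shown in the comment is illustrative and will vary.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the smallest public checkpoint; allenai/macaw-3b and
# allenai/macaw-11b expose the same interface.
tokenizer = AutoTokenizer.from_pretrained("allenai/macaw-large")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/macaw-large")

# Desired output slots come first; input slots are given as "slot = value".
input_string = "$answer$ ; $mcoptions$ ; $question$ = What is the color of a cloudy sky?"
input_ids = tokenizer.encode(input_string, return_tensors="pt")
output_ids = model.generate(input_ids, max_length=200)

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
# Illustrative output: ['$answer$ = gray ; $mcoptions$ = (A) blue (B) gray (C) white (D) black']
```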
The system is built in layers:
- T5 Foundation: Macaw starts from the pretrained T5 transformer, which casts every task as text-to-text generation.
- UnifiedQA Training: On top of T5, the UnifiedQA stage trains on multiple QA datasets in a single format, producing answers generatively rather than by span prediction.
- Multi-Angle Training: Macaw is then fine-tuned on examples pairing different combinations of input and output slots, broadening the range of tasks the model can perform (a sketch follows this list).
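The helper below shows how different angles reduce to different slot strings. run_angle is a hypothetical convenience wrapper, not part of the released code; the slot names match those used by the released checkpoints.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/macaw-large")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/macaw-large")

def run_angle(output_slots, **input_slots):
    """Build an angle string (output slot names, then 'slot = value'
    pairs, all joined by ' ; ') and return Macaw's decoded generation."""
    parts = [f"${slot}$" for slot in output_slots]
    parts += [f"${slot}$ = {value}" for slot, value in input_slots.items()]
    input_ids = tokenizer.encode(" ; ".join(parts), return_tensors="pt")
    output_ids = model.generate(input_ids, max_length=200)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

# Q -> A: the standard direction.
print(run_angle(["answer"], question="What gas do plants absorb from the air?"))

# QA -> E: request an explanation for a given question-answer pair.
print(run_angle(["explanation"],
                question="Which force pulls a dropped ball toward the ground?",
                answer="gravity"))
```

Because every angle is just a different arrangement of the same slots, no architectural change is needed to switch between answering, explaining, and question generation.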
Empirical Evaluation
Macaw was evaluated both on established benchmarks such as ARC and on Challenge300, a new set of 300 demanding questions curated by the authors to probe diverse reasoning skills. Notably, the 11B-parameter Macaw surpasses the 175B-parameter GPT-3 by over 10% on Challenge300, and it shows particular strength in categories such as commonsense and hypothetical reasoning.
Observations and Limitations
While Macaw performs impressively across many domains, certain limitations remain. The paper points out struggles with false presuppositions, arithmetic, and complex spatial reasoning; for example, asked "How old was Mark Zuckerberg when he founded Google?", Macaw answers "17" rather than rejecting the false premise. These weaknesses highlight ongoing challenges for pretrained transformer models wherever nuanced or step-wise logic is required.
Implications and Future Work
Macaw's architecture and results point to a promising avenue for QA systems: smaller, specialized models can rival much larger general-purpose ones in certain scenarios, a useful lesson for deployments where efficiency matters but capability cannot be sacrificed.
The paper proposes further investigation into the question types where Macaw struggles. Addressing these weaknesses could broaden its utility in applications such as educational tools, AI-driven tutoring, and interactive agents.
Conclusion
Macaw underscores the potential of T5's flexible text-to-text framing for building versatile, high-performing QA systems. By supporting multiple input-output formats in a single, freely available model, it offers the community a strong reference point for balancing model size against capability, and its release provides a robust tool for further work on general-purpose AI systems.