TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools (2406.03618v3)

Published 5 Jun 2024 in cs.CL

Abstract: LLMs often do not perform well on queries that require the aggregation of information across texts. To better evaluate this setting and facilitate modeling efforts, we introduce TACT - Text And Calculations through Tables, a dataset crafted to evaluate LLMs' reasoning and computational abilities using complex instructions. TACT contains challenging instructions that demand stitching information scattered across one or more texts, and performing complex integration on this information to generate the answer. We construct this dataset by leveraging an existing dataset of texts and their associated tables. For each such table, we formulate new queries, and gather their respective answers. We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%. To pinpoint the difficulties and thoroughly dissect the problem, we analyze model performance across three components: table-generation, Pandas command-generation, and execution. Unexpectedly, we discover that each component presents substantial challenges for current LLMs. These insights lead us to propose a focused modeling framework, which we refer to as IE as a tool. Specifically, we propose to add "tools" for each of the above steps, and implement each such tool with few-shot prompting. This approach shows an improvement over existing prompting techniques, offering a promising direction for enhancing model capabilities in these tasks.

Insights into TACT: A Dataset for Complex Aggregative Reasoning

The paper "TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools" addresses a notable challenge in the domain of LLMs: their underperformance on tasks requiring reasoning over and aggregation of information across texts. The authors present TACT (Text And Calculations through Tables), a dataset specifically crafted to evaluate and enhance the reasoning and computational abilities of LLMs. TACT comprises difficult queries that require stitching together information scattered across texts and performing complex calculations to derive answers.

Dataset and Methodology

TACT stems from the InstructIE dataset, which comprises texts and their associated structured tables. In contrast to InstructIE, TACT focuses on aggregative queries requiring the synthesis of information from multiple text sources. The dataset consists of 124 instances, each including the following (a worked sketch of such an instance appears after the list):

  • Original text and table from InstructIE.
  • A numerical instruction framed in natural language.
  • A corresponding Pandas command to perform the required calculation.
  • The expected result after executing the Pandas command.
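To make the instance format concrete, here is a minimal sketch of what a TACT-style example might look like. The text, table, instruction, and command below are invented for illustration and are not drawn from the dataset itself.

```python
import pandas as pd

# Hypothetical TACT-style instance (illustrative, not from the dataset).
# Source text from which the table would have been extracted:
text = "Acme shipped 120 units in Q1 and 150 in Q2; Zenith shipped 90 and 200."

# Table associated with the text (in TACT, this comes from InstructIE):
df = pd.DataFrame({
    "company": ["Acme", "Acme", "Zenith", "Zenith"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "units": [120, 150, 90, 200],
})

# Numerical instruction framed in natural language:
instruction = "What is the total number of units shipped in Q2?"

# Corresponding Pandas command:
command = "df[df['quarter'] == 'Q2']['units'].sum()"

# Expected result after executing the Pandas command:
print(eval(command))  # 350
```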

The dataset creation process involved domain experts who judiciously selected text-table pairs, formulated meaningful numeric instructions, translated these into executable Pandas commands, and validated the results. This meticulous process ensures the benchmark's robustness and relevance.

Performance Evaluation and Challenges

The paper highlights the surprisingly low performance of contemporary LLMs on TACT, with no model exceeding 38% accuracy. To dissect the challenges, the authors decompose the problem into three tasks: table generation, Pandas command generation, and command execution. They report significant difficulties in all three components, indicating substantial room for improvement.
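As a concrete illustration of how one such component might be scored in isolation, the hypothetical harness below gives the model a gold table and gold command, then checks the model's claimed output against the value obtained by actually running the command. This is a sketch under those assumptions, not the paper's evaluation code.

```python
import pandas as pd

def check_execution(df: pd.DataFrame, command: str, model_output: str) -> bool:
    """Score the execution component in isolation (hypothetical harness):
    run the gold Pandas command and compare against the model's answer."""
    gold = eval(command, {"df": df, "pd": pd})  # ground-truth execution
    return str(gold).strip() == model_output.strip()

# Example: the model should answer "350" for this command over this table.
df = pd.DataFrame({"quarter": ["Q1", "Q2", "Q2"], "units": [120, 150, 200]})
print(check_execution(df, "df[df['quarter'] == 'Q2']['units'].sum()", "350"))  # True
```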

Model Performance Analysis

The evaluation included various state-of-the-art models such as Gemini-1.0-Ultra, Gemini-1.5-Pro, Llama-2-13b-chat, Llama-2-70b-chat, Gemma-7b-it, Mistral-7b-instruct, and Mixtral 8x7B. The results showed underwhelming performance across the board, especially on tasks that require integrating information across multiple contexts, which is precisely the setting TACT is designed to probe.

Proposed Solution: IE as a Tool Framework

The authors propose a modeling framework termed IE as a tool, in which each component of the task is tackled separately by a few-shot-prompted tool. Specifically, the framework covers (a sketch of such a pipeline follows the list):

  • Generation of relevant tables from the text.
  • Creation of appropriate Pandas commands to execute desired calculations.
  • Integration of these components to produce the final answer.
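A minimal sketch of how such a pipeline might be wired together is shown below. `call_llm` and the prompt strings are placeholders, since the paper's few-shot prompts are not reproduced here, and evaluating the generated command with a real interpreter is one plausible realization of the final integration step rather than the paper's exact implementation.

```python
import io
import pandas as pd

def call_llm(prompt: str) -> str:
    """Placeholder for a few-shot-prompted model call; wire in any LLM API."""
    raise NotImplementedError

def answer_with_ie_tools(text: str, instruction: str) -> str:
    # Tool 1: table generation -- extract a table relevant to the instruction.
    table_csv = call_llm(
        f"Extract a CSV table from the text that is relevant to the instruction.\n"
        f"Text: {text}\nInstruction: {instruction}\nCSV:"
    )
    df = pd.read_csv(io.StringIO(table_csv))

    # Tool 2: Pandas command generation over the extracted table.
    command = call_llm(
        f"Table columns: {list(df.columns)}\n"
        f"Instruction: {instruction}\n"
        f"Write one Pandas expression over a DataFrame named df:"
    )

    # Tool 3: execution -- evaluate the command deterministically instead of
    # asking the model to carry out the arithmetic itself.
    result = eval(command, {"df": df, "pd": pd})
    return str(result)
```

Decomposing the task this way means each tool can be prompted, inspected, and improved independently, which is what enables the component-level analysis reported above.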

This approach showed promise, with performance improvements of up to 12% over conventional prompting techniques. The proposed framework effectively leverages decomposed reasoning steps to enhance LLM capabilities in handling complex queries.

Implications and Future Directions

The introduction of TACT provides a crucial benchmark for evaluating and advancing LLM reasoning and computational abilities. The rigorous analysis of model strengths and weaknesses across different subtasks offers valuable insights into specific areas needing improvement. The proposed IE as a tool framework exemplifies an effective strategy for enhancing model performance, suggesting that tailored tool integration could significantly advance the field of computational reasoning.

Future research could explore more sophisticated implementations of IE tools, focusing on reducing error propagation through sequential components. Additionally, expanding the dataset to encompass a broader range of numerical reasoning tasks could further test and refine LLM capabilities.

Conclusion

The TACT dataset and the accompanying research provide a critical lens through which the limitations of current LLMs can be analyzed and addressed. This work underscores the importance of targeted benchmarks and innovative frameworks like IE as a tool in pushing the boundaries of complex reasoning and aggregation tasks. The insights garnered from this research pave the way for future advancements in AI, particularly in enhancing the computational and reasoning capacities of LLMs through specialized tools and frameworks.

Authors (8)
  1. Avi Caciularu (46 papers)
  2. Alon Jacovi (26 papers)
  3. Eyal Ben-David (15 papers)
  4. Sasha Goldshtein (4 papers)
  5. Tal Schuster (33 papers)
  6. Jonathan Herzig (34 papers)
  7. Gal Elidan (30 papers)
  8. Amir Globerson (87 papers)