TaskLAMA: Probing the Complex Task Understanding of Language Models

Published 29 Aug 2023 in cs.CL | (2308.15299v1)

Abstract: Structured Complex Task Decomposition (SCTD) is the problem of breaking down a complex real-world task (such as planning a wedding) into a directed acyclic graph over individual steps that contribute to achieving the task, with edges specifying temporal dependencies between them. SCTD is an important component of assistive planning tools, and a challenge for commonsense reasoning systems. We probe how accurately SCTD can be done with the knowledge extracted from LLMs. We introduce a high-quality human-annotated dataset for this problem and novel metrics to fairly assess performance of LLMs against several baselines. Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline. We also propose a number of approaches to further improve their performance, with a relative improvement of 7% to 37% over the base model. However, we find that LLMs still struggle to predict pairwise temporal dependencies, which reveals a gap in their understanding of complex tasks.

Abstract PDF Upgrade to Chat

Citations (7)

View on Semantic Scholar

Summary

The paper introduces TaskLAMA, a novel dataset that benchmarks LLMs' ability to decompose complex tasks into context-driven steps with defined temporal dependencies.
It employs innovative Hungarian matching metrics to evaluate both step generation and temporal ordering, addressing limitations of traditional precision measures.
The study shows that while LLMs improve step generation by 15%-280%, they still struggle with accurately predicting the correct order of task steps.

Probing Complex Task Understanding with TaskLAMA

The paper "TaskLAMA: Probing the Complex Task Understanding of LLMs" (2308.15299) introduces a new dataset and methodology for evaluating the ability of LLMs to perform SCTD. This task involves breaking down complex, real-world tasks into a DAG of individual steps, with edges representing temporal dependencies. The authors highlight the importance of SCTD as a component of assistive planning tools and a challenge for commonsense reasoning systems. The paper presents TaskLAMA, a high-quality, human-annotated dataset, along with novel metrics designed for fair assessment of LLMs against several baselines.

TaskLAMA Dataset Construction

The TaskLAMA dataset comprises 1612 tasks, each annotated with:

Contextual assumptions to provide a setting for the task.
A set of required steps for completing the task under the specified context.
Temporal dependencies between the steps, forming a task graph.

This resulted in a collection of 12118 steps and 11105 temporal dependencies. The complex tasks were sourced from the MSComplexTasks dataset and popular "How To" search queries. Annotators were instructed to provide steps in their own words, ensuring each step started with a verb, represented a single action, and was actionable within the given context. The dataset was split into training, validation, and test sets, with conceptual differences maintained across the sets to ensure a diverse evaluation.

Methodology for Task Graph Generation and Evaluation

The paper explores various strategies for generating task graphs using LLMs, focusing on two key aspects: generating the steps and determining the order dependencies. For step generation, the following approaches were evaluated: ICL, MultSeq, S{content}F, SPT, and combinations thereof. To transform the generated sequences into task graphs, several approaches were considered: Linear, ICL, ICL with CoT, SPT, and LLM Scoring.

Figure 1: An example of a task graph for a complex task from the TaskLAMA dataset.

The evaluation metrics included standard measures like Rouge scores, but the authors identified a flaw in existing precision and recall metrics that could be exploited by simply adding duplicate sub-steps. To address this, they introduced metrics based on Hungarian matching to enforce a one-to-one mapping between generated and golden steps. They also reported a relaxed version of Hungarian matching allowing one-to-two mappings to account for cases where a single golden step might correspond to multiple generated steps. For edge evaluation, the paper introduces metrics like In-Degree, Out-Degree, and step proximity, all based on Rouge score overlap between parent and child nodes in the generated and ground truth graphs.

Results and Analysis

The study's findings indicate that LLMs excel at decomposing complex tasks into individual steps. Specifically, off-the-shelf LLMs demonstrated a $15\%-280\%$ relative improvement over the best baseline in step generation. Furthermore, the paper shows that LLMs can understand the context and adapt the generated steps accordingly. The performance with and without context is shown in (Figure 2).

Figure 2: The model performance with and without the context provided as input.

However, the results also reveal a significant limitation: LLMs struggle to accurately predict pairwise temporal dependencies between steps. While they can generate good sequences of steps, their ability to determine the correct order remains unsatisfactory. The paper presents a qualitative analysis of LLM-generated task graphs, highlighting instances where the steps and dependencies are sensible, but also pointing out potential issues and inaccuracies. An example of the model outputs with and without the context provided to the model is shown in (Figure 3).

Figure 3: An example of model outputs for a task with and without the context provided as input to the model.

Implications and Future Directions

The research has notable implications for the development of AI-powered assistive planning tools. The TaskLAMA dataset and the proposed metrics provide a valuable resource for benchmarking and improving LLMs' ability to understand and decompose complex tasks. The finding that LLMs struggle with temporal dependency prediction highlights a critical area for future research. Potential avenues for improvement include:

Developing new approaches to enhance LLMs' temporal reasoning abilities.
Exploring methods for generating entire task graphs at once, rather than sequentially.
Employing recursive decomposition strategies to break down complex tasks into simpler sub-tasks.

The paper concludes by acknowledging the limitations of the current work, such as the lack of conditional task graphs, the challenges of granularity in task graph generation, and the use of similarity-based metrics instead of entailment models. These limitations provide a clear roadmap for future research aimed at advancing the state-of-the-art in structured complex task decomposition.

Markdown