TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks (2305.11430v2)

Published 19 May 2023 in cs.AI, cs.CL, cs.IR, and cs.LG

Abstract: While LLMs have shown great success in understanding and generating text in traditional conversational settings, their potential for performing ill-defined complex tasks is largely under-studied. Indeed, we are yet to conduct comprehensive benchmarking studies with multiple LLMs that are exclusively focused on a complex task. However, conducting such benchmarking studies is challenging because of the large variations in LLMs' performance when different prompt types/styles are used and different degrees of detail are provided in the prompts. To address this issue, the paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks. This taxonomy will allow future benchmarking studies to report the specific categories of prompts used as part of the study, enabling meaningful comparisons across different studies. Also, by establishing a common standard through this taxonomy, researchers will be able to draw more accurate conclusions about LLMs' performance on a specific complex task.

Citations (37)

Summary

  • The paper introduces TELeR, a taxonomy that categorizes prompts along four dimensions (turn, expression, role, and level of detail) to benchmark LLMs on complex tasks.
  • The study evaluates its framework using complex tasks like meta-review generation and narrative braiding, demonstrating practical applicability.
  • The proposed taxonomy offers a standardized methodology that enhances reproducibility and reliability in comparing LLM performance.

Overview of "TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks"

The paper "TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks" by Shubhra Kanti Karmaker and Dongji Feng tackles the under-explored area of benchmarking LLMs for complex tasks. Recognizing the critical role of prompts in guiding LLMs toward task-specific outputs, the authors propose a taxonomy, TELeR, to categorize and design prompts for complex tasks, thereby facilitating standardized benchmarking and comparison of LLM performances.

Context and Motivation

LLMs like GPT-3, Bard, and others have shown remarkable capabilities on well-defined, traditional NLP tasks. However, their efficacy on ill-defined, abstract, and complex tasks remains insufficiently studied and benchmarked. Benchmarking such tasks is difficult because LLM performance varies significantly with prompt style and degree of detail, and without a common framework, comparing findings across studies remains problematic. The TELeR taxonomy therefore provides a structured way to design and classify prompts, enabling more accurate cross-study comparisons.

Taxonomy Structure

The TELeR taxonomy categorizes prompts across four key dimensions:

  1. Turn: Refers to the number of interaction turns with the LLM during task performance, allowing for single or multi-turn prompts.
  2. Expression: Distinguishes between question-style and instruction-style directives within prompts.
  3. Role: Captures whether a system role is pre-defined or undefined before prompting the LLM.
  4. Level of Details: Encompasses a spectrum from minimal to highly detailed directives, graded into seven levels (0 to 6). Higher levels progressively add aspects such as a clearly described goal, distinct sub-tasks, requests for explanations or justifications, and few-shot examples.

The taxonomy’s name, TELeR, is a mnemonic derived from these four factors: Turn, Expression, Level of Details, and Role.
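To make the four dimensions concrete, here is a minimal sketch of how a benchmarking study might record the TELeR category of each prompt it uses. The names (`TELeRCategory`, `label`, and the enum values) are illustrative, not from the paper; TELeR itself is a conceptual taxonomy, not a software artifact.

```python
from dataclasses import dataclass
from enum import Enum

class Turn(Enum):
    SINGLE = "single-turn"
    MULTI = "multi-turn"

class Expression(Enum):
    QUESTION = "question-style"
    INSTRUCTION = "instruction-style"

class Role(Enum):
    DEFINED = "system role defined"
    UNDEFINED = "system role undefined"

@dataclass(frozen=True)
class TELeRCategory:
    """One cell of the taxonomy: Turn x Expression x Level of Details x Role."""
    turn: Turn
    expression: Expression
    level: int  # level of detail, 0 (no directive) through 6 (most detailed)
    role: Role

    def __post_init__(self):
        if not 0 <= self.level <= 6:
            raise ValueError("TELeR defines detail levels 0 through 6")

    def label(self) -> str:
        return (f"{self.turn.value}, {self.expression.value}, "
                f"level {self.level}, {self.role.value}")

# A study can then report, e.g., "single-turn, instruction-style,
# level 3, system role defined" for every prompt it benchmarks.
category = TELeRCategory(Turn.SINGLE, Expression.INSTRUCTION, level=3, role=Role.DEFINED)
print(category.label())
```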

Use Cases and Practical Implications

To demonstrate the TELeR taxonomy’s adaptability, the paper discusses two complex task scenarios: meta-review generation from peer-reviewer feedback and narrative braiding. These examples show how varying prompt detail levels impact task execution, validating the taxonomy’s potential to standardize prompt design and improve LLM evaluations across diverse tasks.
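As a hypothetical illustration of the meta-review scenario, the sketch below grades prompt templates by level of detail. The exact wording is invented here; the paper defines the levels abstractly rather than prescribing specific phrasings.

```python
# Illustrative prompt templates for meta-review generation, graded by
# TELeR "Level of Details". Only levels 0-3 are sketched; higher levels
# would add evaluation criteria, explanation requests, and few-shot examples.
META_REVIEW_PROMPTS = {
    0: "{reviews}",  # level 0: data only, no directive at all
    1: "Prepare a meta-review from the following peer reviews:\n{reviews}",
    2: ("Prepare a meta-review of the following peer reviews, summarizing "
        "strengths, weaknesses, and an overall recommendation:\n{reviews}"),
    3: ("Prepare a meta-review by performing these sub-tasks:\n"
        "1. Summarize the paper's core contribution.\n"
        "2. List the strengths and weaknesses each reviewer raised.\n"
        "3. Note any disagreements among the reviewers.\n"
        "4. Give an overall recommendation.\n"
        "Peer reviews:\n{reviews}"),
}

def build_prompt(level: int, reviews: str) -> str:
    """Fill a level-graded template with the concatenated peer reviews."""
    return META_REVIEW_PROMPTS[level].format(reviews=reviews)
```

In a benchmark, the same reviews would be run through each level, so that differences in output quality can be attributed to prompt detail rather than task content.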

Implications and Future Applications

The proposed taxonomy holds significant promise for future research efforts involving LLMs in complex applications. It offers a standard methodology for designing and reporting prompts, enabling more meaningful and reliable comparisons in LLM benchmarking studies. Adoption of TELeR can streamline consensus building on state-of-the-art LLM capabilities, steering the community towards a unified understanding of LLM performance in complex task settings.

Conclusion

This work presents an impactful step toward the systematic study and evaluation of LLMs on complex tasks, filling a crucial gap in prompt-engineering research. While the TELeR taxonomy may need adaptation for non-complex tasks, its framework is robust and extensible, providing a foundation for future refinements as LLM capabilities and applications evolve.
