- The paper presents a structured evaluation of LLMs on four BPM tasks: activity recommendation, RPA candidate identification, process question answering, and declarative process model mining.
- It compares both open-source and commercial models using metrics like precision, recall, F1-score, cosine similarity, and ROUGE-L to reveal significant performance variations.
- The study highlights the crucial impact of prompt engineering, showing that tailored prompting strategies improve model outputs in BPM-specific applications.
Benchmarking LLMs for BPM Tasks
The paper presents a structured evaluation of LLMs specifically in the domain of Business Process Management (BPM). Unlike existing NLP benchmarks that focus on generic tasks, this paper offers insights into BPM-specific tasks, an area previously lacking a standardized evaluation framework. By examining both open-source and commercial models on four distinct BPM tasks, the paper seeks to understand performance variations related to model size, topology, and licensing.
Introduction
The paper observes that LLMs are increasingly deployed in organizational contexts for tasks ranging from process analysis to prediction. Given the critical role of BPM in organizational efficiency, the authors argue that a tailored benchmark is needed to evaluate and compare how well LLMs suit BPM tasks. Existing benchmarks, although robust for general-purpose NLP, do not sufficiently account for the unique characteristics and requirements of BPM applications. This research aims to bridge that gap.
BPM Tasks Analyzed
Activity Recommendation
This task involves recommending the next activity in a business process. It is framed as a sequence prediction problem: given the activities observed so far in a trace, the model suggests suitable follow-up activities.
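To make this concrete, the sketch below shows how an activity-recommendation query might be posed to an LLM as a few-shot prompt. The example traces, prompt wording, and the `build_activity_prompt` helper are illustrative assumptions, not the prompts used in the paper.

```python
# Minimal sketch of posing activity recommendation as next-step prediction.
# The example traces and prompt wording are illustrative assumptions,
# not the prompts used in the paper.

FEW_SHOT_TRACES = [
    (["Receive order", "Check stock"], "Confirm order"),
    (["Receive invoice", "Verify invoice"], "Approve payment"),
]

def build_activity_prompt(observed: list[str]) -> str:
    """Build a few-shot prompt asking for the next activity in a process."""
    lines = [
        "You are given partial business process traces.",
        "Recommend the single most likely next activity.",
        "",
    ]
    for trace, next_activity in FEW_SHOT_TRACES:
        lines.append(f"Trace: {' -> '.join(trace)}")
        lines.append(f"Next activity: {next_activity}")
        lines.append("")
    lines.append(f"Trace: {' -> '.join(observed)}")
    lines.append("Next activity:")
    return "\n".join(lines)

if __name__ == "__main__":
    # The resulting string would be sent to the LLM under evaluation.
    print(build_activity_prompt(["Receive application", "Check documents"]))
```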
Identifying RPA Candidates
This task identifies candidates for Robotic Process Automation (RPA) by classifying process tasks as manual, user, or automated. The classification helps pinpoint automation opportunities.
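A minimal sketch of this setup as a three-way classification prompt is shown below, using the manual/user/automated label set described above; the prompt wording and the `parse_label` helper are illustrative assumptions rather than the paper's implementation.

```python
# Sketch: classify a task description as manual, user, or automated.
# Prompt wording and the parse_label helper are illustrative assumptions.

VALID_LABELS = {"manual", "user", "automated"}

def build_rpa_prompt(task_description: str) -> str:
    return (
        "Classify the following business process task as exactly one of: "
        "manual, user, automated.\n"
        "Tasks labelled 'automated' are candidates for RPA.\n\n"
        f"Task: {task_description}\n"
        "Label:"
    )

def parse_label(raw_answer: str) -> str:
    """Map a free-text model answer onto one of the three labels."""
    answer = raw_answer.strip().lower()
    for label in VALID_LABELS:
        if label in answer:
            return label
    return "unknown"  # fall back when the model answers off-format

if __name__ == "__main__":
    print(build_rpa_prompt("Copy invoice data from email into the ERP system"))
    print(parse_label("  Automated."))  # -> "automated"
```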
Process Question Answering
This task entails understanding textual descriptions of processes and answering related questions, emphasizing the comprehension and information-extraction capabilities of LLMs.
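The sketch below illustrates one plausible way to pose such a question over a process description; the description, question, and prompt wording are illustrative assumptions.

```python
# Sketch: answer a question over a textual process description.
# The process text, question, and prompt wording are illustrative assumptions.

def build_qa_prompt(process_description: str, question: str) -> str:
    return (
        "Read the process description and answer the question concisely, "
        "using only information from the description.\n\n"
        f"Process description:\n{process_description}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

if __name__ == "__main__":
    description = (
        "After an order is received, the warehouse checks stock. "
        "If items are available, the order is shipped; otherwise the "
        "customer is notified of the delay."
    )
    print(build_qa_prompt(description, "What happens when items are unavailable?"))
```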
Mining Declarative Process Models
This task involves extracting process constraints from textual descriptions to produce flexible, declarative process models, testing the ability of LLMs to understand and formalize process rules.
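Declarative models are commonly expressed with DECLARE-style constraint templates (e.g., Response(a, b): whenever a occurs, b must eventually follow). The sketch below asks an LLM for such constraints in a simple line-based format and parses the output; whether the paper uses exactly these templates and this output format is an assumption.

```python
# Sketch: extract DECLARE-style constraints from a textual process description
# and parse them into (template, activities) tuples. The listed templates are
# common DECLARE templates; the paper's exact template set and output format
# are assumptions.
import re

TEMPLATES = ["Init", "End", "Response", "Precedence", "Succession", "NotCoExistence"]

def build_mining_prompt(process_description: str) -> str:
    return (
        "Extract declarative process constraints from the description below.\n"
        f"Use only these templates: {', '.join(TEMPLATES)}.\n"
        "Output one constraint per line, e.g. Response(Check stock, Ship order).\n\n"
        f"Description:\n{process_description}\n\n"
        "Constraints:"
    )

def parse_constraints(raw_output: str) -> list[tuple[str, list[str]]]:
    """Parse lines like 'Response(A, B)' into ('Response', ['A', 'B'])."""
    constraints = []
    for line in raw_output.splitlines():
        match = re.match(r"\s*(\w+)\((.+)\)\s*$", line)
        if match and match.group(1) in TEMPLATES:
            activities = [a.strip() for a in match.group(2).split(",")]
            constraints.append((match.group(1), activities))
    return constraints

if __name__ == "__main__":
    sample_output = "Init(Receive order)\nResponse(Check stock, Ship order)"
    print(parse_constraints(sample_output))
    # [('Init', ['Receive order']), ('Response', ['Check stock', 'Ship order'])]
```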
Experimental Setup
Datasets and Models
Three datasets were used for the BPM tasks, derived from sources such as the SAP Signavio Academic Models collection and other documented process descriptions. The models selected for evaluation included both cutting-edge open-source and commercial LLMs, such as GPT-4 and Llama 3, alongside several medium-weight alternatives.
Metrics and Prompting Strategies
Evaluation metrics varied by task but focused on precision, recall, F1-score, cosine similarity, and ROUGE-L for text-based tasks. Prompting strategies involved few-shot examples, persona definitions, and step-by-step instructions to guide the models.
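To make the metrics concrete, the sketch below implements set-based precision/recall/F1, cosine similarity between embedding vectors, and an LCS-based ROUGE-L F-score using their standard definitions; whitespace tokenization and the exact aggregation are simplifying assumptions, since the paper's preprocessing is not detailed here.

```python
# Standard metric definitions: set-based precision/recall/F1, cosine similarity,
# and ROUGE-L (longest-common-subsequence based). Whitespace tokenization is a
# simplifying assumption; the paper's exact preprocessing may differ.
import math

def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def lcs_length(a: list[str], b: list[str]) -> int:
    """Dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(prediction: str, reference: str) -> float:
    """ROUGE-L F-score over whitespace tokens."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    p, r = lcs / len(pred), lcs / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

if __name__ == "__main__":
    print(precision_recall_f1({"manual", "automated"}, {"manual", "user"}))  # (0.5, 0.5, 0.5)
    print(rouge_l("the order is shipped", "the order is shipped immediately"))
```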
Results
The evaluation revealed diverse performance trends:
- General Model Performance: GPT-4 and Llama3-7b showed superior performance, consistently ranking high across tasks, particularly excelling in process question answering.
- Task-Specific Strengths: Models like Claude2-13b demonstrated particular aptitude for activity recommendation, while Mixtral-8x7b excelled in extracting information for process question answering.
- Effect of Prompting: Performance varied significantly across prompting strategies, underscoring the importance of prompt design for obtaining reliable outputs (a sketch of these prompt patterns follows Figure 2 below).
Figure 1: Normalized performance results for each BPM task and negated inference time, so that higher values indicate better performance. Results are based on few-shot prompting.
Figure 2: Performance gain/loss by LLMs and prompt pattern (difference from persona baseline).
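Figure 2 compares prompt patterns against a persona baseline. The sketch below shows how such patterns (persona, few-shot example, step-by-step instruction) could be layered onto the same base task; the specific wording of each pattern is an illustrative assumption, not the paper's prompt catalogue.

```python
# Sketch of the prompt patterns compared in Figure 2: a persona baseline,
# optionally extended with a few-shot example and a step-by-step instruction.
# The wording of each pattern is an illustrative assumption.

PERSONA = "You are an experienced business process analyst."
STEP_BY_STEP = "Think through the process step by step before giving the final answer."
FEW_SHOT = (
    "Example:\n"
    "Task: Copy invoice data from email into the ERP system\n"
    "Label: automated\n"
)

def build_prompt(task_instruction: str, few_shot: bool = False, steps: bool = False) -> str:
    """Compose a prompt from the persona baseline plus optional patterns."""
    parts = [PERSONA]
    if few_shot:
        parts.append(FEW_SHOT)
    if steps:
        parts.append(STEP_BY_STEP)
    parts.append(task_instruction)
    return "\n\n".join(parts)

if __name__ == "__main__":
    instruction = "Classify the task 'Approve vacation request' as manual, user, or automated."
    print(build_prompt(instruction))                 # persona baseline
    print(build_prompt(instruction, few_shot=True))  # + few-shot example
    print(build_prompt(instruction, steps=True))     # + step-by-step instruction
```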
Implications
The findings have implications for both practice and research. While some open-source models compete well with GPT-4 in efficiency and task specialization, model selection should be carefully aligned with the specific BPM needs at hand to maximize efficiency and performance. Prompt engineering also emerges as a crucial factor influencing output quality.
Conclusion
The paper provides a foundational benchmark for examining LLMs on BPM tasks. The insights underscore the need for domain-specific evaluations and show that LLM performance in BPM is highly task-dependent. Future work should expand the benchmark to additional tasks, develop more refined models for BPM, and further refine prompting strategies to maximize performance.