
DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models (2401.02208v1)

Published 4 Jan 2024 in cs.CL

Abstract: We present DIALIGHT, a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems which facilitates systematic evaluations and comparisons between ToD systems using fine-tuning of Pretrained Language Models (PLMs) and those utilising the zero-shot and in-context learning capabilities of LLMs. In addition to automatic evaluation, this toolkit features (i) a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level, and (ii) a microservice-based backend, improving efficiency and scalability. Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses. However, we also identify significant challenges of LLMs in adherence to task-specific instructions and generating outputs in multiple languages, highlighting areas for future research. We hope this open-sourced toolkit will serve as a valuable resource for researchers aiming to develop and properly evaluate multilingual ToD systems and will lower, currently still high, entry barriers in the field.


Summary

  • The paper presents DIALIGHT as a toolkit that streamlines building and comparing multilingual task-oriented dialogue systems using both fine-tuning of PLMs and zero-shot LLM approaches.
  • It employs a dual evaluation strategy by integrating automatic metrics like Joint Goal Accuracy, BLEU, and METEOR with detailed human assessments through an intuitive web interface.
  • Findings indicate that while PLM fine-tuned systems offer higher accuracy and coherence, LLM-based systems generate more diverse responses, highlighting trade-offs in multilingual performance.

Introduction to DIALIGHT

The development and evaluation of Task-Oriented Dialogue (ToD) systems are crucial for creating efficient and user-friendly AI-driven conversational agents. In light of this, the researchers introduce DIALIGHT, a novel toolkit that streamlines the process of building and benchmarking multilingual ToD systems. The toolkit is engineered to facilitate comparisons between systems that fine-tune Pretrained Language Models (PLMs) and those leveraging the more recent approach of zero-shot and in-context learning with LLMs.

DIALIGHT Features and Capabilities

One of the most notable features of DIALIGHT is its dual evaluation methodology, which combines automatic metrics with human judgments. Automatic evaluation covers a range of standard metrics, including Joint Goal Accuracy, BLEU, and METEOR. Human evaluation is supported by a secure, intuitive web interface that allows assessments at both the utterance and dialogue levels, enabling analysis of ToD systems that is both granular and holistic.
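To make the headline metric concrete, the sketch below computes Joint Goal Accuracy in its standard formulation: a dialogue turn counts as correct only if the predicted belief state matches the gold belief state exactly. This is an illustrative reimplementation of the common definition, not DIALIGHT's actual code; the slot names are invented for the example.

```python
def joint_goal_accuracy(predictions, references):
    """Fraction of turns whose predicted belief state exactly matches gold.

    predictions, references: lists of {slot: value} dicts, one per turn.
    """
    if not references:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, references))
    return correct / len(references)

# Hypothetical two-turn example: the second turn gets one slot value wrong,
# so the whole turn counts as incorrect under the "joint" criterion.
gold = [{"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "4"}]
pred = [{"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "3"}]
print(joint_goal_accuracy(pred, gold))  # 0.5
```

The all-or-nothing turn-level match is what distinguishes Joint Goal Accuracy from per-slot accuracy, and it is why the metric is sensitive to even single-slot tracking errors.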

Crucially, the toolkit supports multilingual development, enabling evaluation of systems in languages such as Arabic, French, and Turkish in addition to English. This is a significant step toward addressing the performance disparities observed in non-English ToD systems. It also uses a microservice-based backend, which improves efficiency and scalability, making it a robust resource for researchers.

Comparative Analysis of ToD Systems

The toolkit has already been used to carry out systematic evaluations. The findings suggest that while ToD systems built by fine-tuning PLMs generally display higher accuracy and coherence, LLM-based systems excel at generating more diverse and likeable responses. However, LLMs present their own challenges, particularly in faithfully following task-specific instructions and in producing outputs in languages other than English.
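The contrast above hinges on how the LLM-based systems are driven: instead of fine-tuning, a prompt is assembled from a task instruction, optional in-context exemplars, and the dialogue history. The sketch below shows one plausible way such a prompt could be built; the function name, field names, and format are illustrative assumptions, not DIALIGHT's actual prompting interface.

```python
def build_tod_prompt(instruction, examples, history):
    """Assemble a ToD prompt for zero-shot or in-context use.

    instruction: task description string (e.g. domain and output format).
    examples: list of {"user": ..., "system": ...} exemplars;
              an empty list yields a zero-shot prompt.
    history: list of (speaker, utterance) pairs for the current dialogue.
    """
    parts = [instruction]
    for ex in examples:  # few-shot exemplars, if any
        parts.append(f"User: {ex['user']}\nSystem: {ex['system']}")
    parts.append("\n".join(f"{speaker}: {utt}" for speaker, utt in history))
    parts.append("System:")  # cue the model to continue as the system
    return "\n\n".join(parts)

prompt = build_tod_prompt(
    "You are a hotel booking assistant. Reply with one system turn.",
    [{"user": "I need a cheap hotel.", "system": "Which area of town?"}],
    [("User", "Find me a 4-star hotel in the north.")],
)
print(prompt)
```

Because the instruction and exemplars are the only task signal the model receives, any drift from the requested format shows up directly in evaluation, which is consistent with the instruction-following challenges the paper reports.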

Looking Forward

The introduction of DIALIGHT is poised to lower entry barriers in the field and provide valuable insights into the development of ToD systems. While DIALIGHT allows for in-depth comparative research and could pave the way for improvements in multilingual ToD systems, the gaps identified in current research highlight the need for future studies that refine the use of LLMs, especially in tasks that require strict adherence to guidelines in diverse linguistic contexts.

The toolkit is an open-source resource, and the creators encourage adaptation and contributions from the broader research community to extend its capabilities and applications. With the groundwork laid by DIALIGHT, the field of conversational AI is positioned for exciting developments, particularly in building systems that can interact effectively across numerous languages and cultures.