
DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models (2401.02208v1)

Published 4 Jan 2024 in cs.CL

Abstract: We present DIALIGHT, a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems which facilitates systematic evaluations and comparisons between ToD systems using fine-tuning of Pretrained Language Models (PLMs) and those utilising the zero-shot and in-context learning capabilities of LLMs. In addition to automatic evaluation, this toolkit features (i) a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level, and (ii) a microservice-based backend, improving efficiency and scalability. Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses. However, we also identify significant challenges of LLMs in adherence to task-specific instructions and generating outputs in multiple languages, highlighting areas for future research. We hope this open-sourced toolkit will serve as a valuable resource for researchers aiming to develop and properly evaluate multilingual ToD systems and will lower, currently still high, entry barriers in the field.


Summary

  • The paper presents DIALIGHT as a toolkit that streamlines building and comparing multilingual task-oriented dialogue systems using both fine-tuning of PLMs and zero-shot LLM approaches.
  • It employs a dual evaluation strategy by integrating automatic metrics like Joint Goal Accuracy, BLEU, and METEOR with detailed human assessments through an intuitive web interface.
  • Findings indicate that while PLM fine-tuned systems offer higher accuracy and coherence, LLM-based systems generate more diverse responses, highlighting trade-offs in multilingual performance.

Introduction to DIALIGHT

The development and evaluation of Task-Oriented Dialogue (ToD) systems are crucial for creating efficient and user-friendly AI-driven conversational agents. In light of this, the researchers introduce DIALIGHT, a novel toolkit that streamlines the process of building and benchmarking multilingual ToD systems. The toolkit is engineered to facilitate comparisons between systems that fine-tune Pretrained Language Models (PLMs) and those leveraging the more recent approach of zero-shot and in-context learning with LLMs.

DIALIGHT Features and Capabilities

One of the most notable features of DIALIGHT is its dual evaluation methodology, which combines automatic metrics with human judgments. Automatic evaluation covers a range of standard metrics, including Joint Goal Accuracy, BLEU, and METEOR. Human evaluation is supported by a secure, intuitive web interface that allows assessments at both the utterance and dialogue levels, enabling analysis of ToD systems that is both granular and holistic.
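To make the headline metric concrete, the sketch below computes Joint Goal Accuracy in its standard formulation: a dialogue turn counts as correct only if the predicted belief state matches the gold belief state exactly. This is an illustrative reimplementation of the common definition, not DIALIGHT's actual code; the slot names are invented for the example.

```python
def joint_goal_accuracy(predictions, references):
    """Fraction of turns whose predicted belief state exactly matches gold.

    predictions, references: lists of {slot: value} dicts, one per turn.
    """
    if not references:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, references))
    return correct / len(references)

# Hypothetical two-turn example: the second turn gets one slot value wrong,
# so the whole turn counts as incorrect under the "joint" criterion.
gold = [{"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "4"}]
pred = [{"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "3"}]
print(joint_goal_accuracy(pred, gold))  # 0.5
```

The all-or-nothing turn-level match is what distinguishes Joint Goal Accuracy from per-slot accuracy, and it is why the metric is sensitive to even single-slot tracking errors.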

Crucially, the toolkit supports multilingual development, enabling evaluation of systems in languages such as Arabic, French, and Turkish in addition to English. This is a significant step toward addressing the performance disparities observed in non-English ToD systems. It also uses a microservice-based backend, which improves efficiency and scalability, making it a robust resource for researchers.

Comparative Analysis of ToD Systems

The toolkit has already been used to carry out systematic evaluations. The findings suggest that while ToD systems built by fine-tuning PLMs generally display higher accuracy and coherence, LLM-based systems excel at generating more diverse and likeable responses. However, LLMs present their own challenges, particularly in faithfully following task-specific instructions and in producing outputs in languages other than English.
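The contrast above hinges on how the LLM-based systems are driven: instead of fine-tuning, a prompt is assembled from a task instruction, optional in-context exemplars, and the dialogue history. The sketch below shows one plausible way such a prompt could be built; the function name, field names, and format are illustrative assumptions, not DIALIGHT's actual prompting interface.

```python
def build_tod_prompt(instruction, examples, history):
    """Assemble a ToD prompt for zero-shot or in-context use.

    instruction: task description string (e.g. domain and output format).
    examples: list of {"user": ..., "system": ...} exemplars;
              an empty list yields a zero-shot prompt.
    history: list of (speaker, utterance) pairs for the current dialogue.
    """
    parts = [instruction]
    for ex in examples:  # few-shot exemplars, if any
        parts.append(f"User: {ex['user']}\nSystem: {ex['system']}")
    parts.append("\n".join(f"{speaker}: {utt}" for speaker, utt in history))
    parts.append("System:")  # cue the model to continue as the system
    return "\n\n".join(parts)

prompt = build_tod_prompt(
    "You are a hotel booking assistant. Reply with one system turn.",
    [{"user": "I need a cheap hotel.", "system": "Which area of town?"}],
    [("User", "Find me a 4-star hotel in the north.")],
)
print(prompt)
```

Because the instruction and exemplars are the only task signal the model receives, any drift from the requested format shows up directly in evaluation, which is consistent with the instruction-following challenges the paper reports.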

Looking Forward

The introduction of DIALIGHT is poised to lower entry barriers in the field and provide valuable insights into the development of ToD systems. While DIALIGHT allows for in-depth comparative research and could pave the way for improvements in multilingual ToD systems, the gaps identified in current research highlight the need for future studies that refine the use of LLMs, especially in tasks that require strict adherence to guidelines in diverse linguistic contexts.

The toolkit is an open-source resource, and the creators encourage adaptation and contributions from the broader research community to extend its capabilities and applications. With the groundwork laid by DIALIGHT, the field of conversational AI is positioned for exciting developments, particularly in building systems that can interact effectively across numerous languages and cultures.