
Large Language Models as Zero-shot Dialogue State Tracker through Function Calling (2402.10466v4)

Published 16 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs are increasingly prevalent in conversational systems due to their advanced understanding and generative capabilities in general contexts. However, their effectiveness in task-oriented dialogues (TOD), which requires not only response generation but also effective dialogue state tracking (DST) within specific tasks and domains, remains less satisfying. In this work, we propose a novel approach FnCTOD for solving DST with LLMs through function calling. This method improves zero-shot DST, allowing adaptation to diverse domains without extensive data collection or model tuning. Our experimental results demonstrate that our approach achieves exceptional performance with both modestly sized open-source and also proprietary LLMs: with in-context prompting it enables various 7B or 13B parameter models to surpass the previous state-of-the-art (SOTA) achieved by ChatGPT, and improves ChatGPT's performance beating the SOTA by 5.6% average joint goal accuracy (JGA). Individual model results for GPT-3.5 and GPT-4 are boosted by 4.8% and 14%, respectively. We also show that by fine-tuning on a small collection of diverse task-oriented dialogues, we can equip modestly sized models, specifically a 13B parameter LLaMA2-Chat model, with function-calling capabilities and DST performance comparable to ChatGPT while maintaining their chat capabilities. We have made the code publicly available at https://github.com/facebookresearch/FnCTOD

Leveraging LLMs for Zero-shot Dialogue State Tracking via Function Calling

Introduction to FnCTOD Approach

The FnCTOD approach harnesses LLMs for zero-shot dialogue state tracking (DST) by introducing function calling into conversational contexts. This strategy removes the need for extensive data collection and model re-training for task-oriented dialogues (TOD), addressing a significant bottleneck in deploying conversational systems across diverse domains. By embedding function specifications into the dialogue as part of the system prompt, FnCTOD enables LLMs to generate both dialogue states and responses seamlessly, a critical step toward making versatile conversational systems practical and scalable.
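To make the prompt construction concrete, here is a minimal sketch of converting a domain schema into a function specification embedded in a system prompt. The hotel schema, slot names, and the exact JSON layout are illustrative assumptions, not the paper's precise format:

```python
import json

# Hypothetical domain schema for a hotel domain; slot names and values
# are illustrative, not taken from the paper.
hotel_schema = {
    "domain": "hotel",
    "description": "Find a hotel matching the user's constraints",
    "slots": {
        "area": {"description": "Area of the city",
                 "values": ["north", "south", "centre"]},
        "price_range": {"description": "Price range of the hotel",
                        "values": ["cheap", "moderate", "expensive"]},
        "stars": {"description": "Star rating of the hotel"},
    },
}

def schema_to_function_spec(schema):
    """Convert a domain schema into a function specification so the LLM
    can 'call' the function with the current dialogue state as arguments."""
    properties = {}
    for slot, info in schema["slots"].items():
        prop = {"type": "string", "description": info["description"]}
        if "values" in info:
            prop["enum"] = info["values"]  # categorical slots get an enum
        properties[slot] = prop
    return {
        "name": f"find_{schema['domain']}",
        "description": schema["description"],
        "parameters": {"type": "object", "properties": properties},
    }

spec = schema_to_function_spec(hotel_schema)
# Embed the specification in the system prompt for the target domain.
system_prompt = (
    "You can call the following function to record the user's constraints:\n"
    + json.dumps(spec, indent=2)
)
```

Because the specification is generated from the schema alone, adapting to a new domain only requires supplying its schema, with no model tuning.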

Key Contributions and Results

The paper delineates several contributions of the FnCTOD methodology. First, it shows that FnCTOD significantly improves the performance of both modestly sized open-source and proprietary LLMs through in-context prompting alone; notably, it lifts GPT-4's performance by 14%, establishing a new state of the art for zero-shot DST. Second, it narrows the gap between open-source models and ChatGPT: fine-tuning a 13B LLaMA2-Chat model on a diverse collection of task-oriented dialogues equips it with function-calling DST capabilities comparable to ChatGPT's while preserving its chat abilities.

Empirical Validation

The experimental validation on the MultiWOZ benchmark illustrates FnCTOD's efficacy in enhancing zero-shot DST performance across various open-source and proprietary models without further fine-tuning. With in-context prompting, the approach beats the previous state of the art by 5.6% average joint goal accuracy (JGA), boosting individual results by 4.8% for GPT-3.5 and a remarkable 14% for GPT-4. Additionally, the fine-tuned 13B-parameter LLaMA2-Chat model performs comparably to ChatGPT, underscoring the approach's utility for upgrading moderately sized models to zero-shot DST tasks.
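For readers unfamiliar with the metric, JGA counts a turn as correct only if the entire predicted dialogue state matches the gold state exactly. A minimal sketch (slot names are illustrative):

```python
def joint_goal_accuracy(predictions, references):
    """Fraction of turns whose full predicted dialogue state exactly
    matches the gold state: every slot-value pair must agree."""
    assert len(predictions) == len(references)
    correct = sum(1 for pred, gold in zip(predictions, references)
                  if pred == gold)
    return correct / len(predictions)

# Two turns: the first state matches exactly, the second disagrees on a slot.
preds = [
    {"hotel-area": "north", "hotel-price_range": "cheap"},
    {"hotel-area": "north"},
]
golds = [
    {"hotel-area": "north", "hotel-price_range": "cheap"},
    {"hotel-area": "south"},
]
print(joint_goal_accuracy(preds, golds))  # 0.5
```

The all-or-nothing nature of JGA is what makes multi-point gains like the 5.6% reported here substantial: a single wrong slot invalidates the whole turn.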

Methodological Insights

FnCTOD recasts DST as a function-calling task, converting each domain schema into a function specification embedded in the dialogue prompt. Under this formulation, the LLM tracks the dialogue state by generating function calls whose arguments encode the user's accumulated constraints. By decomposing function call generation and leveraging in-context prompting, the method improves markedly over non-decomposed variants, and fine-tuning on a modest, diverse dataset proves sufficient for strong zero-shot generalization.
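The decoding side can be sketched as parsing the model's generated call and folding its arguments into the running dialogue state. The `<fn_call>` tag and JSON serialization below are illustrative assumptions about the output format, not the paper's exact convention:

```python
import json
import re

def parse_function_call(generation):
    """Extract a function call from model output. For illustration we
    assume the call is wrapped in <fn_call>...</fn_call> tags and
    serialized as JSON with 'name' and 'arguments' fields."""
    match = re.search(r"<fn_call>(.*?)</fn_call>", generation, re.DOTALL)
    if match is None:
        return None  # the turn produced no state update
    call = json.loads(match.group(1))
    return call["name"], call["arguments"]

def update_dialogue_state(state, generation):
    """Each turn's call updates only the slots it mentions, so the full
    dialogue state is accumulated across turns."""
    parsed = parse_function_call(generation)
    if parsed is None:
        return state
    name, args = parsed
    domain = name.removeprefix("find_")  # e.g. find_hotel -> hotel
    for slot, value in args.items():
        state[f"{domain}-{slot}"] = value
    return state

state = {}
turn = '<fn_call>{"name": "find_hotel", "arguments": {"area": "north"}}</fn_call>'
state = update_dialogue_state(state, turn)
```

Parsing into a structured state like this is what lets the same prompt format serve both state tracking and, downstream, response generation.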

Theoretical and Practical Implications

From a theoretical standpoint, FnCTOD advances our understanding of leveraging LLMs for task-specific functions without the stringent need for domain-specific training data, enhancing the adaptability of conversational systems. Practically, the approach paves the way for scalable and efficient deployment of chatbots and virtual assistants across myriad domains, significantly reducing the overhead associated with model training and data annotation for new domains.

Future Directions

While FnCTOD provides a robust framework for incorporating DST into TOD systems through LLMs, accuracy must improve further before practical deployment. Future advances in LLM capabilities, coupled with methodological refinements to FnCTOD, are expected to raise performance further. Moreover, developing more realistic evaluation protocols for TOD systems, especially for response generation, will be crucial to realizing the full potential of such conversational models in real-world applications.

Concluding Remarks

FnCTOD represents a pivotal step forward in the quest to utilize LLMs for the dynamic and diverse field of task-oriented dialogues. By enabling zero-shot DST through function calling, this approach mitigates significant barriers to deploying conversational systems across various domains, offering a blueprint for future innovations in the field of conversational AI.

Authors (10)
  1. Zekun Li
  2. Zhiyu Zoey Chen
  3. Mike Ross
  4. Patrick Huber
  5. Seungwhan Moon
  6. Zhaojiang Lin
  7. Xin Luna Dong
  8. Adithya Sagar
  9. Xifeng Yan
  10. Paul A. Crook