
Benchmarking Natural Language Understanding Services for building Conversational Agents (1903.05566v3)

Published 13 Mar 2019 in cs.CL and cs.LG

Abstract: We have recently seen the emergence of several publicly available Natural Language Understanding (NLU) toolkits, which map user utterances to structured, but more abstract, Dialogue Act (DA) or Intent specifications, while making this process accessible to the lay developer. In this paper, we present the first wide coverage evaluation and comparison of some of the most popular NLU services, on a large, multi-domain (21 domains) dataset of 25K user utterances that we have collected and annotated with Intent and Entity Type specifications and which will be released as part of this submission. The results show that on Intent classification Watson significantly outperforms the other platforms, namely, Dialogflow, LUIS and Rasa; though these also perform well. Interestingly, on Entity Type recognition, Watson performs significantly worse due to its low Precision. Again, Dialogflow, LUIS and Rasa perform well on this task.

Overview of Benchmarking Natural Language Understanding Services for Conversational Agents

The paper "Benchmarking Natural Language Understanding Services for building Conversational Agents" presents a comprehensive evaluation of several widely-used Natural Language Understanding (NLU) platforms. The authors, Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and Verena Rieser, focus on comparing commercial tools like IBM Watson, Google's Dialogflow, Microsoft's LUIS, and the open-source Rasa. This evaluation is executed on a dataset of 25,716 user utterances, annotated for 64 intents and 54 entity types. The principal goal is to provide an unbiased, systematic analysis of the strengths and weaknesses of these NLU services, particularly for developers in need of effective tools for building Spoken Dialogue Systems (SDS) or conversational agents.

Methodology and Dataset

The dataset used in this evaluation is novel, comprising multi-domain utterances intended for a home assistant robot. The data reflects real user interactions, collected via Amazon Mechanical Turk, and covers a wide variety of tasks such as setting alarms, playing media, and managing emails. The annotations specify both intents and entity types within user utterances, with inter-annotator agreement reflecting substantial consistency (κ = 0.69). The evaluation uses 10-fold cross-validation, with pairwise t-tests for statistical comparison of platform performance.
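As a rough sketch of the significance-testing step, the snippet below runs a paired t-test on per-fold F1 scores for two platforms evaluated on the same ten folds. The scores are invented for illustration and are not the paper's numbers.

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative per-fold intent F1 scores for two platforms evaluated on
# the same 10 folds (values are invented, not the paper's results).
f1_platform_a = np.array([0.88, 0.87, 0.89, 0.88, 0.90, 0.87, 0.88, 0.89, 0.88, 0.88])
f1_platform_b = np.array([0.86, 0.85, 0.87, 0.86, 0.88, 0.85, 0.86, 0.87, 0.86, 0.85])

# The folds are shared across platforms, so the samples are paired:
# compare fold-by-fold rather than pooling the scores.
t_stat, p_value = ttest_rel(f1_platform_a, f1_platform_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A p-value below 0.05 would support calling the difference significant.
```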

Results

For intent classification, IBM Watson significantly outperforms the other platforms with an F1 score of 0.882, illustrating its robustness in mapping user utterances to intent labels. However, its performance on entity recognition is notably poor, primarily due to a high rate of false positives, which yields a substantially lower F1 score of 0.488. LUIS, by contrast, shows balanced performance on both tasks, achieving the top entity F1 score thanks to its effective handling of Named Entity Recognition (NER).
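The precision effect is easy to see with micro-averaged scores over predicted (entity type, value) pairs. The sketch below uses made-up gold and predicted annotations, not the paper's data, to show how false positives alone can halve F1 even at perfect recall.

```python
# Micro-averaged precision/recall/F1 over sets of (type, value) pairs.
def micro_prf(gold_sets, pred_sets):
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))  # correct predictions
    fp = sum(len(p - g) for g, p in zip(gold_sets, pred_sets))  # spurious predictions
    fn = sum(len(g - p) for g, p in zip(gold_sets, pred_sets))  # missed annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One invented utterance: the gold annotation has a single entity, but the
# service over-predicts two extra ones (perfect recall, low precision).
gold = [{("time", "seven am")}]
pred = [{("time", "seven am"), ("person", "me"), ("place", "up")}]
p, r, f = micro_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.33 R=1.00 F1=0.50
```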

The paper emphasizes that while Watson demonstrates superior intent classification, its entity recognition is hindered by low precision. In contrast, Dialogflow and Rasa exhibit comparable, though slightly lower, performance across both tasks. The evaluation also highlights pragmatic considerations for developers selecting an NLU tool, such as support for multiple intents per utterance or for exploiting dialogue context, features that none of the reviewed platforms support natively.

Implications and Future Directions

The findings underscore the importance of choosing the right NLU platform based on specific application needs and performance requirements. For instance, applications requiring high precision in entity types might prefer LUIS, whereas scenarios prioritizing intent recognition accuracy would benefit from Watson's capabilities.

For future work, the authors propose improving the dataset's quality and collecting spoken rather than typed user inputs in order to evaluate the impact of Automatic Speech Recognition (ASR) errors. Such developments could yield insights into the integration of speech recognition with NLU, further enhancing the practical applicability of SDS.

Overall, this comparative evaluation provides critical insights into the current capabilities and limitations of popular NLU services, serving as a valuable reference for researchers and developers in the conversational AI domain. The release of the annotated dataset and evaluation toolkit amplifies its impact, offering a resource for further exploration and enhancement of NLU technologies.

Authors (4)
  1. Xingkun Liu (6 papers)
  2. Arash Eshghi (23 papers)
  3. Pawel Swietojanski (11 papers)
  4. Verena Rieser (58 papers)
Citations (242)