Are Pretrained Transformers Robust in Intent Classification? A Missing Ingredient in Evaluation of Out-of-Scope Intent Detection (2106.04564v3)

Published 8 Jun 2021 in cs.CL and cs.AI

Abstract: Pre-trained Transformer-based models were reported to be robust in intent classification. In this work, we first point out the importance of in-domain out-of-scope detection in few-shot intent recognition tasks and then illustrate the vulnerability of pre-trained Transformer-based models against samples that are in-domain but out-of-scope (ID-OOS). We construct two new datasets, and empirically show that pre-trained models do not perform well on both ID-OOS examples and general out-of-scope examples, especially on fine-grained few-shot intent detection tasks. To figure out how the models mistakenly classify ID-OOS intents as in-scope intents, we further conduct analysis on confidence scores and the overlapping keywords, as well as point out several prospective directions for future work. Resources are available on https://github.com/jianguoz/Few-Shot-Intent-Detection.

Authors (7)
  1. Jianguo Zhang (97 papers)
  2. Kazuma Hashimoto (34 papers)
  3. Yao Wan (70 papers)
  4. Zhiwei Liu (114 papers)
  5. Ye Liu (153 papers)
  6. Caiming Xiong (337 papers)
  7. Philip S. Yu (592 papers)
Citations (35)

Summary

An Analysis of Pre-trained Transformers' Robustness in Intent Classification

This paper presents a thorough examination of the robustness of pre-trained Transformer-based models in intent classification, specifically within the context of out-of-scope (OOS) intent detection. The research highlights a critical aspect that is often overlooked in prior studies—the importance of handling in-domain, out-of-scope (ID-OOS) intents, particularly in fine-grained few-shot learning scenarios.

The study evaluates the classification effectiveness of state-of-the-art pre-trained models, including BERT, RoBERTa, ALBERT, ELECTRA, and ToD-BERT. The authors construct two new datasets, CLINC-Single-Domain-OOS and BANKING77-OOS, which are pivotal because they introduce semantically similar ID-OOS examples alongside the more traditionally considered OOD-OOS examples. Both datasets are used to assess the models in few-shot settings, measuring in-scope accuracy, OOS recall, and OOS precision.
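The authors' evaluation code is available in the linked repository; purely as an illustration of the protocol described above, the sketch below shows one way the three reported metrics could be computed from gold and predicted labels. It assumes a single `oos` label covers both OOD-OOS and ID-OOS examples; the function name and label convention are assumptions, not the paper's implementation.

```python
import numpy as np

def evaluate_oos(y_true, y_pred, oos_label="oos"):
    """Compute in-scope accuracy, OOS recall, and OOS precision.

    `y_true` and `y_pred` are parallel sequences of intent labels; any example
    whose gold label equals `oos_label` is treated as out-of-scope (covering
    both OOD-OOS and ID-OOS examples), everything else as in-scope.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    in_scope = y_true != oos_label

    # In-scope accuracy: in-scope examples whose intent is predicted correctly.
    in_scope_acc = (y_pred[in_scope] == y_true[in_scope]).mean()
    # OOS recall: true out-of-scope examples that were flagged as OOS.
    oos_recall = (y_pred[~in_scope] == oos_label).mean()
    # OOS precision: OOS predictions that are actually out-of-scope.
    predicted_oos = y_pred == oos_label
    oos_precision = (y_true[predicted_oos] == oos_label).mean() if predicted_oos.any() else 0.0

    return {
        "in_scope_accuracy": float(in_scope_acc),
        "oos_recall": float(oos_recall),
        "oos_precision": float(oos_precision),
    }

# Example: two in-scope and two out-of-scope utterances.
print(evaluate_oos(
    y_true=["transfer", "oos", "card_arrival", "oos"],
    y_pred=["transfer", "card_arrival", "card_arrival", "oos"],
))
# {'in_scope_accuracy': 1.0, 'oos_recall': 0.5, 'oos_precision': 1.0}
```

Under this convention, misclassifying an ID-OOS utterance as a semantically similar in-scope intent lowers OOS recall without affecting in-scope accuracy, which is exactly the failure mode the paper highlights.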

The empirical results are telling: while the pre-trained models handle OOD-OOS detection reasonably well, they show substantial limitations in identifying ID-OOS examples. RoBERTa performs relatively better than the other evaluated models but still struggles with ID-OOS detection. A critical insight is that even when overlapping keywords shared between in-scope and ID-OOS examples are masked, the models still assign ID-OOS inputs confidence scores comparable to those of in-scope intents, exposing a pervasive over-confidence problem. The difficulty is most pronounced on fine-grained datasets such as BANKING77-OOS, where intent classes are more specialized and semantically overlapping.
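The paper analyzes confidence scores rather than prescribing a particular detection rule, but the over-confidence issue is easiest to appreciate against the standard baseline of thresholding the maximum softmax probability: if ID-OOS inputs receive in-scope-level confidence, no threshold can separate them. The sketch below assumes a Hugging Face Transformers-style sequence classification model exposing `.logits`; the threshold `tau`, the `oos` label, and the helper name are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_with_rejection(model, tokenizer, texts, id2label, tau=0.7, device="cpu"):
    """Confidence-thresholding baseline: label an utterance as out-of-scope
    whenever the classifier's maximum softmax probability falls below `tau`.

    Assumes a Hugging Face-style sequence classification model; `id2label`
    maps class indices to intent names.
    """
    model.eval().to(device)
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    probs = F.softmax(model(**batch).logits, dim=-1)   # (batch_size, num_intents)
    confidences, predictions = probs.max(dim=-1)
    return [
        id2label[idx.item()] if conf.item() >= tau else "oos"
        for conf, idx in zip(confidences, predictions)
    ]
```

Under such a rule, the paper's observation that ID-OOS inputs retain high confidence even after keyword masking means those inputs would be accepted as in-scope regardless of how `tau` is tuned.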

The paper argues that the performance gaps observed across experimental domains, most notably on the fine-grained BANKING77-OOS dataset, call for further exploration of more advanced techniques for reliably identifying ID-OOS examples. The intricacies of these datasets point to a discrepancy between current model capabilities and the real-world complexity of intent classification in dialog systems.

The findings carry important implications for developing more robust intent detection methodologies. Practically, they call for approaches that can adapt to new, unforeseen intents and contexts in conversational AI systems. Theoretically, the paper opens pathways for future research on separating closely related intent classes, advocating methods that leverage contextually richer semantic embeddings and more precise delineation of intent boundaries.

Overall, this paper underscores essential yet unresolved challenges in intent classification, particularly in ensuring robustness against in-domain variations that current Transformer models handle inadequately. Exploring alternative architectures, improved pre-training objectives, or hybrid models could prove vital in bridging these gaps. Given the rapid evolution of Transformer-based architectures, these insights are valuable for improving the efficacy of goal-oriented dialog systems and expanding their applicability.