An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction
The paper introduces a new dataset designed for evaluating intent classification and out-of-scope prediction in task-oriented dialogue systems. The goal is to support the development of more robust systems that can handle queries both within and outside their predefined intent classes.
Motivation
Task-oriented dialogue systems must identify user intents in order to respond accurately. A key challenge arises when a system encounters out-of-scope queries: those that do not fit any supported intent. Existing datasets inadequately address this issue because they typically cover only well-defined intent classes. The introduced dataset closes this gap by including both in-scope and out-of-scope queries.
Dataset Overview
The dataset comprises 23,700 queries: 22,500 in-scope examples spanning 150 intents across ten domains, plus 1,200 out-of-scope queries. The data was collected via crowdsourcing, with tasks prompting workers to write the commands and questions they would pose to an AI system. The dataset is divided into training, validation, and test sets.
Variants
The dataset features several variations, such as:
- Small: reduced to 50 training queries per intent.
- Imbalanced: the number of training queries varies across intents.
- OOS+: additional out-of-scope training examples, used to assess robustness.
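The Small and Imbalanced variants amount to subsampling the full training set per intent. A minimal sketch of that procedure, assuming the training data is available as a list of (query, intent) pairs (the function and variable names here are illustrative, not from the paper's release):

```python
import random
from collections import defaultdict

def subsample_per_intent(examples, cap, seed=0):
    """Keep at most `cap` randomly chosen training queries per intent.

    `examples` is a list of (query, intent) pairs.
    """
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for query, intent in examples:
        by_intent[intent].append(query)

    subsampled = []
    for intent, queries in by_intent.items():
        rng.shuffle(queries)
        subsampled.extend((q, intent) for q in queries[:cap])
    return subsampled

# The Small variant would correspond to cap=50; an imbalanced variant
# could be built by passing a different cap per intent.
```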
Evaluation and Results
The paper evaluates several classifiers on this dataset, including an SVM, an MLP, FastText, a CNN, BERT, and the platforms DialogFlow and Rasa. BERT consistently achieves the highest in-scope accuracy, surpassing 96%. However, every model struggles with out-of-scope prediction: the best-performing method reaches an out-of-scope recall of only 66%.
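The two numbers reported above correspond to two separate metrics: accuracy on in-scope queries, and recall on out-of-scope queries (the fraction of out-of-scope queries correctly flagged). A minimal sketch of how they could be computed, assuming predictions and gold labels are lists of strings with "oos" marking out-of-scope (the "oos" label string is an assumption for illustration):

```python
def inscope_accuracy(preds, golds):
    """Accuracy restricted to queries whose gold label is in scope."""
    pairs = [(p, g) for p, g in zip(preds, golds) if g != "oos"]
    return sum(p == g for p, g in pairs) / len(pairs)

def oos_recall(preds, golds):
    """Fraction of gold out-of-scope queries predicted as out-of-scope."""
    pairs = [(p, g) for p, g in zip(preds, golds) if g == "oos"]
    return sum(p == "oos" for p, _ in pairs) / len(pairs)
```

Reporting the two metrics separately matters: a model can score high in-scope accuracy while flagging almost no out-of-scope queries, which a single pooled accuracy number would hide.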
Different strategies for out-of-scope detection were explored:
- oos-train: training with an additional out-of-scope intent class.
- oos-threshold: thresholding the classifier's prediction confidence; queries below the threshold are flagged as out-of-scope.
- oos-binary: a two-stage approach that first decides whether a query is in scope, then classifies its intent.
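The oos-threshold strategy above can be sketched compactly: take the classifier's output scores, and if the top class's probability falls below a chosen threshold, reject the query as out-of-scope. This is a generic sketch, not the paper's implementation; the threshold value is a tunable assumption:

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logits, labels, threshold=0.7):
    """oos-threshold: return the top intent, or "oos" when the
    model's confidence in it falls below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "oos"
    return labels[best]
```

The choice of threshold trades in-scope accuracy against out-of-scope recall: raising it catches more out-of-scope queries but also rejects more genuine in-scope ones.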
Implications
The dataset provides a more realistic benchmark for developing systems that must distinguish supported from unsupported queries. The findings highlight how poorly current models handle out-of-scope queries, marking this as a crucial area for future research, and the dataset paves the way for more adaptive and error-tolerant conversational AI systems.
Prior Work and Contributions
In contrast to prior datasets, which classify only well-covered queries and offer limited intent diversity, this dataset emphasizes real-world applicability by explicitly accounting for out-of-scope handling. By filling this gap, it enables a more comprehensive evaluation of dialogue systems under conditions that better represent genuine user interactions.
Conclusion
This paper contributes significantly to intent classification and out-of-scope prediction by introducing a dataset that allows rigorous testing of dialogue systems. While current methods achieve high in-scope accuracy, their limited performance on out-of-scope queries underscores the need for further research into dialogue systems' robustness and reliability. The dataset and findings forge a path toward more comprehensive conversational AI benchmarks.