Generating Hard-Negative Out-of-Scope Data with ChatGPT for Intent Classification (2403.05640v1)

Published 8 Mar 2024 in cs.CL

Abstract: Intent classifiers must be able to distinguish when a user's utterance does not belong to any supported intent to avoid producing incorrect and unrelated system responses. Although out-of-scope (OOS) detection for intent classifiers has been studied, previous work has not yet studied changes in classifier performance against hard-negative out-of-scope utterances (i.e., inputs that share common features with in-scope data, but are actually out-of-scope). We present an automated technique to generate hard-negative OOS data using ChatGPT. We use our technique to build five new hard-negative OOS datasets, and evaluate each against three benchmark intent classifiers. We show that classifiers struggle to correctly identify hard-negative OOS utterances more than general OOS utterances. Finally, we show that incorporating hard-negative OOS data for training improves model robustness when detecting hard-negative OOS data and general OOS data. Our technique, datasets, and evaluation address an important void in the field, offering a straightforward and inexpensive way to collect hard-negative OOS data and improve intent classifiers' robustness.
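
To make the evaluation setting concrete, the following is a minimal sketch of how rejection rates on general and hard-negative OOS utterances might be compared. It assumes a confidence-threshold detector, where an utterance is rejected when the top softmax probability falls below a cutoff. The abstract does not specify the detection rule or the classifiers used, so the threshold, intent names, and logits below are hypothetical illustrations, not the authors' setup or results.

```python
import math
from typing import Dict, List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_intent(logits: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the top intent, or None (out-of-scope) if confidence < threshold."""
    labels = list(logits)
    probs = softmax([logits[label] for label in labels])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None


def oos_recall(batch: List[Dict[str, float]], threshold: float) -> float:
    """Fraction of out-of-scope utterances that the detector rejects."""
    rejected = sum(predict_intent(item, threshold) is None for item in batch)
    return rejected / len(batch)


# Hypothetical classifier logits, for illustration only.
# General OOS inputs tend to produce flat, low-confidence distributions.
general_oos = [
    {"book_flight": 0.2, "check_balance": 0.1, "play_music": 0.0},
    {"book_flight": 0.0, "check_balance": 0.3, "play_music": 0.1},
]
# A hard negative shares surface features with an in-scope intent, so the
# classifier may be confidently wrong (first item here).
hard_negative_oos = [
    {"book_flight": 4.0, "check_balance": 0.1, "play_music": 0.0},
    {"book_flight": 0.3, "check_balance": 0.2, "play_music": 0.1},
]

THRESHOLD = 0.7  # assumed cutoff, not taken from the paper
general_recall = oos_recall(general_oos, THRESHOLD)
hard_recall = oos_recall(hard_negative_oos, THRESHOLD)
```

In this toy data the detector rejects every general OOS input but accepts one of the two hard negatives as `book_flight`, which is the kind of gap the abstract reports classifiers exhibiting.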

Authors (3)
  1. Zhijian Li
  2. Stefan Larson
  3. Kevin Leach