
Subword Semantic Hashing for Intent Classification on Small Datasets (1810.07150v3)

Published 16 Oct 2018 in cs.CL

Abstract: In this paper, we introduce the use of Semantic Hashing as embedding for the task of Intent Classification and achieve state-of-the-art performance on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry state-of-the-art Deep Learning based systems. Semantic Hashing is an attempt to overcome such a challenge and learn robust text classification. Current word embedding based methods are dependent on vocabularies. One of the major drawbacks of such methods is out-of-vocabulary terms, especially when having small training datasets and using a wider vocabulary. This is the case in Intent Classification for chatbots, where typically small datasets are extracted from internet communication. Two problems arise with the use of internet communication. First, such datasets miss a lot of terms in the vocabulary to use word embeddings efficiently. Second, users frequently make spelling errors. Typically, the models for intent classification are not trained with spelling errors and it is difficult to think about ways in which users will make mistakes. Models depending on a word vocabulary will always face such issues. An ideal classifier should handle spelling errors inherently. With Semantic Hashing, we overcome these challenges and achieve state-of-the-art results on three datasets: AskUbuntu, Chatbot, and Web Application. Our benchmarks are available online: https://github.com/kumar-shridhar/Know-Your-Intent

Citations (30)

Summary

  • The paper proposes a novel semantic hashing technique using n-gram sub-tokens to overcome out-of-vocabulary issues and accurately classify intent on small datasets.
  • It leverages data augmentation with synonym replacement to balance classes and improve classifier robustness across multiple dialogue system datasets.
  • The method achieves state-of-the-art performance with an F1-score up to 0.996 and rapid training/inference, making it ideal for real-time applications.

Semantic Hashing for Enhanced Intent Classification on Limited Data

This paper presents a methodological advance in intent classification, specifically targeting scenarios with limited data availability, such as in chatbots and other dialogue systems. The authors propose the use of Semantic Hashing, combined with subword tokenization, as a robust feature representation technique, facilitating the accurate classification of intents from short textual inputs.

The Context and Challenge

Intent classification is a cornerstone of numerous applications, such as customer service automation and conversational interfaces. However, the task becomes challenging when datasets are small, and user-generated text contains spelling errors or diverse phrasing. Traditional embedding methods often fall short due to the out-of-vocabulary (OOV) problem and the requirement for extensive vocabulary and labeled data for training.

Semantic Hashing addresses OOV limitations and spelling inconsistencies by passing the textual input through a hashing function that produces sub-tokens: smaller parts of words, such as character trigrams. The approach yields a fixed-size feature vector, circumventing vocabulary constraints and enabling robust feature representation even from limited data.
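The sub-token step can be sketched in a few lines of Python. This is a minimal illustration, assuming the common convention of padding each word with `#` boundary markers before extracting trigrams; the function names are ours, not the paper's:

```python
def subword_tokens(word, n=3):
    """Split a word into character n-grams after padding with '#' boundary markers."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def semantic_hash(sentence, n=3):
    """Hash a sentence into the list of subword n-grams of its words."""
    tokens = []
    for word in sentence.lower().split():
        tokens.extend(subword_tokens(word, n))
    return tokens

# "open" -> ['#op', 'ope', 'pen', 'en#']
print(subword_tokens("open"))
# A misspelling such as "opne" still shares sub-tokens (e.g. '#op') with "open",
# which is why the representation degrades gracefully under typos.
print(semantic_hash("opne firefox"))
```

Because a typo only corrupts the n-grams that span the error, most sub-tokens of a misspelled word still match those of the correct spelling, which is the intuition behind the method's robustness.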

Methodology

Feature extraction via Semantic Hashing proceeds by converting each word into character n-grams and then constructing a Vector Space Model (VSM) over these subword features. The resulting feature vectors serve as inputs to a variety of classifiers, including Ridge Classifier, Linear SVC, and Random Forest, to achieve accurate intent classification.
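A minimal sketch of this pipeline, assuming scikit-learn's `TfidfVectorizer` with a custom analyzer supplying the subword features. The toy texts and labels below are invented for illustration; the paper's experiments use the AskUbuntu, Chatbot, and WebApplication corpora:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier

def subword_analyzer(sentence, n=3):
    """Map a sentence to its subword trigrams (the semantic-hash features)."""
    tokens = []
    for word in sentence.lower().split():
        padded = f"#{word}#"
        tokens.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return tokens

# Hypothetical toy intent data, two examples per class.
texts = ["how do I install java", "install python please",
         "when is the next train", "next train to downtown"]
labels = ["software", "software", "transit", "transit"]

vectorizer = TfidfVectorizer(analyzer=subword_analyzer)  # VSM over subword tokens
X = vectorizer.fit_transform(texts)
clf = RidgeClassifier().fit(X, labels)

# Even a misspelled query shares enough trigrams with the training data to classify.
print(clf.predict(vectorizer.transform(["instal jaava"])))
```

Passing a callable as `analyzer` replaces the default word tokenizer entirely, so the VSM dimensions are subword n-grams rather than whole-word vocabulary entries, which is what sidesteps the OOV problem.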

The authors explore data augmentation to enhance the training corpus, employing dictionary-based synonym replacement to expand and balance classes, further fortifying the classifier against the typical variability found in small datasets.
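Dictionary-based synonym replacement can be sketched as below. The `SYNONYMS` table is a hand-written, hypothetical stand-in for whatever synonym dictionary the authors actually consult; the mechanics of the replacement are what matters:

```python
import random

# Illustrative synonym dictionary (a real setup would use a thesaurus lookup).
SYNONYMS = {
    "install": ["setup", "add"],
    "remove": ["delete", "uninstall"],
    "error": ["failure", "fault"],
}

def augment(sentence, rng=random):
    """Create one augmented variant by swapping each known word for a synonym."""
    words = []
    for word in sentence.lower().split():
        options = SYNONYMS.get(word)
        words.append(rng.choice(options) if options else word)
    return " ".join(words)

random.seed(0)
print(augment("how to install and remove packages"))
```

Generating several such variants per under-represented intent both enlarges the corpus and balances the class distribution, which is the stated purpose of the augmentation step.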

Results and Comparisons

The proposed Semantic Hashing technique achieves state-of-the-art performance on the Chatbot, AskUbuntu, and WebApplication datasets. The paper's empirical analysis reports competitive results against existing Natural Language Understanding (NLU) systems, such as Watson, Dialogflow, and Rasa. Notably, the method exhibits superior adaptability and lower computational cost, with training times measured in seconds and inference times in milliseconds.

For numerical perspective:

  • On the Chatbot corpus, the method achieved an F1-score of up to 0.996.
  • Across all datasets, the method's average performance rivaled prominent NLU system benchmarks, attesting to its generality and robustness.

Implications and Future Directions

The incorporation of subword Semantic Hashing offers a substantial enhancement in handling small-dataset challenges, suggesting broad applicability across domains with limited data resources. The rapid training and testing times reported in the paper make it feasible for real-time deployment in practical applications such as chatbot platforms.

Future explorations should consider the broader application of Semantic Hashing across different domains and datasets to confirm its versatility and identify any domain-specific adjustments required. Additionally, comparative analyses with emerging feature representation techniques could provide insights into further improvements and performance optimizations.

In conclusion, this paper contributes a powerful, efficient approach to intent classification amidst constraints of small datasets and diverse linguistic variations, enhancing the robustness and applicability of machine learning models in real-world conversational AI systems.