
Similarity encoding for learning with dirty categorical variables (1806.00979v1)

Published 4 Jun 2018 in cs.LG, cs.AI, and stat.ML

Abstract: For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.

Citations (194)

Summary

  • The paper introduces similarity encoding to capture inherent string similarities, outperforming traditional one-hot encoding in handling dirty categorical variables.
  • It employs n-gram similarity measures and dimensionality reduction techniques to efficiently process high-cardinality, non-curated datasets.
  • Empirical results on seven diverse datasets demonstrate robust gains, making the approach valuable for applications in healthcare, marketing, and finance.

Similarity Encoding for Learning with Dirty Categorical Variables

The paper, "Similarity Encoding for Learning with Dirty Categorical Variables," tackles the pervasive issue of using categorical variables in statistical learning, particularly when dealing with high-cardinality or non-curated data. The authors propose a novel encoding method—similarity encoding—as an advancement over traditional methods like one-hot encoding, which often prove inefficient in dealing with such complex datasets.

Categorical variables are commonly treated as discrete entities and encoded into numerical feature vectors, but in practice these variables can have high cardinality and redundancy due to data errors such as typographical mistakes, aliasing, and inconsistent formatting. Traditional one-hot encoding creates one feature per category but does not leverage any inherent similarities between categories, which makes it inefficient and inaccurate on non-curated datasets. Recognizing this, the paper suggests that instead of preprocessing the data to deduplicate entries (the common approach in databases), the redundancy can be exposed directly to the learning algorithm.
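
As a minimal illustration of the problem (the job-title values below are invented, and a recent scikit-learn is assumed for the `sparse_output` argument), one-hot encoding maps near-duplicate strings to orthogonal columns that carry no notion of their resemblance:

```python
# Minimal sketch: one-hot encoding treats near-duplicate "dirty" categories
# as unrelated columns. The job titles are illustrative, not from the paper.
from sklearn.preprocessing import OneHotEncoder

dirty_job_titles = [["Police Officer"],
                    ["police officer"],
                    ["Police Oficer"],   # typo: still gets its own column
                    ["Accountant"]]

encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X = encoder.fit_transform(dirty_job_titles)
print(encoder.categories_[0])  # four distinct categories
print(X)                       # orthogonal rows: no similarity information
```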

The proposed method, similarity encoding, is a generalization of one-hot encoding that builds feature vectors from string similarities within the data. The paper demonstrates significant practical gains when applying similarity encoding to real-world, non-curated tables, evaluated on prediction tasks over seven distinct datasets. Unlike one-hot encoding, similarity encoding can manage high cardinality and gracefully handles categories unseen in the training set.
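
A minimal sketch of the idea follows. It uses a Jaccard-style overlap of character 3-grams as a stand-in for the paper's n-gram similarity (the exact definition in the paper may differ), and the category strings are invented:

```python
# Sketch of similarity encoding: each value is represented by its 3-gram
# similarity to every category seen at training time.
def char_ngrams(s, n=3):
    s = f" {s.lower()} "                 # pad so short strings still yield n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)   # Jaccard overlap of n-gram sets

def similarity_encode(values, train_categories, n=3):
    return [[ngram_similarity(v, c, n) for c in train_categories] for v in values]

train_categories = ["Police Officer", "Accountant", "Nurse"]
# A misspelled or unseen value still gets a meaningful, non-zero encoding.
print(similarity_encode(["police oficer", "Senior Accountant"], train_categories))
```

Because the encoding dimension equals the number of training categories, very high-cardinality columns motivate the dimensionality reduction discussed further below.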

Numerically, the paper reports that similarity encoding, particularly with 3-gram similarity, consistently outperforms other string similarity measures such as Levenshtein-ratio and Jaro-Winkler, as well as traditional encoding techniques, across model types including gradient boosting and ridge regression. The improvement remains robust even when dimensionality reduction is applied, demonstrating the approach's scalability and efficiency in practical scenarios.

The authors conducted empirical studies on diverse real-world datasets with high-cardinality categorical variables, providing a comprehensive evaluation of similarity encoding against classic techniques. The method captures morphological resemblance among categories and adapts to out-of-sample categories, both common issues when working with large, dirty datasets.

The paper's exploration of dimensionality reduction techniques further strengthens the case for similarity encoding. By employing random projections, or by keeping only the similarities to a subset of prototype categories, the computational expense of the approach can be managed without significant performance loss.
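
The sketch below illustrates both strategies on a placeholder similarity matrix; the shapes are arbitrary, and the random prototype subset is only a stand-in for whatever selection rule the paper actually uses:

```python
# Two ways to shrink a similarity-encoded matrix, assuming X_sim has one
# column per training category (placeholder random data stands in for it).
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X_sim = rng.random((1000, 5000))      # 1000 rows, 5000 training categories

# Option 1: random projection of the full similarity matrix to 100 dimensions.
X_proj = GaussianRandomProjection(n_components=100,
                                  random_state=0).fit_transform(X_sim)

# Option 2: keep similarities to a subset of prototype categories only.
# A random subset is used here purely as a stand-in for a real selection rule.
prototype_idx = rng.choice(X_sim.shape[1], size=100, replace=False)
X_proto = X_sim[:, prototype_idx]

print(X_proj.shape, X_proto.shape)    # both (1000, 100)
```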

In terms of implications, the research suggests significant improvements in handling non-curated categorical data within predictive modeling, paving the way for more accurate and computationally efficient data processing in AI applications. Practically, this method is particularly suited to fields relying on expansive tabular data, such as healthcare, marketing, and finance, where datasets often have high-cardinality categorical variables with many data redundancies.

In conclusion, the proposed similarity encoding method represents a significant step forward in the handling of categorical data. It not only improves the accuracy of predictive models but also reduces computational demands when dealing with high-cardinality, non-curated datasets, pointing towards a new paradigm in handling categorical variables for machine learning practitioners. Future research may refine the similarity metrics further and integrate this encoding with a wider range of machine learning models, broadening the applicability and effectiveness of the approach across datasets and applications.