Survey of state-of-the-art mixed data clustering algorithms (1811.04364v6)

Published 11 Nov 2018 in cs.LG, cs.AI, and stat.ML

Abstract: Mixed data comprises both numeric and categorical features, and mixed datasets occur frequently in many domains, such as health, finance, and marketing. Clustering is often applied to mixed datasets to find structures and to group similar objects for further analysis. However, clustering mixed data is challenging because it is difficult to directly apply mathematical operations, such as summation or averaging, to the feature values of these datasets. In this paper, we present a taxonomy for the study of mixed data clustering algorithms by identifying five major research themes. We then present a state-of-the-art review of the research works within each research theme. We analyze the strengths and weaknesses of these methods with pointers for future research directions. Lastly, we present an in-depth analysis of the overall challenges in this field, highlight open research questions and discuss guidelines to make progress in the field.

Summary

The paper surveys state-of-the-art clustering algorithms for mixed numerical and categorical data, categorizing them into partitional, hierarchical, model-based, neural network-based, and other methods.
The survey highlights adaptations of standard clustering methods, such as K-prototypes (an extension of K-means), and techniques such as Gower's similarity measure for handling datasets with mixed feature types.
Key challenges identified include the lack of consensus on robust similarity measures, the need for scalable algorithms for large mixed datasets, and the scarcity of public datasets and algorithm implementations for comparison.

Survey of State-of-the-Art Mixed Data Clustering Algorithms

The reviewed paper presents a comprehensive survey of clustering algorithms tailored for mixed data, which consists of both numerical and categorical features. Such datasets are prevalent in varied domains including health, finance, and marketing. The challenge in clustering mixed data arises primarily from the difficulty in applying standard mathematical operations directly to categorical data, which necessitates specialized approaches that can accommodate both feature types concurrently.

The authors frame the discourse by outlining five major research themes that have emerged in the domain of mixed data clustering: partitional, hierarchical, model-based, neural network-based, and other miscellaneous methods. This taxonomy serves as a basis for exploring significant contributions within each category and allows for a structured analysis of existing methodologies.

Partitional Clustering

Partitional clustering methods, particularly those derived from K-means, are emphasized due to their computational efficiency and adaptability to large datasets. These algorithms typically involve modifications to the cost functions and distance measures to handle mixed features. For example, the K-prototypes algorithm integrates categorical data into the K-means framework by employing a modified cost function. The survey highlights several enhancements of this foundational approach, incorporating novel distance metrics and sophisticated cluster center definitions to better represent mixed data.
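
To make the idea of a modified cost function concrete, the following is a minimal sketch of a K-prototypes-style iteration. It assumes the commonly used formulation in which the dissimilarity is the squared Euclidean distance on numeric features plus a weight `gamma` times the number of mismatched categorical features; the function names, the value of `gamma`, and the toy data are illustrative assumptions and are not taken from the survey.

```python
from collections import Counter

import numpy as np


def mixed_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on numeric features plus
    gamma times the count of categorical mismatches (assumed form)."""
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    mismatches = sum(a != b for a, b in zip(x_cat, proto_cat))
    return float(np.sum(diff ** 2) + gamma * mismatches)


def assign_clusters(X_num, X_cat, protos_num, protos_cat, gamma=1.0):
    """Assign every object to its nearest prototype."""
    labels = []
    for x_num, x_cat in zip(X_num, X_cat):
        dists = [
            mixed_distance(x_num, x_cat, pn, pc, gamma)
            for pn, pc in zip(protos_num, protos_cat)
        ]
        labels.append(int(np.argmin(dists)))
    return labels


def update_prototypes(X_num, X_cat, labels, k):
    """Recompute prototypes: mean for numeric features, mode for
    categorical features. Assumes every cluster is non-empty."""
    X_num = np.asarray(X_num, dtype=float)
    labels = np.asarray(labels)
    protos_num, protos_cat = [], []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        protos_num.append(X_num[members].mean(axis=0).tolist())
        columns = zip(*(X_cat[i] for i in members))
        protos_cat.append([Counter(col).most_common(1)[0][0] for col in columns])
    return protos_num, protos_cat


# Hypothetical toy data: one numeric and one categorical feature.
X_num = [[1.0], [1.2], [8.0], [8.4]]
X_cat = [["a"], ["a"], ["b"], ["b"]]
labels = assign_clusters(X_num, X_cat, [[1.0], [8.0]], [["a"], ["b"]], gamma=0.5)
protos_num, protos_cat = update_prototypes(X_num, X_cat, labels, k=2)
```

In this sketch `gamma` controls how much a categorical mismatch counts relative to numeric distance, so numeric features generally need to be scaled before a single value of `gamma` is meaningful. A full algorithm would repeat the assignment and update steps until the labels stop changing.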

Hierarchical Clustering

In hierarchical clustering, the key challenge is constructing a similarity matrix that appropriately combines numerical and categorical data. Gower's similarity measure is frequently cited as a practical solution, balancing the contributions of both data types. Because a pairwise matrix must be computed, the cost of hierarchical methods grows at least quadratically with the number of objects. They nevertheless remain relevant because they provide a nested cluster structure, which can be advantageous in applications requiring hierarchical data representations.
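
As an illustration of how such a similarity matrix can be built, here is a minimal sketch of Gower's similarity in its simplest form. It assumes equal feature weights and no missing values: a numeric feature contributes one minus the absolute difference divided by that feature's range, and a categorical feature contributes 1 on a match and 0 otherwise. The function name and the toy data are assumptions made for this example.

```python
import numpy as np


def gower_matrix(X_num, X_cat):
    """Pairwise Gower similarity for mixed data (equal weights,
    no missing values; a simplified form of the measure)."""
    X_num = np.asarray(X_num, dtype=float)
    n, n_num = X_num.shape
    n_cat = len(X_cat[0])
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    varying = ranges > 0  # constant features always score 1

    S = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            diff = np.abs(X_num[i] - X_num[j])
            num_scores = np.ones(n_num)
            num_scores[varying] = 1.0 - diff[varying] / ranges[varying]
            cat_score = sum(1.0 for a, b in zip(X_cat[i], X_cat[j]) if a == b)
            S[i, j] = S[j, i] = (num_scores.sum() + cat_score) / (n_num + n_cat)
    return S


# Hypothetical toy data: age (numeric) and colour (categorical).
X_num = [[20.0], [40.0], [60.0]]
X_cat = [["red"], ["red"], ["blue"]]
S = gower_matrix(X_num, X_cat)
D = 1.0 - S  # dissimilarities, usable as input to agglomerative clustering
```

The double loop makes the quadratic cost explicit: the matrix has one entry per pair of objects, which is what limits hierarchical methods on large datasets.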

Model-Based Clustering

Model-based approaches describe the data with statistical models, which raises the problems of model selection and parameter estimation. Although these methods promise a more nuanced handling of mixed data through probabilistic frameworks and latent variables, they often entail high computational costs and make assumptions about data distributions that are not always met in practice.
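
One way to picture a probabilistic treatment of mixed data is a finite mixture in which, within each component, numeric features are Gaussian and categorical features follow a categorical distribution, all conditionally independent. The sketch below computes the posterior component probabilities for a single record under that assumed model. The model choice, the function name, and every parameter value are illustrative assumptions; the survey covers a range of models and does not prescribe this one.

```python
import numpy as np


def responsibilities(x_num, x_cat, weights, means, stds, cat_probs):
    """Posterior probability of each mixture component for one record,
    assuming Gaussian numeric features and categorical features that are
    conditionally independent given the component."""
    log_post = []
    for k in range(len(weights)):
        lp = np.log(weights[k])
        # Gaussian log-density of each numeric feature.
        for x, m, s in zip(x_num, means[k], stds[k]):
            lp += -0.5 * np.log(2 * np.pi * s ** 2) - (x - m) ** 2 / (2 * s ** 2)
        # Log-probability of each observed category.
        for j, value in enumerate(x_cat):
            lp += np.log(cat_probs[k][j][value])
        log_post.append(lp)
    log_post = np.array(log_post)
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()


# Hypothetical two-component model with one numeric and one categorical feature.
weights = [0.5, 0.5]
means = [[0.0], [10.0]]
stds = [[1.0], [1.0]]
cat_probs = [[{"a": 0.9, "b": 0.1}], [{"a": 0.1, "b": 0.9}]]
post = responsibilities([0.0], ["a"], weights, means, stds, cat_probs)
```

This corresponds to the expectation step of an EM-style fit; the distributional assumptions visible in the code (Gaussian numeric features, independence given the component) are the kind that may not hold in practice.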

Neural Network-Based Clustering

The application of neural networks in mixed data clustering is predominantly explored through adaptations of self-organizing maps (SOM) and adaptive resonance theory (ART). These methods offer non-linear transformation capabilities, facilitating the clustering of intricately structured data. However, their complexity and potential for suboptimal mapping call for further advancements to ensure reliable clustering outcomes.

Other Methods

The "Other" category encompasses emerging clustering strategies that do not conform neatly to traditional paradigms. This includes ensemble, subspace, and density-based methods, each offering a distinct perspective on handling mixed data but often facing scalability or interpretability challenges.

Analysis and Future Directions

The survey underscores the practical and theoretical implications of clustering mixed datasets. Most notably, it acknowledges the necessity for balanced and interpretable models, especially as clustering applications extend into high-impact fields like health informatics and business analytics. The lack of consensus on robust similarity measures and scalable algorithms for large, complex data remains a central challenge. Furthermore, the authors call for more widespread availability of public datasets and algorithm implementations to foster comparison and innovation.

In conclusion, while significant progress has been made across various clustering methodologies, further research remains necessary. The paper highlights several open questions, ranging from improved cluster initialization techniques in partitional methods to the development of interpretable models that can provide actionable insights in practical applications. Its taxonomy and its review of existing work lay a foundation for continued research in mixed data clustering.