Robust subgroup discovery (2103.13686v4)

Published 25 Mar 2021 in cs.LG, cs.AI, and stat.ML

Abstract: We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.

Authors (4)
  1. Hugo Manuel Proença (4 papers)
  2. Peter Grünwald (43 papers)
  3. Thomas Bäck (121 papers)
  4. Matthijs van Leeuwen (24 papers)
Citations (11)

Summary

  • The paper introduces a subgroup list model that partitions data into non-overlapping segments to detect statistically robust patterns.
  • It leverages the MDL principle with specialized encodings for nominal and numeric targets to optimize subgroup selection globally.
  • Experiments on 54 datasets confirm that SSD++ outperforms traditional methods by enhancing interpretability and generalization.

Robust Subgroup Discovery: An Overview

The paper "Robust Subgroup Discovery" addresses two fundamental issues in the domain of subgroup discovery: the redundancy of discovered subgroups and statistical robustness against false discoveries. The authors propose a novel approach that simultaneously addresses both challenges using the Minimum Description Length (MDL) principle to discover optimal subgroup lists. This method is grounded in information theory and involves partitioning data into non-overlapping segments to create descriptive patterns that deviate significantly from the dataset’s overall distribution.

Key Contributions

  1. Subgroup List Model Class: The authors introduce a new model class for subgroup discovery, subgroup lists: ordered sets of subgroups that, together with a default rule, partition the dataset. Each subgroup is evaluated on the data not covered by the subgroups preceding it, which keeps the list non-redundant, and the default rule is fixed to the dataset's overall distribution so that discovery focuses on significant deviations from it. The model class accommodates univariate and multivariate targets, both nominal and numeric, and contains traditional top-1 subgroup discovery as a special case.
  2. MDL-Based Optimality Criterion: The paper formulates a global optimality criterion for subgroup lists using the MDL principle. For nominal targets, the data are encoded with Normalized Maximum Likelihood (NML) codes, which achieve minimax optimal code-length regret; for numeric targets, a Bayesian encoding with non-informative priors is used. Because the criterion scores the whole list rather than each subgroup in isolation, subgroups must be not only locally but also globally significant, which guards against patterns that arise by chance.
  3. Novel Greedy Algorithm (SSD++): Since finding the globally optimal subgroup list is NP-hard, the authors propose SSD++, a heuristic that constructs a subgroup list iteratively. At each step it adds the candidate subgroup with the largest gain under the MDL criterion; this gain is shown to be equivalent to a Bayesian one-sample (proportion, multinomial, or t-) test between the subgroup's target distribution and the dataset marginal, plus a multiple-hypothesis-testing penalty. A sketch of this greedy construction appears directly after this list.
  4. Experiments and Validation: The algorithm is empirically validated on 54 datasets with various target types. SSD++ outperforms previous subgroup discovery methods in the quality of the discovered subgroup lists, measured by the Sum of Weighted Kullback-Leibler (SWKL) divergences between each subgroup's target distribution and the dataset marginal (see the quality sketch after this list), and it generalizes better to unseen data, reflecting the statistical robustness conferred by the MDL formulation.
  5. Case Study: The paper concludes with a real-world application in analyzing how socioeconomic factors influence the academic performance of engineering students in Colombia, demonstrating the practical utility of the method in uncovering actionable insights from data.
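
The greedy construction in point 3 can be summarized with the following sketch. `generate_candidates`, `gain`, and `covers` are hypothetical placeholders rather than the authors' API; in SSD++ the gain is the MDL compression gain of adding a subgroup, and candidates are produced by a heuristic search over attribute-value descriptions.

```python
# Illustrative sketch of greedy subgroup-list construction in the spirit of SSD++.
# `generate_candidates`, `gain`, and `covers` are hypothetical placeholders.

def greedy_subgroup_list(rows, generate_candidates, gain):
    subgroup_list = []                   # ordered subgroups; the default rule is implicit
    uncovered = list(range(len(rows)))   # row indices not yet covered by earlier subgroups
    while True:
        candidates = generate_candidates(rows, uncovered)
        if not candidates:
            break
        best = max(candidates, key=lambda s: gain(s, uncovered))
        if gain(best, uncovered) <= 0:   # no candidate compresses the data any further
            break
        subgroup_list.append(best)
        uncovered = [i for i in uncovered if not best.covers(rows[i])]
    return subgroup_list                 # rows left uncovered fall to the dataset marginal
```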

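To make the quality measure in point 4 concrete, the sketch below computes a coverage-weighted Kullback-Leibler divergence between one subgroup's empirical target distribution and the dataset marginal for a nominal target. Names are hypothetical and simplified; the paper's SWKL measure aggregates such terms over all subgroups in a list (the exact normalization may differ), and its selection criterion is the MDL-based gain rather than WKL directly.

```python
# Illustrative sketch of the weighted KL (WKL) quality of a single subgroup for a
# nominal target: subgroup coverage times the KL divergence between the subgroup's
# empirical target distribution and the dataset marginal. Names are hypothetical.
import math
from collections import Counter


def empirical_distribution(labels):
    """Empirical probability of each target value in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {value: count / total for value, count in counts.items()}


def weighted_kl_quality(subgroup_labels, dataset_labels, eps=1e-12):
    """|subgroup| * KL(subgroup target distribution || dataset marginal)."""
    p = empirical_distribution(subgroup_labels)   # subgroup distribution
    q = empirical_distribution(dataset_labels)    # dataset marginal
    kl = sum(p_v * math.log(p_v / max(q.get(value, 0.0), eps))
             for value, p_v in p.items())
    return len(subgroup_labels) * kl
```
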
Implications and Future Directions

The presented work has significant implications for the field of subgroup discovery and pattern mining:

  • Interpretability: Because subgroups are discovered and read sequentially, users can assess the importance and contribution of each subgroup within the list.
  • Generalization: The method's robustness against overfitting ensures that the discovered patterns represent true deviations rather than artifacts of the dataset, which is crucial for domains requiring high confidence in pattern validity.
  • Scalability and Extensibility: Future work could explore scaling the approach to larger datasets and extending the framework to handle mixed data types, further expanding the applicability of the method to a broader range of real-world problems.

Overall, the authors bridge a gap in subgroup discovery by providing a framework that not only discovers novel and significant patterns but also ensures these patterns are statistically sound and interpretable. Their work lays a foundation for further advances in robust and interpretable pattern mining.
