- The paper proposes an MDL-based formulation that redefines subgroup discovery as a model selection problem balancing model complexity with data fit.
- It introduces SSD++, which makes subgroup discovery dispersion-aware by modeling both the mean and the variance of the numeric target.
- Empirical evaluations across multiple datasets show that SSD++, implemented as a greedy beam-search algorithm, outperforms state-of-the-art methods on the proposed Sum of Weighted Kullback-Leibler divergences (SWKL) metric.
Discovering Outstanding Subgroup Lists for Numeric Targets Using MDL
The paper "Discovering Outstanding Subgroup Lists for Numeric Targets Using MDL" addresses the task of finding interpretable and meaningful subgroup lists from data in contexts where a numeric target variable is of primary interest. Specifically, the paper focuses on Subgroup Set Discovery (SSD), which seeks to identify non-redundant sets of subgroups that effectively capture significant patterns in the data with respect to a numeric outcome.
Key Contributions
- Formulation Using MDL Principle: The authors propose a novel approach to subgroup set discovery by formulating it as a model selection problem rooted in the Minimum Description Length (MDL) principle. The MDL approach offers a principled framework for selecting the most informative and compact models by balancing model complexity with data fit, addressing the limitations of heuristic and hyperparameter-dependent methods that have previously characterized SSD approaches.
- Dispersion-Aware Subgroup Lists: The paper introduces a new method, termed SSD++, that models both the mean and the variance of the numeric target within subgroups, making it sensitive to dispersion. This is a significant advancement over traditional quality measures that focus solely on centrality (e.g., the mean): it captures reliable subgroups whose spread, not only whose center, deviates substantially from the target's overall distribution.
- Algorithm Design and Evaluation: SSD++ is a heuristic algorithm that builds a subgroup list iteratively: at each step, a greedy beam search explores candidate subgroups and the best one is appended, keeping the final list compact and non-redundant. Through empirical evaluation on multiple datasets, the authors demonstrate that SSD++ consistently outperforms state-of-the-art methods like top-k mining and sequential covering in terms of the proposed evaluation metric, the Sum of Weighted Kullback-Leibler divergences (SWKL).
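The iterative scheme described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: candidate subgroups are single-attribute equality tests rather than the richer conjunctions SSD++ searches over, and the gain is a coverage-weighted KL divergence against the overall target distribution, standing in for the paper's MDL-based gain.

```python
import math

def fit_normal(ys):
    """Maximum-likelihood mean and variance of a list of numbers."""
    n = len(ys)
    mu = sum(ys) / n
    var = sum((y - mu) ** 2 for y in ys) / n
    return mu, max(var, 1e-9)  # floor the variance to keep KL finite

def kl_normal(mu1, var1, mu0, var0):
    """KL( N(mu1, var1) || N(mu0, var0) ) between two univariate normals."""
    return 0.5 * (math.log(var0 / var1) + (var1 + (mu1 - mu0) ** 2) / var0 - 1.0)

def weighted_kl_gain(covered_ys, mu0, var0, n_total):
    """Coverage-weighted KL of the covered targets vs. the overall distribution."""
    if len(covered_ys) < 2:
        return 0.0
    mu, var = fit_normal(covered_ys)
    return (len(covered_ys) / n_total) * kl_normal(mu, var, mu0, var0)

def greedy_subgroup_list(rows, ys, beam_width=4, max_rules=5):
    """Greedily append the best subgroup found by a one-level beam search.

    rows: list of dicts (attribute -> value); ys: numeric targets.
    Returns a list of (condition, mean, variance) rules, where a condition
    is a single (attribute, value) equality test (a toy simplification).
    """
    mu0, var0 = fit_normal(ys)
    remaining = list(range(len(rows)))
    rules = []
    for _ in range(max_rules):
        # Enumerate candidate conditions on the not-yet-covered data.
        candidates = {(a, rows[i][a]) for i in remaining for a in rows[i]}
        scored = []
        for attr, val in candidates:
            covered = [ys[i] for i in remaining if rows[i][attr] == val]
            scored.append((weighted_kl_gain(covered, mu0, var0, len(ys)),
                           (attr, val)))
        scored.sort(reverse=True)
        beam = scored[:beam_width]  # SSD++ would refine the beam further
        if not beam or beam[0][0] <= 0.0:
            break  # no candidate improves the score: stop growing the list
        _, (attr, val) = beam[0]
        covered_ys = [ys[i] for i in remaining if rows[i][attr] == val]
        mu, var = fit_normal(covered_ys)
        rules.append(((attr, val), mu, var))
        remaining = [i for i in remaining if rows[i][attr] != val]
    return rules
```

Because each accepted rule removes its covered instances, later rules are scored only on the data the earlier rules leave uncovered, which mirrors the sequential semantics of a subgroup list.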
Methodological Insights
- Model Class Definition: The authors frame subgroup lists as models composed of ordered rules, where each rule applies a specific probabilistic model (normal distribution) to its covered subset of data. This formulation explicitly addresses the problem of overlapping subgroups by enforcing a sequential model specification.
- Subgroup Encoding and Quality Measures: Combining Bayesian statistics with the MDL framework, the paper formalizes subgroup quality in a way that naturally incorporates robustness, letting subgroup variance enter the computation. It extends standard MDL encoding strategies with the Bayesian optimal description of normal distributions with unknown mean and variance, so that models compress effectively even when subgroup parameters are unknown.
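The sequential (first-match) semantics of a subgroup list can be illustrated with a small sketch; the rule representation and function name here are illustrative, not the paper's. Each rule pairs a conjunction of conditions with a fitted normal distribution, and an instance receives the distribution of the first rule it satisfies, falling back to the dataset-wide default otherwise, which is how ordered rules resolve overlapping subgroups.

```python
def predict_distribution(rule_list, default, instance):
    """Return the (mean, variance) a subgroup list assigns to an instance.

    rule_list: ordered list of (conditions, mu, var), where conditions is a
    list of (attribute, value) equality tests; default: (mu, var) for the
    dataset-wide target distribution; instance: dict of attribute -> value.
    """
    for conditions, mu, var in rule_list:
        if all(instance.get(attr) == val for attr, val in conditions):
            return mu, var  # first matching rule decides
    return default  # uncovered instances get the overall distribution
```

For example, with a two-rule list where the first rule is more specific, an instance matching both rules is assigned the first rule's distribution, so the ordering itself removes the ambiguity that overlapping subgroups would otherwise create.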
Experimental Validation
The evaluation includes comparisons against baseline methods across several datasets, demonstrating superior performance in terms of SWKL values. Results showed that SSD++ produces subgroup lists that are not only more compact but also more diverse in their coverage of the dataset, indicating its effectiveness in capturing the inherent complexity of the data distributions.
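One plausible reading of the SWKL score used in this comparison can be sketched as follows; this is a hedged reconstruction, not the paper's exact definition. Each subgroup contributes the KL divergence between its fitted normal and the dataset-wide normal, weighted by the fraction of instances it covers, and the list's score is the sum of these contributions.

```python
import math

def kl_normal(mu1, var1, mu0, var0):
    """KL( N(mu1, var1) || N(mu0, var0) ) for univariate normals."""
    return 0.5 * (math.log(var0 / var1) + (var1 + (mu1 - mu0) ** 2) / var0 - 1.0)

def swkl(subgroups, mu0, var0, n_total):
    """Sum of coverage-weighted KL divergences of a subgroup list.

    subgroups: iterable of (n_covered, mu, var) statistics per subgroup;
    (mu0, var0): the dataset-wide target distribution; n_total: dataset size.
    """
    return sum((n / n_total) * kl_normal(mu, var, mu0, var0)
               for n, mu, var in subgroups)
```

Under this reading, a subgroup identical to the overall distribution contributes zero, while subgroups that shift the mean or shrink the variance contribute more the larger their coverage, so higher SWKL indicates a list that captures more of the target's structure.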
Implications and Future Work
This research presents significant implications for pattern mining, particularly in fields requiring interpretable machine learning models with numeric targets, such as fraud detection and customer segmentation. The integration of MDL in SSD opens avenues for further optimization of the algorithm, potentially reducing computational overhead and expanding applicability to more complex data forms, including multi-target settings.
Future work may explore adaptive mechanisms for setting the size of subgroup lists based on specific problem constraints or user preferences and enhancing interpretability through user-controlled constraints. Additionally, extending the MDL-based framework to accommodate various types of target variables, including categorical data, remains an attractive direction for further research.
In summary, the paper advances the state-of-the-art in subgroup discovery by presenting an MDL-based approach that ensures succinctness and non-redundancy while accommodating both central and dispersion characteristics within numeric targets.