
Characterizing classification datasets: a study of meta-features for meta-learning (1808.10406v2)

Published 30 Aug 2018 in cs.LG and stat.ML

Abstract: Meta-learning is increasingly used to support the recommendation of machine learning algorithms and their configurations. Such recommendations are made based on meta-data, consisting of performance evaluations of algorithms on prior datasets, as well as characterizations of these datasets. These characterizations, also called meta-features, describe properties of the data which are predictive for the performance of machine learning algorithms trained on them. Unfortunately, despite being used in a large number of studies, meta-features are not uniformly described, organized and computed, making many empirical studies irreproducible and hard to compare. This paper aims to deal with this by systematizing and standardizing data characterization measures for classification datasets used in meta-learning. Moreover, it presents MFE, a new tool for extracting meta-features from datasets and identifying more subtle reproducibility issues in the literature, proposing guidelines for data characterization that strengthen reproducible empirical research in meta-learning.

Citations (30)

Summary

  • The paper proposes a systematic taxonomy of meta-features to improve algorithm selection and meta-learning performance.
  • It develops the Meta-Feature Extractor tool that flexibly computes dataset properties while handling data types and missing values.
  • The study addresses reproducibility challenges by standardizing meta-feature extraction and offering solutions for hyperparameter tuning and data transformation.

Meta-learning and Dataset Characteristics

Meta-learning, often described as learning to learn, has gained momentum in machine learning. One of its key applications is recommending suitable machine learning algorithms and configurations for new tasks. Such recommendations are based on characteristics extracted from datasets, called meta-features, which encapsulate properties of the data that are predictive of the performance of models trained on it. However, meta-features lack a standard way of being described, computed, and organized, which hampers the reproducibility and comparison of empirical studies.

Systematizing Meta-Features

The paper addresses these issues by proposing a systematic approach to defining and categorizing meta-features. It introduces a comprehensive taxonomy that organizes meta-features into groups according to the aspect of a classification dataset they characterize: simple, statistical, information-theoretic, model-based, landmarking, and others. Their usefulness varies across learning tasks, and their computation depends on the data type (numerical or categorical) and on other choices that can influence a machine learning task.
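To make the groups concrete, the sketch below computes one illustrative measure from three of them on a tiny hand-made dataset: a simple measure (counts of instances, attributes, and classes), a statistical measure (per-attribute standard deviation, summarized by its mean), and an information-theoretic measure (class entropy). This is a didactic sketch, not the paper's reference implementation; the paper's definitions include many more variants and summarization choices.

```python
import math
from collections import Counter

# Tiny hand-made classification dataset: 6 instances, 2 numeric attributes.
X = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3], [3.1, 2.2], [2.9, 2.4], [3.0, 2.1]]
y = ["a", "a", "a", "b", "b", "b"]

# Simple meta-features: basic counts describing dataset size and shape.
n_instances = len(X)
n_attributes = len(X[0])
n_classes = len(set(y))

# Statistical meta-feature: per-attribute standard deviation,
# summarized across attributes by taking the mean.
def std(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

mean_attr_std = sum(std(col) for col in zip(*X)) / n_attributes

# Information-theoretic meta-feature: entropy of the class distribution.
counts = Counter(y)
class_entropy = -sum((c / n_instances) * math.log2(c / n_instances)
                     for c in counts.values())

print(n_instances, n_attributes, n_classes)  # 6 2 2
print(class_entropy)                         # 1.0 (balanced binary classes)
```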

Challenge of Reproducibility

Reproducibility is critically examined in this paper. It draws attention to aspects that have traditionally been overlooked: handling data-type incompatibilities, setting hyperparameters, transforming data ranges, summarizing per-attribute outcomes into fixed-length vectors, handling exceptions, and dealing with high-dimensional meta-feature spaces. For each of these, the paper proposes possible solutions, so that future meta-learning research can become more systematic and reproducible.
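Summarization is a good illustration of why these choices matter. A per-attribute measure yields one value per column, so its length varies across datasets; summarization functions map it to a fixed-length meta-feature vector, and different choices yield different vectors. The sketch below (our own minimal example, with made-up values, not code from the paper) shows the same per-attribute values summarized two ways:

```python
import math

# A per-attribute measure (e.g. skewness) yields one value per attribute,
# so its length varies across datasets. Summarization functions map it to
# a fixed-length vector -- and different choices give different
# meta-features, which is why studies must report which ones they used.
per_attribute_values = [0.10, 0.80, 0.30, 0.55]  # one value per column

def summarize(values, functions):
    return {name: fn(values) for name, fn in functions.items()}

mean = lambda v: sum(v) / len(v)
sd = lambda v: math.sqrt(sum((x - mean(v)) ** 2 for x in v) / (len(v) - 1))

choice_a = summarize(per_attribute_values, {"mean": mean})
choice_b = summarize(per_attribute_values, {"min": min, "max": max, "sd": sd})

# The two choices produce meta-feature vectors of different lengths and
# contents, even though the underlying measure is identical.
print(choice_a)  # {'mean': 0.4375}
print(choice_b)
```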

The Meta-Feature Extractor

The Meta-Feature Extractor (MFE) tool implements the standardization proposed in the paper. MFE computes a wide range of meta-features and lets users tailor the extraction process to their needs: it handles data-type incompatibilities and missing values, and supports extensive customization through user-defined hyperparameters. Although focused on classification tasks, the tool is a significant step towards reproducible meta-learning studies.
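The summary describes MFE's design only at a high level. As a rough illustration of that workflow (select meta-feature groups, fit on a dataset, extract named values), the toy extractor below mimics it with two of the measures discussed above. The class name, method names, and meta-feature names here are our own invention for illustration, not the tool's actual API.

```python
import math
from collections import Counter

class ToyMFE:
    """Hypothetical sketch of an MFE-style extractor: choose meta-feature
    groups, fit on (X, y), then extract a dict of named meta-features."""

    GROUPS = {
        "simple": ["nr_inst", "nr_attr", "nr_class"],
        "info-theory": ["class_ent"],
    }

    def __init__(self, groups=("simple",)):
        unknown = set(groups) - set(self.GROUPS)
        if unknown:
            raise ValueError(f"unknown meta-feature groups: {unknown}")
        self.groups = list(groups)

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def extract(self):
        n = len(self.X)
        counts = Counter(self.y)
        values = {
            "nr_inst": n,
            "nr_attr": len(self.X[0]),
            "nr_class": len(counts),
            "class_ent": -sum((c / n) * math.log2(c / n)
                              for c in counts.values()),
        }
        wanted = [m for g in self.groups for m in self.GROUPS[g]]
        return {m: values[m] for m in wanted}

X = [[1.0, 0.2], [0.9, 0.3], [3.1, 2.2], [3.0, 2.1]]
y = ["a", "a", "b", "b"]
ft = ToyMFE(groups=("simple", "info-theory")).fit(X, y).extract()
print(ft)  # {'nr_inst': 4, 'nr_attr': 2, 'nr_class': 2, 'class_ent': 1.0}
```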

Conclusion and Future Work

This paper makes a critical contribution to meta-learning by standardizing the way we characterize classification datasets and by providing a new tool, the MFE, for computing meta-features efficiently. Future avenues of research include extending the taxonomy to non-classification tasks, improving meta-feature interpretability, and empirically evaluating the effect of characterization choices on meta-learning tasks.