AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data (2003.06505v1)

Published 13 Mar 2020 in stat.ML and cs.LG

Abstract: We introduce AutoGluon-Tabular, an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models on an unprocessed tabular dataset such as a CSV file. Unlike existing AutoML frameworks that primarily focus on model/hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. Experiments reveal that our multi-layer combination of many models offers better use of allocated training time than seeking out the best. A second contribution is an extensive evaluation of public and commercial AutoML platforms including TPOT, H2O, AutoWEKA, auto-sklearn, AutoGluon, and Google AutoML Tables. Tests on a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark reveal that AutoGluon is faster, more robust, and much more accurate. We find that AutoGluon often even outperforms the best-in-hindsight combination of all of its competitors. In two popular Kaggle competitions, AutoGluon beat 99% of the participating data scientists after merely 4h of training on the raw data.

Authors (7)

Nick Erickson
Jonas Mueller
Alexander Shirkov
Hang Zhang
Pedro Larroy
Mu Li
Alexander Smola

Summary

An Overview of AutoGluon-Tabular: Automating ML for Structured Data

The paper introduces AutoGluon-Tabular, an open-source AutoML framework for structured (tabular) data. Most existing AutoML systems focus on combined algorithm selection and hyperparameter optimization (CASH). AutoGluon-Tabular instead relies on ensembling many models and stacking them in multiple layers. The central claim is that these techniques, often overlooked or inadequately implemented in other frameworks, make better use of a fixed training-time budget than searching for a single best model and configuration.

Key Contributions

The paper delineates several notable contributions:

Ensemble Learning and Stacking: The framework trains many models and combines them through multiple layers of stacking. Unlike traditional shallow stacking, it uses a broad set of model types both as base learners and as stackers. Each stacker receives the predictions of the previous layer together with the original features, so higher layers can revisit the raw data rather than relying only on earlier predictions.
Advanced Data Processing: AutoGluon-Tabular automatically detects and preprocesses data types, effectively handling missing values and categorical variables. This model-agnostic preprocessing is followed by model-specific transformations, allowing varied models to train on tailored versions of the dataset.
Neural Network Architecture: The paper describes a neural network that uses embedding layers for categorical features and skip connections to improve gradient flow. Combined with dropout and batch normalization, this design is intended to help the network learn from mixed data types.
Extensive Benchmarking and Evaluation: The paper compares AutoGluon-Tabular against public and commercial AutoML platforms, including TPOT, H2O, AutoWEKA, auto-sklearn, and Google AutoML Tables, and reports that it is faster, more robust, and more accurate. In two popular Kaggle competitions, AutoGluon beat 99% of the participating data scientists after 4 hours of training on the raw data.
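
The stacking idea in the first contribution can be illustrated with a minimal, dependency-free sketch. This is not AutoGluon's implementation: the two toy models (MeanModel, NearestNeighbor), the helper names, the fold scheme, and the final simple average are all assumptions chosen for brevity. The sketch only shows how each layer is trained on out-of-fold predictions from the previous layer concatenated with the original features.

```python
import random


class MeanModel:
    """Toy regressor (hypothetical): always predicts the training mean."""
    def fit(self, X, y):
        self.mu = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mu for _ in X]


class NearestNeighbor:
    """Toy regressor (hypothetical): copies the target of the closest training row."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return [self.y[min(range(len(self.X)), key=lambda i: dist(row, self.X[i]))]
                for row in X]


def out_of_fold_predictions(make_model, X, y, k=3):
    """Predict every row with a model that never saw that row during training."""
    n = len(X)
    oof = [0.0] * n
    for fold in range(k):
        held = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(held, model.predict([X[i] for i in held])):
            oof[i] = p
    return oof


def stack_layer(model_factories, X_orig, X_in, y, k=3):
    """Train one layer on X_in; return original features + one prediction column per model."""
    cols = [out_of_fold_predictions(f, X_in, y, k) for f in model_factories]
    return [list(x) + [c[i] for c in cols] for i, x in enumerate(X_orig)]


random.seed(0)
X = [[random.random(), random.random()] for _ in range(30)]
y = [2 * a + b for a, b in X]
factories = [MeanModel, NearestNeighbor]

layer1 = stack_layer(factories, X, X, y)       # base layer sees only raw features
layer2 = stack_layer(factories, X, layer1, y)  # stackers see raw features + layer-1 predictions
# Simplification: average the layer-2 predictions instead of learning ensemble weights.
final = [sum(row[len(X[0]):]) / len(factories) for row in layer2]
```

The sketch covers only the training-time flow of features between layers. A complete system would also need to apply the fitted layers to unseen data and would typically learn the weights of the final ensemble rather than averaging.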

Experimental Insights

The experiments cover 50 classification and regression tasks drawn from Kaggle and the OpenML AutoML Benchmark. The results suggest substantial accuracy gains from the framework's ensembling strategy. Averaged over these tasks, AutoGluon ranked first or second among the compared frameworks, indicating both robustness and strong predictive performance.
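
For readers unfamiliar with rank-based comparison, the following sketch shows one common way an average rank across datasets can be computed. The function name, the tie-handling rule, and the framework and dataset names are hypothetical, and the loss values are toy numbers made up for illustration, not results from the paper.

```python
def average_ranks(losses):
    """losses: {dataset: {framework: loss}}, lower is better.

    Within each dataset a framework's rank is 1 plus the number of frameworks
    with a strictly lower loss; tied frameworks share the mean of their ranks.
    """
    frameworks = sorted(next(iter(losses.values())))
    totals = {f: 0.0 for f in frameworks}
    for per_dataset in losses.values():
        for f in frameworks:
            better = sum(per_dataset[g] < per_dataset[f] for g in frameworks)
            tied = sum(per_dataset[g] == per_dataset[f] for g in frameworks) - 1
            totals[f] += 1 + better + tied / 2
    return {f: totals[f] / len(losses) for f in frameworks}


# Toy numbers for illustration only.
toy = {
    "dataset_1": {"framework_a": 0.10, "framework_b": 0.20, "framework_c": 0.30},
    "dataset_2": {"framework_a": 0.25, "framework_b": 0.15, "framework_c": 0.40},
    "dataset_3": {"framework_a": 0.05, "framework_b": 0.05, "framework_c": 0.50},
}
ranks = average_ranks(toy)
```

Ranking per dataset before averaging keeps a single dataset with unusually large losses from dominating the comparison, which is one reason rank summaries are often used alongside raw scores.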

Implications for AI and AutoML

The implications of AutoGluon-Tabular extend beyond benchmark scores. By shifting the focus from CASH to ensembling and automated data processing, the framework simplifies the ML pipeline for practitioners. This makes advanced ML techniques more accessible to non-experts and reduces the computational burden usually associated with exhaustive hyperparameter tuning.

Future Directions

The paper suggests future research avenues, such as integrating more flexible model selection strategies and extending AutoGluon's capabilities to encompass unsupervised and semi-supervised learning. Furthermore, expanding support for more complex data structures could lead to broader applicability.

Conclusion

AutoGluon-Tabular is a compelling approach to automating machine learning for structured data, emphasizing ensembling over traditional model tuning. The work argues for an AutoML paradigm in which simplicity and performance reinforce each other rather than trade off.
