- The paper introduces NODE, a fully differentiable deep architecture that bridges neural networks and decision-tree ensembles, designed to outperform state-of-the-art GBDTs on tabular data.
- It employs the entmax transformation for differentiable feature selection and split decisions within multi-layer ensembles of oblivious decision trees.
- Experiments on datasets such as Higgs, YearPrediction, and Epsilon demonstrate NODE's strong performance, scalability, and versatility.
Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data
The paper "Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data" addresses the longstanding challenge of leveraging deep neural networks (DNNs) effectively within the domain of tabular data. Gradient Boosting Decision Trees (GBDT) have been the preferred choice for many tabular data problems due to their consistent performance. However, this work introduces a novel DNN architecture, Neural Oblivious Decision Ensembles (NODE), aiming to bridge the apparent performance gap and outperform state-of-the-art GBDT methodologies on various datasets.
The NODE architecture is designed with an emphasis on integrating the merits of both DNNs and traditional ensemble methods, specifically those that utilize oblivious decision trees—a choice motivated by the performance success of CatBoost, an advanced GBDT package. NODE combines the adaptability of end-to-end gradient-based optimization with the robustness of ensemble learning through oblivious decision trees and hierarchical representation learning. The authors present a detailed experimental comparison, showing that NODE achieves superior results over leading GBDT implementations across a suite of benchmark tabular datasets.
Architecture of NODE
NODE extends oblivious decision trees with a fully differentiable mechanism for both feature selection and split decisions, implemented via the entmax transformation, a sparse alternative to softmax. Because every operation is differentiable, a NODE layer can be placed inside any computational graph built with modern deep learning frameworks such as PyTorch or TensorFlow. The architecture also supports multi-layer ensembles, effectively "deep" analogues of GBDTs, whose layers are trained jointly end to end; a minimal sketch of the core layer follows below.
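To make the mechanism concrete, here is a minimal, self-contained sketch of one differentiable oblivious-tree layer in PyTorch. It follows the spirit of the paper's ODST module but is not the authors' released code: the class name ODSTSketch and the hyperparameter defaults are illustrative, and a plain sigmoid stands in for the entmoid (two-class entmax) split function used in the paper. It assumes the open-source entmax package for sparse feature selection.

```python
import torch
import torch.nn as nn
from entmax import entmax15  # sparse alternative to softmax (pip install entmax)

class ODSTSketch(nn.Module):
    """Illustrative differentiable oblivious decision tree ensemble (one NODE layer)."""

    def __init__(self, in_features, num_trees=8, depth=4):
        super().__init__()
        self.num_trees, self.depth = num_trees, depth
        # one feature-selection logit vector per (tree, depth level)
        self.feature_logits = nn.Parameter(torch.randn(num_trees, depth, in_features))
        # learnable split thresholds and temperatures per (tree, level)
        self.thresholds = nn.Parameter(torch.zeros(num_trees, depth))
        self.log_tau = nn.Parameter(torch.zeros(num_trees, depth))
        # one scalar response per leaf; an oblivious tree of depth d has 2**d leaves
        self.responses = nn.Parameter(torch.randn(num_trees, 2 ** depth))

    def forward(self, x):  # x: [batch, in_features]
        # sparse, differentiable feature selection: entmax over input features
        weights = entmax15(self.feature_logits, dim=-1)            # [T, D, F]
        f = torch.einsum('bf,tdf->btd', x, weights)                # [B, T, D]
        # soft binary split at each level (sigmoid stands in for entmoid)
        c = torch.sigmoid((f - self.thresholds) / self.log_tau.exp())
        # reach probability of each leaf = product over levels of c or (1 - c)
        leaf_probs = x.new_ones(x.size(0), self.num_trees, 1)
        for level in range(self.depth):
            go_right = c[..., level:level + 1]                     # [B, T, 1]
            leaf_probs = torch.cat(
                [leaf_probs * (1.0 - go_right), leaf_probs * go_right], dim=-1)
        # each tree returns the probability-weighted sum of its leaf responses
        tree_outputs = (leaf_probs * self.responses).sum(dim=-1)   # [B, T]
        return tree_outputs.mean(dim=-1, keepdim=True)             # ensemble average
```

In the full architecture, several such layers are stacked in a DenseNet-like fashion: each layer's tree outputs are concatenated to its input before being fed to the next layer, and the final prediction aggregates contributions from all layers.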
Experimental Results
The experimental results are strong: NODE consistently surpasses competitive GBDT implementations (CatBoost and XGBoost) on several datasets, including Epsilon, YearPrediction, and Higgs, which span classification, regression, and ranking-style tasks. The authors attribute NODE's performance to its combination of deep representation learning with the structural strengths of decision tree ensembles. Notably, NODE remains competitive both with default hyperparameters and after careful tuning, underscoring its versatility.
Implications and Future Work
The introduction of NODE has notable implications for both academic research and practical applications in machine learning. The potential to integrate NODE within broader ML systems that handle tabular data extends its utility beyond standalone applications. This paper opens avenues for enhancing multi-modal systems where tabular data components need to be efficiently processed alongside image or sequence data streams. Future research may explore extensions of NODE to include non-oblivious decision tree structures or investigate the impact of incorporating additional differentiable components.
Lastly, the paper invites further investigation into NODE's scalability and its integration into complex pipelines. The publicly released PyTorch implementation is an invitation for the broader research community to engage with and build upon this work.
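As an illustration of that end-to-end differentiability, the following toy training loop exercises the hypothetical ODSTSketch layer from the earlier sketch; it is not the authors' released API, and the data here is random stand-in data.

```python
import torch
import torch.nn.functional as F

# Toy regression setup using the illustrative ODSTSketch layer defined earlier.
model = ODSTSketch(in_features=10, num_trees=64, depth=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(256, 10), torch.randn(256, 1)  # random stand-in data

for step in range(100):
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()   # gradients flow through splits and feature selection alike
    optimizer.step()
```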