Retrieval & Fine-Tuning for In-Context Tabular Models (2406.05207v1)
Abstract: Tabular data is a pervasive modality spanning a wide range of domains, and its inherent diversity poses a considerable challenge for deep learning. Recent advancements using transformer-based in-context learning have shown promise on smaller and less complex datasets, but have struggled to scale to larger and more complex ones. To address this limitation, we propose a combination of retrieval and fine-tuning: we can adapt the transformer to a local subset of the data by collecting nearest neighbours, and then perform task-specific fine-tuning with this retrieved set of neighbours in context. Using TabPFN as the base model -- currently the best tabular in-context learner -- and applying our retrieval and fine-tuning scheme on top results in what we call a locally-calibrated PFN, or LoCalPFN. We conduct an extensive evaluation on 95 datasets curated by TabZilla from OpenML, on which LoCalPFN establishes a new state of the art -- even with respect to tuned tree-based models. Notably, we show a significant boost in performance compared to the base in-context model, demonstrating the efficacy of our approach and advancing the frontier of deep learning in tabular data.
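The retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it retrieves each test point's nearest training neighbours with scikit-learn and fits a simple local model on them as a stand-in, whereas LoCalPFN would instead pass the retrieved neighbours to TabPFN as its in-context training set (and fine-tune the transformer on such local contexts).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

# A small tabular classification dataset, split into train/test.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Index the training set for nearest-neighbour retrieval.
k = 64  # local context size (hyperparameter; illustrative choice)
index = NearestNeighbors(n_neighbors=k).fit(X_tr)
_, nbr_idx = index.kneighbors(X_te)

# For each test point, adapt to its retrieved local context.
# LoCalPFN would feed these k neighbours to TabPFN as the in-context
# training set; here a logistic regression serves as a stand-in.
preds = []
for i, nbrs in enumerate(nbr_idx):
    labels = y_tr[nbrs]
    if np.unique(labels).size == 1:
        # All retrieved neighbours share one class: predict it directly.
        preds.append(labels[0])
    else:
        local = LogisticRegression(max_iter=1000).fit(X_tr[nbrs], labels)
        preds.append(local.predict(X_te[i : i + 1])[0])

acc = np.mean(np.array(preds) == y_te)
print(f"local-context accuracy: {acc:.3f}")
```

For large training sets, the exact neighbour search above would typically be replaced by an approximate index such as Faiss; the per-query "fit on retrieved neighbours" structure stays the same.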