Retrieval & Fine-Tuning for In-Context Tabular Models (2406.05207v1)

Published 7 Jun 2024 in cs.LG

Abstract: Tabular data is a pervasive modality spanning a wide range of domains, and the inherent diversity poses a considerable challenge for deep learning. Recent advancements using transformer-based in-context learning have shown promise on smaller and less complex datasets, but have struggled to scale to larger and more complex ones. To address this limitation, we propose a combination of retrieval and fine-tuning: we can adapt the transformer to a local subset of the data by collecting nearest neighbours, and then perform task-specific fine-tuning with this retrieved set of neighbours in context. Using TabPFN as the base model -- currently the best tabular in-context learner -- and applying our retrieval and fine-tuning scheme on top results in what we call a locally-calibrated PFN, or LoCalPFN. We conduct extensive evaluation on 95 datasets curated by TabZilla from OpenML, upon which we establish a new state-of-the-art with LoCalPFN -- even with respect to tuned tree-based models. Notably, we show a significant boost in performance compared to the base in-context model, demonstrating the efficacy of our approach and advancing the frontier of deep learning in tabular data.
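
To make the recipe described above concrete, here is a minimal PyTorch sketch of retrieval plus task-specific fine-tuning for an in-context tabular classifier. It is an illustration only: the TinyICL module is a stand-in for TabPFN (whose architecture is not reproduced here), and the toy dataset, context size k, and optimiser settings are assumptions made for demonstration. The pattern it follows is the one stated in the abstract: retrieve each row's nearest training neighbours, fine-tune the in-context model with those neighbours in context, and retrieve again at prediction time.

```python
# Minimal sketch of the retrieval-and-fine-tuning recipe from the abstract.
# Assumptions (not from the paper): TinyICL is an illustrative stand-in for
# TabPFN, and the dataset, context size k, and optimiser settings are toy
# choices made only for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors


class TinyICL(nn.Module):
    """Stand-in for an in-context tabular classifier such as TabPFN: it embeds
    the retrieved (x, y) context rows and the unlabelled query row, lets the
    query attend over its local context, and emits class logits."""

    def __init__(self, n_features: int, n_classes: int, d: int = 64):
        super().__init__()
        self.n_classes = n_classes
        self.embed_ctx = nn.Linear(n_features + n_classes, d)  # context rows carry their label
        self.embed_qry = nn.Linear(n_features, d)               # query row does not
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.head = nn.Linear(d, n_classes)

    def forward(self, ctx_x, ctx_y, qry_x):
        # ctx_x: (B, k, F), ctx_y: (B, k), qry_x: (B, F)
        y_onehot = F.one_hot(ctx_y, self.n_classes).float()
        ctx = self.embed_ctx(torch.cat([ctx_x, y_onehot], dim=-1))  # (B, k, d)
        qry = self.embed_qry(qry_x).unsqueeze(1)                    # (B, 1, d)
        out, _ = self.attn(qry, ctx, ctx)                           # query attends to its neighbours
        return self.head(out.squeeze(1))                            # (B, n_classes)


def retrieve_context(index, X_tr, y_tr, queries, k, skip_self=False):
    """Gather the k nearest training neighbours of each query row.
    skip_self drops the closest hit, which is the query row itself whenever
    the queries are drawn from the training set."""
    _, idx = index.kneighbors(queries.numpy(), n_neighbors=k + int(skip_self))
    idx = torch.as_tensor(idx[:, 1:] if skip_self else idx)
    return X_tr[idx], y_tr[idx]


# Toy data and an exact kNN index (an approximate index would be used at scale).
X, y = make_classification(n_samples=2000, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, y_tr = torch.tensor(X[:1600], dtype=torch.float32), torch.tensor(y[:1600])
X_te, y_te = torch.tensor(X[1600:], dtype=torch.float32), torch.tensor(y[1600:])
index = NearestNeighbors().fit(X_tr.numpy())

# Task-specific fine-tuning: each sampled query row is paired with its own
# retrieved neighbourhood as the in-context training set.
model, k = TinyICL(n_features=16, n_classes=3), 64
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for _ in range(20):                                   # illustrative number of steps
    sel = torch.randperm(len(X_tr))[:256]             # a batch of query rows
    qx, qy = X_tr[sel], y_tr[sel]
    cx, cy = retrieve_context(index, X_tr, y_tr, qx, k, skip_self=True)
    loss = F.cross_entropy(model(cx, cy, qx), qy)
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: every test row is classified against its own local context.
with torch.no_grad():
    cx, cy = retrieve_context(index, X_tr, y_tr, X_te, k)
    acc = (model(cx, cy, X_te).argmax(-1) == y_te).float().mean()
    print(f"toy accuracy: {acc:.3f}")
```

One design note: the exact scikit-learn index here is only for readability; on large datasets an approximate nearest-neighbour index such as Faiss, cited among the references, would typically take its place.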

References (52)
  1. Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems, 2021.
  2. Tabnet: Attentive interpretable tabular learning. In AAAI Conference on Artificial Intelligence, pages 6679–6687, 2021.
  3. Google dataset search by the numbers. In The Semantic Web – ISWC 2020, pages 667–682, 2020.
  4. An inductive bias for tabular deep learning. In Advances in Neural Information Processing Systems, 2023.
  5. OpenML benchmarking suites. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  6. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  7. Elephants never forget: Testing language models for memorization of tabular data. arXiv preprint arXiv:2403.06644, 2024.
  8. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240, 2022.
  9. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  10. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
  11. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
  12. Sequential deep learning for credit risk monitoring with tabular financial data. arXiv preprint arXiv:2012.15330, 2020.
  13. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988.
  14. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, 2015.
  15. Fine-tuning the retrieval mechanism for tabular deep learning. In NeurIPS 2023 Second Table Representation Learning Workshop, 2023.
  16. LIFT: Language-interfaced fine-tuning for non-language machine learning tasks. In Advances in Neural Information Processing Systems, 2022.
  17. The Faiss library. arXiv preprint arXiv:2401.08281, 2024.
  18. Large language models (LLMs) on tabular data: Prediction, generation, and understanding – A survey. arXiv preprint arXiv:2402.17944, 2024.
  19. TuneTables: Context optimization for scalable prior-data fitted networks. arXiv preprint arXiv:2402.11137, 2024.
  20. TabR: Tabular deep learning meets nearest neighbors. In International Conference on Learning Representations, 2024.
  21. Why do tree-based models still outperform deep learning on typical tabular data? In Advances in Neural Information Processing Systems, 2022.
  22. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938, 2020.
  23. Trevor Hastie. Generalized additive models. In Statistical models in S, pages 249–307. Routledge, 2017.
  24. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.
  25. TabLLM: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pages 5549–5581, 2023.
  26. TabPFN: A transformer that solves small tabular classification problems in a second. In International Conference on Learning Representations, 2023.
  27. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
  28. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
  29. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.
  30. Self-attention between datapoints: Going beyond individual input-output pairs in deep learning. In Advances in Neural Information Processing Systems, 2021.
  31. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, 2020.
  32. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  33. TabPFGen – Tabular data generation with TabPFN. In NeurIPS 2023 Second Table Representation Learning Workshop, 2023.
  34. In-context data distillation with TabPFN. arXiv preprint arXiv:2402.06971, 2024.
  35. When do neural nets outperform boosted trees on tabular data? In Advances in Neural Information Processing Systems, 2023.
  36. Transformers can do Bayesian inference. In International Conference on Learning Representations, 2022.
  37. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. KI-Künstliche Intelligenz, 32:37–53, 2018.
  38. DNNR: Differential nearest neighbors regression. In International Conference on Machine Learning, pages 16296–16317, 2022.
  39. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
  40. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, 2018.
  41. Retrieval & interaction machine for tabular data prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1379–1389, 2021.
  42. Improving language understanding by generative pre-training, 2018.
  43. Interpretable machine learning for TabPFN. arXiv preprint arXiv:2403.10923, 2024.
  44. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.
  45. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv preprint arXiv:2106.01342, 2021.
  46. A customer churn prediction model based on XGBoost and MLP. In International Conference on Computer Engineering and Application (ICCEA), pages 608–612, 2020.
  47. Trust issues: Uncertainty estimation does not enable reliable OOD detection on medical tabular data. In Machine Learning for Health, pages 341–354, 2020.
  48. Deep learning: A primer for psychologists. Psychological Methods, 26(6):743, 2021.
  49. Boris van Breugel and Mihaela van der Schaar. Why tabular foundation models should be a research priority. In International Conference on Machine Learning, 2024.
  50. Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
  51. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.
  52. VIME: Extending the success of self- and semi-supervised learning to tabular domain. In Advances in Neural Information Processing Systems, 2020.