MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement (2305.12081v4)

Published 20 May 2023 in cs.LG and cs.AI

Abstract: Tabular data prediction has been employed in medical applications such as patient health risk prediction. However, existing methods usually revolve around algorithm design while overlooking the significance of data engineering. Medical tabular datasets frequently exhibit significant heterogeneity across sources, with limited sample sizes per source. As a result, previous predictors are often trained on manually curated small datasets that struggle to generalize to different tabular datasets at inference time. This paper proposes MediTab, a method for scaling medical tabular data predictors to tabular inputs with varying features. The method uses a data engine that leverages LLMs to consolidate tabular samples, overcoming the barrier between tables with distinct schemas. It also aligns out-domain data with the target task using a "learn, annotate, and refine" pipeline. The expanded training data enables the pre-trained MediTab to perform inference on arbitrary tabular inputs in the domain without fine-tuning, yielding significant improvements over supervised baselines: it reaches an average ranking of 1.57 and 1.00 on 7 patient outcome prediction datasets and 3 trial outcome prediction datasets, respectively. In addition, MediTab exhibits impressive zero-shot performance, outperforming supervised XGBoost models by 8.9% and 17.2% on average in the two prediction tasks, respectively.
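As a rough illustration of the consolidation step, the sketch below linearizes rows from tables with different schemas into natural-language strings, so that a single text-based predictor can consume samples from all sources. The function name, serialization format, and example columns are illustrative assumptions, not the paper's exact implementation (which additionally uses an LLM to describe and sanity-check the consolidated samples).

```python
# Hypothetical sketch of MediTab-style data consolidation: rows from tables
# with different schemas are serialized into "column is value" sentences,
# mapping heterogeneous tabular data into one shared text space.
# Details here are assumptions for illustration, not the paper's exact code.

def serialize_row(row: dict) -> str:
    """Linearize one tabular sample into a natural-language string."""
    parts = [
        f"{col.replace('_', ' ')} is {val}"
        for col, val in row.items()
        if val not in (None, "")  # skip missing cells
    ]
    return "; ".join(parts) + "."

# Two sources with distinct schemas map into the same text representation.
trial_row = {"age": 63, "smoker": "yes", "ecog_score": 1}
patient_row = {"gender": "F", "systolic_bp": 142, "diagnosis": "T2 diabetes"}

print(serialize_row(trial_row))    # -> age is 63; smoker is yes; ecog score is 1.
print(serialize_row(patient_row))  # -> gender is F; systolic bp is 142; diagnosis is T2 diabetes.
```

Under such a scheme, the "learn, annotate, and refine" pipeline described in the abstract can pseudo-label serialized out-domain rows with a model trained on the target task and filter noisy annotations before retraining.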
