Relational Deep Learning: Graph Representation Learning on Relational Databases (2312.04615v1)
Abstract: Much of the world's most valued data is stored in relational databases and data warehouses, where the data is organized into many tables connected by primary-foreign key relations. However, building machine learning models using this data is both challenging and time-consuming. The core problem is that no machine learning method is capable of learning directly from multiple tables interconnected by primary-foreign key relations. Current methods can only learn from a single table, so the data must first be manually joined and aggregated into a single training table, a process known as feature engineering. Feature engineering is slow, error-prone, and leads to suboptimal models. Here we introduce an end-to-end deep representation learning approach to learn directly on data laid out across multiple tables. We name our approach Relational Deep Learning (RDL). The core idea is to view a relational database as a temporal, heterogeneous graph, with a node for each row in each table and edges specified by primary-foreign key links. Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all input data, without any manual feature engineering. Relational Deep Learning leads to more accurate models that can be built much faster. To facilitate research in this area, we develop RelBench, a set of benchmark datasets and an implementation of Relational Deep Learning. The data covers a wide spectrum, from discussions on Stack Exchange to book reviews on the Amazon Product Catalog. Overall, we define a new research area that generalizes graph machine learning and broadens its applicability to a wide set of AI use cases.
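To make the core idea concrete, the sketch below shows how two tables linked by a primary-foreign key relation can be turned into a heterogeneous graph using PyTorch Geometric's `HeteroData` container: each row becomes a node of its table's type, and each foreign-key reference becomes an edge. This is a minimal illustration under assumed toy table names (`users`, `reviews`) and columns, not the paper's RelBench implementation.

```python
# Minimal sketch: one node type per table, one edge per foreign-key link.
# Table and column names are hypothetical; this is not the RelBench code.
import pandas as pd
import torch
from torch_geometric.data import HeteroData

# Two toy tables joined by a primary-foreign key relation.
users = pd.DataFrame({"user_id": [0, 1, 2], "age": [34.0, 27.0, 51.0]})
reviews = pd.DataFrame({
    "review_id": [0, 1, 2, 3],
    "user_id": [0, 0, 2, 1],   # foreign key into `users`
    "rating": [5.0, 3.0, 4.0, 1.0],
})

data = HeteroData()
# Each row becomes a node; raw columns serve as simple node features here.
data["user"].x = torch.tensor(users[["age"]].values, dtype=torch.float)
data["review"].x = torch.tensor(reviews[["rating"]].values, dtype=torch.float)

# One edge per foreign-key reference (review -> user), plus the reverse
# direction so GNN messages can flow both ways.
src = torch.tensor(reviews["review_id"].values, dtype=torch.long)
dst = torch.tensor(reviews["user_id"].values, dtype=torch.long)
data["review", "written_by", "user"].edge_index = torch.stack([src, dst])
data["user", "writes", "review"].edge_index = torch.stack([dst, src])

print(data)  # a heterogeneous graph ready for a hetero message-passing GNN
```

A heterogeneous GNN (e.g., one produced by `torch_geometric.nn.to_hetero`) can then be trained directly on this graph, which is what allows the approach to skip manual joining and aggregation of tables.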