Efficient Learned Query Execution over Text and Tables [Technical Report]
Abstract: In this paper, we present ELEET, a novel execution engine that allows one to query and process text as a first-class citizen along with tables. To enable this integration of text and tables, ELEET leverages learned multi-modal operators (MMOps), such as joins and unions, that combine structured data with unstructured text. While large language models (LLMs) such as GPT-4 are interesting candidates for such learned multi-modal operations, we deliberately do not follow this trend, since it would result in high overhead at query runtime. Instead, ELEET comes with a more efficient small language model (SLM) that is targeted at extracting structured data from text. Thanks to our novel architecture and pre-training procedure, the ELEET model enables high-accuracy extraction with low overhead. In our evaluation, we compare query execution based on ELEET to baselines leveraging LLMs such as GPT-4 and show that ELEET can speed up multi-modal queries over tables and text by up to 575x without sacrificing accuracy.
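To make the idea of a multi-modal join concrete, the following is a minimal sketch, not ELEET's implementation. It assumes a hypothetical `multimodal_join` operator that matches each table row to the documents mentioning its key value and extends the row with attributes extracted from the text. The extractor here is a simple regular expression standing in for the learned model; the function names, the example table, and the example documents are all illustrative assumptions.

```python
import re
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]


def regex_extractor(text: str, attribute: str) -> Optional[str]:
    """Hypothetical stand-in for a learned extractor.

    Looks for the pattern '<attribute>: <value>' and returns the value,
    or None if the attribute is not mentioned in the text.
    """
    pattern = rf"{re.escape(attribute)}\s*:\s*([^.;\n]+)"
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1).strip() if match else None


def multimodal_join(
    table: List[Row],
    docs: List[str],
    key: str,
    attributes: List[str],
    extract: Callable[[str, str], Optional[str]] = regex_extractor,
) -> List[Row]:
    """Join table rows with documents that mention the row's key value.

    Each output row contains the original columns plus one column per
    requested attribute, filled with the value extracted from the text.
    """
    result: List[Row] = []
    for row in table:
        key_value = (row[key] or "").lower()
        for doc in docs:
            # Naive matching by substring; a real system would do better.
            if not key_value or key_value not in doc.lower():
                continue
            extracted = {attr: extract(doc, attr) for attr in attributes}
            result.append({**row, **extracted})
    return result


# Illustrative data (assumed, not from the paper).
patients = [{"name": "Alice", "age": "34"}, {"name": "Bob", "age": "51"}]
reports = [
    "Patient Alice was admitted. Diagnosis: influenza. Treatment: rest.",
    "Report for Bob. Diagnosis: fracture; Treatment: cast.",
]

rows = multimodal_join(patients, reports, "name", ["diagnosis", "treatment"])
for r in rows:
    print(r)
```

Swapping `regex_extractor` for a model-backed function with the same signature changes only the cost of each `extract` call, which is the part of the query where the choice between a large and a small language model matters for runtime.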
- Matthias Urban and Carsten Binnig. 2024b. ELEET: Efficient Learned Query Execution over Text and Tables. Proc. VLDB Endow. 17, 13 (2024), XXXX–XXXX. https://www.vldb.org/pvldb/vol17/xxxx.pdf