Towards Accurate and Efficient Document Analytics with Large Language Models (2405.04674v1)
Abstract: Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, LLMs applied directly to the documents themselves, or to portions of documents through Retrieval-Augmented Generation (RAG), fail to provide high-accuracy query results, and in the LLM-only case they also incur high costs. Since many unstructured documents in a collection follow similar templates that impart a common semantic structure, we introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical structures from such templatized documents and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Users can impose a schema on their documents and query it, all via SQL. Extensive experiments on three real-world document collections demonstrate ZenDB's benefits: it achieves up to 30% cost savings compared to LLM-based baselines while maintaining or improving accuracy, and it surpasses RAG-based baselines by up to 61% in precision and 80% in recall, at a marginally higher cost.