Towards Accurate and Efficient Document Analytics with Large Language Models (2405.04674v1)

Published 7 May 2024 in cs.DB

Abstract: Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, LLMs directly applied to the documents themselves, or on portions of documents through a process of Retrieval-Augmented Generation (RAG), fail to provide high accuracy query results, and in the LLM-only case, additionally incur high costs. Since many unstructured documents in a collection often follow similar templates that impart a common semantic structure, we introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical structures from such templatized documents, and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Users can impose a schema on their documents, and query it, all via SQL. Extensive experiments on three real-world document collections demonstrate ZenDB's benefits, achieving up to 30% cost savings compared to LLM-based baselines, while maintaining or improving accuracy, and surpassing RAG-based baselines by up to 61% in precision and 80% in recall, at a marginally higher cost.

Summary

  • The paper introduces ZenDB, which extracts semantic hierarchical trees from documents to support ad-hoc SQL queries on unstructured data.
  • It achieves cost savings of up to 30% compared to LLM-based baselines, and surpasses RAG-based baselines by up to 61% in precision and 80% in recall.
  • The methodology integrates LLMs with structured query processing, setting the stage for scalable and efficient document analytics.

Overview of "Towards Accurate and Efficient Document Analytics with Large Language Models"

The paper presents ZenDB, a specialized document analytics system designed to tackle the challenges associated with querying unstructured document collections. It addresses a significant gap in existing approaches that employ LLMs and Retrieval-Augmented Generation (RAG) by leveraging the latent semantic structures often present in such documents.

Challenges in Document Querying

The paper begins by highlighting the prevalence of unstructured data formats, which it reports account for over 80% of stored data, and the difficulty of extracting value from them. Current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Applying LLMs directly to whole documents yields limited accuracy and incurs high costs, especially for queries that involve aggregations and filters over large document contexts. RAG, which passes only retrieved portions of a document to the LLM, is cheaper than the LLM-only approach but struggles to select the relevant text segments because it has limited understanding of document semantics.

The ZenDB Approach

ZenDB introduces a novel system that capitalizes on the semantic hierarchical structures inherent in many unstructured documents. It posits that documents created using similar templates impart a common, useful semantic structure that can be leveraged for querying. ZenDB efficiently extracts these semantic structures and incorporates them into SQL query processing, offering a system capable of imposing and querying schemas over document collections.

ZenDB's architecture is based on the notion of Semantic Hierarchical Trees (SHTs), which represent the semantic structure of a document as a tree. This representation allows the system to map document sections to SQL schema components, facilitating efficient ad-hoc query execution. The paper reports cost savings of up to 30% compared to LLM-based baselines, with accuracy maintained or improved, and gains over RAG-based baselines of up to 61% in precision and 80% in recall, at a marginally higher cost.
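To make the tree idea concrete, here is a minimal sketch of how a document's sections might be held in a tree and how a query could pull only the relevant subtree as LLM context. It is not ZenDB's implementation: the `SHTNode` class, its methods, and the sample document are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class SHTNode:
    """One section of a document: a heading plus the text directly under it.

    Hypothetical structure for illustration; not taken from the paper.
    """
    title: str
    text: str = ""
    children: List["SHTNode"] = field(default_factory=list)

    def walk(self) -> Iterator["SHTNode"]:
        """Yield this node and all descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()

    def find(self, keyword: str) -> Optional["SHTNode"]:
        """Return the first node whose title contains keyword (case-insensitive)."""
        for node in self.walk():
            if keyword.lower() in node.title.lower():
                return node
        return None

    def subtree_text(self) -> str:
        """Concatenate the text of this section and all of its subsections."""
        return "\n".join(n.text for n in self.walk() if n.text)


# Made-up document following a simple template.
doc = SHTNode("Notice of Violation", children=[
    SHTNode("Findings", "The operator failed to inspect the valve."),
    SHTNode("Proposed Penalty", "A civil penalty is proposed.", children=[
        SHTNode("Payment Terms", "Payment is due within 30 days."),
    ]),
])

# Only the matching section and its subsections would be sent to the LLM.
penalty = doc.find("penalty")
context = penalty.subtree_text()
print(context)
```

Restricting the context to one subtree is what keeps the prompt small: the "Findings" text never reaches the model when the query concerns the penalty.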

Methodologies and Results

The authors detail how they transform unstructured documents into SHTs, using consistent visual patterns together with LLMs to extract the semantic structure. They also describe schema definition: users impose a schema on their documents and query it, all via SQL. Experiments on three real-world document collections show that ZenDB reduces cost relative to LLM-based baselines while maintaining or improving accuracy.
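As a rough illustration of how a consistent visual pattern can yield a hierarchy, the sketch below nests headings by font size using a stack. This is a simplified stand-in, not the paper's extraction procedure, which also involves LLMs. The `build_tree` function, the `(font_size, text)` input format, and the sample lines are assumptions made for this example.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]


def build_tree(lines: List[Tuple[float, str]], body_size: float) -> Node:
    """Nest headings by font size: a larger font means a higher level.

    Assumes (hypothetically) that each line arrives as (font_size, text) and
    that any line larger than body_size is a heading.
    """
    # The root has infinite size so it is never popped off the stack.
    root: Node = {"title": "ROOT", "size": float("inf"), "text": [], "children": []}
    stack: List[Node] = [root]
    for size, text in lines:
        if size > body_size:
            # Close every open section that is not strictly larger than this heading.
            while stack[-1]["size"] <= size:
                stack.pop()
            node: Node = {"title": text, "size": size, "text": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Body text belongs to the most recently opened section.
            stack[-1]["text"].append(text)
    return root


# Made-up lines from a templated document.
sample = [
    (18.0, "Meeting Agenda"),
    (10.0, "Called to order at 6 pm."),
    (14.0, "Capital Projects"),
    (10.0, "Road resurfacing update."),
    (14.0, "Public Comment"),
    (10.0, "Two speakers registered."),
]
tree = build_tree(sample, body_size=10.0)
agenda = tree["children"][0]
print([child["title"] for child in agenda["children"]])
```

A font-size rule alone breaks down when a template marks headings by bold weight, numbering, or indentation instead, which is one reason a purely visual heuristic would not suffice in general.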

Implications and Future Directions

In practical terms, ZenDB offers organizations a way to obtain structured insights from collections of templatized documents at lower cost than LLM-only processing and with higher accuracy than RAG. By exploiting semantic document structure, it sets a precedent for future work on leveraging latent document semantics in data analytics.

Theoretically, the paper extends what is achievable with LLMs in data management by combining these models with more traditional data processing structures. Future work might investigate how the approach can be adapted to documents that lack clear templates, or extended beyond text to other unstructured data such as audio and video. More sophisticated methods for automatically recognizing and leveraging latent semantic patterns across a variety of document formats are another promising avenue for research.

Overall, ZenDB contributes a significant advancement in the field of automated document analytics, offering an effective bridge between the capabilities of LLMs and the structured needs of SQL-based querying systems.
