Papers
Topics
Authors
Recent
Search
2000 character limit reached

The Design of an LLM-powered Unstructured Analytics System

Published 1 Sep 2024 in cs.DB, cs.AI, and cs.IR | (2409.00847v3)

Abstract: LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents. At the core of Aryn is Sycamore, a declarative document processing engine, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn includes Luna, a query planner that translates natural language queries to Sycamore scripts, and DocParse, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. We show how these pieces come together to achieve better accuracy than RAG on analytics queries over real world reports from the National Transportation Safety Board (NTSB). Also, given current limitations of LLMs, we argue that an analytics system must provide explainability to be practical, and show how Aryn's user interface does this to help build trust.

Citations (2)

Summary

  • The paper introduces an LLM-powered system that converts natural language queries into detailed semantic analytics pipelines.
  • It integrates a vision-based partitioner, distributed document engine, and a query planner to overcome traditional RAG limitations.
  • The system demonstrates robust performance with superior document segmentation and a 72% accuracy in query planning on real-world data.

Aryn: An LLM-Powered Unstructured Analytics System

This paper introduces Aryn, an LLM-powered unstructured analytics system designed to perform complex semantic analyses on large collections of unstructured documents. Aryn leverages a declarative query processing approach, allowing users to specify queries in natural language, which the system then translates into a semantic plan for execution. The system incorporates several key components, including Sycamore, a distributed document processing engine; Luna, a query planner; and the Aryn Partitioner, which converts raw documents into a structured format suitable for analysis. The paper demonstrates Aryn's capabilities through a real-world use case involving the analysis of NTSB accident reports.

Use Cases, Challenges, and Design Tenets

The paper highlights several enterprise use cases that motivate the development of Aryn. These include building AI assistants for customer support, creating Q&A systems for manufacturing, summarizing patient records, conducting financial analysis, and analyzing legal documents. The key challenge addressed is the limitation of traditional RAG approaches, which struggle with complex reasoning, large datasets, and documents containing tables, graphs, and images. The paper argues that RAG accuracy degrades quickly as one asks more complex questions, adds more data, or works with more complex data. To overcome these limitations, the design of Aryn is guided by three tenets: leveraging LLMs for their strengths while incorporating human expertise, using visual models to enhance document understanding, and ensuring explainability of results.

System Architecture and Components

Figure 1

Figure 1: A high-level architectural overview of the Aryn system, illustrating the interplay between the Aryn Partitioner, Sycamore, and Luna.

Aryn's architecture (Figure 1) comprises three primary components: the Aryn Partitioner, Sycamore, and Luna.

Aryn Partitioner

The Aryn Partitioner is responsible for segmenting raw documents into structured components. It employs a vision-based document segmentation model based on the Deformable DETR architecture, trained on the DocLayNet dataset. The model achieves a mean average precision (mAP) of 0.602 and a mean average recall (mAR) of 0.743 on the DocLayNet competition benchmark, outperforming a document API from a large cloud vendor, which achieved only an mAP of 0.344 with an mAR of 0.466. The partitioner also incorporates table extraction using the Table Transformer model, image processing, and OCR capabilities. Figure 2

Figure 2: An example of the Aryn Partitioner's output on an NTSB accident report, showcasing the identification of tables and cells.

Sycamore

Sycamore is a document processing engine built on DocSets, which are reliable distributed collections of hierarchical documents represented as semantic trees. It provides a set of transformations for ETL and analytics, including flattening, embedding, filtering, summarizing, and extracting information. Sycamore integrates with LLMs to power many of these transformations, with lineage tracking for debugging. The system adopts a Spark-like execution model, with operations pipelined and executed lazily. Sycamore is built on top of the Ray compute framework, which provides primitives for distributing Python objects.

Luna

Luna is the query layer that translates natural language questions into semantic query plans. It uses LLMs to construct query plans that users can inspect and validate, providing explainability and enabling debugging. Luna incorporates a combination of traditional data-processing operators (count, aggregate, join) and semantic operators (llmFilter, llmExtract). The framework achieved a 72% accuracy on a micro-benchmark using questions from financial customers on an earnings report dataset, and questions for the NTSB reports. Figure 3

Figure 3: A visual representation of Luna's query execution workflow, illustrating the steps involved in processing a natural language query.

The paper references existing work on natural language to SQL translation, LLM-based query generation, and the use of LLMs for specific ETL tasks. It differentiates Aryn by its broader set of LLM-based operations, focus on hierarchical datasets, and emphasis on interactive interfaces. The authors position Aryn within the landscape of end-to-end systems like ZenDB, LOTUS, EVAPORATE, CHORUS and Palimpzest, emphasizing Aryn's human-in-the-loop paradigm. The Aryn Partitioner is contextualized within research on document segmentation and object detection models like DETR and Deformable DETR, as well as multi-modal models such as Donut and LayoutLMV3.

Conclusion

Aryn represents a significant step toward unifying unstructured and structured analytics by leveraging LLMs to process multi-modal datasets. The system's human-in-the-loop design acknowledges the limitations of current models while providing a path for future improvements. Key areas for future work include developing more robust data and querying abstractions, exploring knowledge graphs, and building more scalable systems for processing large multi-modal repositories.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.