
The Design of an LLM-powered Unstructured Analytics System (2409.00847v3)

Published 1 Sep 2024 in cs.DB, cs.AI, and cs.IR

Abstract: LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents. At the core of Aryn is Sycamore, a declarative document processing engine, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn includes Luna, a query planner that translates natural language queries to Sycamore scripts, and DocParse, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. We show how these pieces come together to achieve better accuracy than RAG on analytics queries over real world reports from the National Transportation Safety Board (NTSB). Also, given current limitations of LLMs, we argue that an analytics system must provide explainability to be practical, and show how Aryn's user interface does this to help build trust.


Summary

  • The paper demonstrates Aryn’s innovative use of LLMs combined with human-in-the-loop oversight to enable complex semantic analytics.
  • It introduces a multi-component architecture that partitions documents, processes data with a distributed engine, and translates natural language queries into executable plans.
  • The design addresses traditional RAG limitations by integrating vision models and explainability features to support enterprise-grade analytics.

The paper "The Design of an LLM-powered Unstructured Analytics System" (2409.00847) describes Aryn, an end-to-end system designed to enable complex, semantic analytics over large collections of unstructured documents using LLMs. The authors argue that while LLMs are powerful, existing approaches like simple Retrieval-Augmented Generation (RAG) are insufficient for enterprise use cases that require complex reasoning across many documents with varied structures.

Problem and Motivation

Enterprises increasingly need to perform analytical queries on unstructured data like PDFs, reports, and transcripts. These queries go beyond simple search ("hunt and peck") to involve "sweep and harvest" patterns (analyzing information across many documents, then synthesizing) and "data integration" patterns (combining unstructured data with structured data). Existing RAG systems face limitations including:

  1. Limited LLM Context: LLMs struggle to synthesize information across vast document collections due to context window constraints and the "lost in the middle" phenomenon [lost-in-middle-2020].
  2. Embedding Model Limitations: Embedding models, crucial for RAG retrieval, may not accurately capture semantic relationships needed for complex queries, especially those involving time, hierarchy, or categories. Accuracy deteriorates with more data as vectors become less discriminative.
  3. Handling Complex Documents: RAG typically relies on simple text extraction, which fails to preserve the structure and semantics of complex documents containing tables, figures, and images.

Aryn's Design Principles (Tenets)

To address these challenges, Aryn is built on three key tenets:

  1. Leverage LLMs Strategically: Use LLMs for tasks they excel at (semantic analysis, extraction, planning) but complement them with human expertise for validation and refinement.
  2. Incorporate Visual Models: Utilize vision models to better understand document structure and extract information from non-textual elements like tables and images.
  3. Ensure Explainability: Provide users with visibility into how answers are computed, including the query plan and data provenance, to build trust and facilitate debugging.

Architecture Overview

Aryn's architecture consists of three main components (Figure 1):

  • Aryn Partitioner: Processes raw documents (like PDFs) to extract structure and content, transforming them into a structured format.
  • Sycamore: A declarative document processing engine built on Ray, managing distributed collections of documents called DocSets. It handles ETL, enrichment, and analytical transformations.
  • Luna: The query layer that translates natural language queries into semantic query plans, executes them using Sycamore, and provides traceability.

Aryn Partitioner

This is the first step in the pipeline (Section 4). It segments documents into logical chunks (paragraphs, tables, images) and labels them. The Aryn Partitioner utilizes a custom vision model based on Deformable DETR [zhu2020deformable] trained on DocLayNet [doclaynet] for improved accuracy compared to off-the-shelf tools. Beyond basic segmentation, it includes:

  • Table Extraction: Uses models like Table Transformer [table-transformer-2022] to identify cell bounding boxes and link them to text content.
  • Image Processing: Uses multi-modal LLMs to extract metadata and summarize image contents.
  • OCR: Applies OCR models (like EasyOCR) to identified text regions within images or scanned documents.

The Partitioner's output is structured data representing the document's layout and content, consumable as JSON or by Sycamore.
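As a rough sketch, invoking the Partitioner through Sycamore might look like the following; the import path and option names (extract_table_structure, use_ocr) are assumptions meant to illustrate the capabilities above, not a verified API:

import sycamore
from sycamore.transforms.partition import ArynPartitioner

# Read raw PDFs and partition them into labeled, structured chunks.
# Option names below are illustrative assumptions, not a verified API.
context = sycamore.init()
docset = (
    context.read.binary("/path/to/ntsb_data", binary_format="pdf")
    .partition(partitioner=ArynPartitioner(
        extract_table_structure=True,  # table cell detection (e.g., Table Transformer)
        use_ocr=True,                  # OCR for scanned pages and image regions
    ))
)
docset.show()  # inspect the resulting elements (paragraphs, tables, images)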

Sycamore

Sycamore is the core document processing engine (Section 5). It's a distributed dataflow system conceptually similar to Apache Spark [zaharia2012spark] but tailored for unstructured, multi-modal documents.

  • Data Model: Documents are represented as hierarchical "semantic trees." Each node has content, ordered children, and JSON properties. Leaf nodes are "elements" corresponding to chunks like paragraphs, tables, or images, with type-specific properties (e.g., table rows/columns). DocSets are reliable, distributed collections of these hierarchical documents.
  • Programming Model: Users write Python scripts using functional transformations on DocSets (Table 1). Transformations are categorized:
    • Core: Standard operations like map, filter, flat_map.
    • Structural: Modify document hierarchy (partition, explode).
    • Analytic: Operate on document properties (reduce_by_key, sort).
    • LLM-powered: Leverage LLMs for enrichment and analysis (LLM_query, extract_properties, summarize, embed). These transforms handle details like prompting, retries, and parsing LLM output.
  • Execution: Sycamore uses a lazy evaluation model and is built on the Ray compute framework [moritz2018ray] for distributed processing, leveraging Ray's Python integration. An example Sycamore script (Figure 6, reproduced below) demonstrates reading NTSB data, partitioning it, extracting properties using an LLM (extract_properties with GPT-4o), exploding documents into chunks, and embedding chunks (embed with text-embedding-3-small) for vector indexing:

# Target schema for LLM-based property extraction from each NTSB report.
schema = {
    "us_state": "string", "probable_cause": "string", "weather_related": "bool"
}

ds = (
    context.read.binary("/path/to/ntsb_data")   # read raw PDFs into a DocSet
    .partition(ArynPartitioner())               # segment and label document chunks
    .extract_properties(
        OpenAIPropertyExtractor("gpt-4o", schema=schema))  # extract schema fields with GPT-4o
    .explode()                                   # flatten documents into per-chunk records
    .embed(OpenAIEmbedder("text-embedding-3-small"))       # embed chunks for vector indexing
)
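To make the DocSet data model concrete, here is a hypothetical sketch of one partitioned document as a semantic tree; the field names and values are illustrative, not Sycamore's actual serialization:

# Hypothetical shape of a single document in a DocSet after partitioning and
# property extraction. Field names and values are illustrative only.
doc = {
    "doc_id": "ntsb-report-0421",            # illustrative identifier
    "properties": {                          # document-level JSON properties
        "us_state": "CO",
        "probable_cause": "loss of engine power",
        "weather_related": True,
    },
    "elements": [                            # ordered leaf "elements" (chunks)
        {"type": "Title", "text": "Aviation Incident Report"},
        {"type": "Text",  "text": "On final approach, the aircraft ..."},
        {"type": "Table", "properties": {"rows": 12, "columns": 4}},
        {"type": "Image", "properties": {"summary": "photo of damaged wing"}},
    ],
}

Under this model, explode promotes each element to a top-level document, which is what allows the script above to embed individual chunks for vector indexing.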

Luna

Luna is the natural language query layer for unstructured analytics (Section 6). It aims to provide a declarative interface over complex data, analogous to SQL for relational databases.

  • Goal: Allow users to ask complex analytical questions in natural language without needing to understand the underlying data structure or Sycamore primitives.
  • Human-in-the-Loop: Unlike fully automated systems, Luna exposes the query plan and execution trace, allowing users to inspect, validate, and modify the process, addressing potential misinterpretations of ambiguous queries.
  • Architecture:
    • Query Operators: Combines traditional operators (count, aggregate) with semantic, LLM-powered operators (LLMFilter, LLMExtract).
    • Data Schema: Uses the structured data and extracted properties from Sycamore to inform query planning.
    • Natural Language Interface: Accepts user queries in NL and presents plans/results in NL.
    • Query Planning: An LLM translates the NL query into a JSON representation of a DAG of operations using a fixed set of operators.
    • Plan Optimization: Optimizes the generated plan based on cost (latency, computation, monetary) and efficiency, choosing appropriate techniques and models.
    • Execution: Translates the optimized plan into Sycamore Python code, which is executed on Ray.
    • Traceability and Debugging: Provides detailed logs, execution history, and APIs for users to understand and modify the query execution process.
A sample query ("What percent of environmentally caused incidents were due to wind?") is translated into a plan involving filtering, counting, and a mathematical operation (Figure 9), resulting in the following Sycamore code:

out_0 = context.read.opensearch(index_name="ntsb")       # load the indexed NTSB DocSet
out_1 = out_0.filter("caused by environmental factors")  # semantic (LLM-powered) filter
out_2 = out_1.count()                                    # count environmentally caused incidents
out_3 = out_0.filter("caused by wind")                   # semantic filter for wind causes
out_4 = out_3.count()                                    # count wind-caused incidents
result = math_operation(expr="100 * {out_4}/{out_2}")    # compute the percentage
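For illustration, the JSON plan that Luna's planner might emit for this query could look roughly like the following; the operator names and field layout are assumptions based on the paper's description of a DAG over a fixed operator set, not Luna's exact wire format:

# Hypothetical JSON plan (expressed as a Python dict) for the query above.
# Operator names and fields are assumptions, not Luna's actual format.
plan = {
    "nodes": [
        {"id": 0, "op": "QueryDatabase", "index": "ntsb"},
        {"id": 1, "op": "LLMFilter", "inputs": [0],
         "question": "Was the incident caused by environmental factors?"},
        {"id": 2, "op": "Count", "inputs": [1]},
        {"id": 3, "op": "LLMFilter", "inputs": [0],
         "question": "Was the incident caused by wind?"},
        {"id": 4, "op": "Count", "inputs": [3]},
        {"id": 5, "op": "Math", "inputs": [4, 2],
         "expression": "100 * {4} / {2}"},
    ],
    "result": 5,
}

Each node maps onto one line of the generated Sycamore code, which is what makes the plan inspectable and editable in the human-in-the-loop workflow.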

Related Work

The paper contrasts Aryn with related efforts, including NL-to-SQL systems, LLMs for ETL tasks, knowledge graph construction, and other end-to-end LLM-powered data systems (ZenDB, LOTUS, etc.). Aryn is differentiated by its focus on hierarchical documents, deep integration of vision models, a comprehensive set of LLM-powered operations for both ETL and analytics, and its explicit design for a human-in-the-loop workflow. The Partitioner specifically builds upon document segmentation research using DETR variants but emphasizes higher fidelity for enterprise documents.

Conclusions and Future Work

Aryn is presented as a system unifying unstructured and structured analytics by leveraging LLMs' potential while acknowledging their current limitations through a human-in-the-loop design. The authors see challenges remaining in developing more robust data abstractions (potentially Knowledge Graphs), building scalable systems for large multi-modal data, and seamlessly integrating with existing structured data sources. The system is open source and available on GitHub.