- The paper describes how Aryn combines LLMs with human-in-the-loop oversight to enable complex semantic analytics over unstructured documents.
- It introduces a multi-component architecture that partitions documents, processes data with a distributed engine, and translates natural language queries into executable plans.
- The design addresses the limitations of traditional RAG by integrating vision models and explainability features aimed at enterprise-grade analytics.
The paper "The Design of an LLM-powered Unstructured Analytics System" (2409.00847) describes Aryn, an end-to-end system designed to enable complex, semantic analytics over large collections of unstructured documents using LLMs. The authors argue that while LLMs are powerful, existing approaches like simple Retrieval-Augmented Generation (RAG) are insufficient for enterprise use cases that require complex reasoning across many documents with varied structures.
Problem and Motivation
Enterprises increasingly need to perform analytical queries on unstructured data like PDFs, reports, and transcripts. These queries go beyond simple search ("hunt and peck") to involve "sweep and harvest" patterns (analyzing information across many documents, then synthesizing) and "data integration" patterns (combining unstructured data with structured data). Existing RAG systems face limitations including:
- Limited LLM Context: LLMs struggle to synthesize information across vast document collections due to context window constraints and the "lost in the middle" phenomenon [lost-in-middle-2020].
- Embedding Model Limitations: Embedding models, crucial for RAG retrieval, may not accurately capture semantic relationships needed for complex queries, especially those involving time, hierarchy, or categories. Accuracy deteriorates with more data as vectors become less discriminative.
- Handling Complex Documents: RAG typically relies on simple text extraction, which fails to preserve the structure and semantics of complex documents containing tables, figures, and images.
Aryn's Design Principles (Tenets)
To address these challenges, Aryn is built on three key tenets:
- Leverage LLMs Strategically: Use LLMs for tasks they excel at (semantic analysis, extraction, planning) but complement them with human expertise for validation and refinement.
- Incorporate Visual Models: Utilize vision models to better understand document structure and extract information from non-textual elements like tables and images.
- Ensure Explainability: Provide users with visibility into how answers are computed, including the query plan and data provenance, to build trust and facilitate debugging.
Architecture Overview
Aryn's architecture consists of three main components (Figure 1):
- Aryn Partitioner: Processes raw documents (like PDFs) to extract structure and content, transforming them into a structured format.
- Sycamore: A declarative document processing engine built on Ray, managing distributed collections of documents called DocSets. It handles ETL, enrichment, and analytical transformations.
- Luna: The query layer that translates natural language queries into semantic query plans, executes them using Sycamore, and provides traceability.
Aryn Partitioner
This is the first step in the pipeline (Section 4). It segments documents into logical chunks (paragraphs, tables, images) and labels them. The Aryn Partitioner utilizes a custom vision model based on Deformable DETR [zhu2020deformable] trained on DocLayNet [doclaynet] for improved accuracy compared to off-the-shelf tools. Beyond basic segmentation, it includes:
- Table Extraction: Uses models like Table Transformer [table-transformer-2022] to identify cell bounding boxes and link them to text content.
- Image Processing: Uses multi-modal LLMs to extract metadata and summarize image contents.
- OCR: Applies OCR models (like EasyOCR) to identified text regions within images or scanned documents.
The output is structured data representing the document's layout and content, consumable as JSON or by Sycamore.
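To make this concrete, the following is a hedged sketch of the kind of element list the partitioner might emit for a single page. The field names and values here are illustrative assumptions, not the exact output schema:

```python
# Hypothetical illustration of the partitioner's structured output for one page.
# Field names ("type", "bbox", "text", "properties") are assumptions for this sketch.
partitioned_page = [
    {
        "type": "Title",
        "bbox": [72.0, 54.0, 540.0, 90.0],   # page coordinates of the chunk
        "text": "Aviation Incident Report",
        "properties": {"page_number": 1},
    },
    {
        "type": "Table",
        "bbox": [72.0, 120.0, 540.0, 310.0],
        "text": None,                          # table content is captured per cell instead
        "properties": {
            "rows": 4,
            "columns": 3,
            "cells": [{"row": 0, "col": 0, "text": "Date"}],  # truncated for brevity
        },
    },
    {
        "type": "Image",
        "bbox": [72.0, 330.0, 300.0, 520.0],
        "text": "LLM-generated summary describing the photo.",
        "properties": {"ocr_applied": True},
    },
]

# Downstream consumers (e.g., Sycamore) read this as ordinary JSON.
print(len(partitioned_page), "elements on page 1")
```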
Sycamore
Sycamore is the core document processing engine (Section 5). It's a distributed dataflow system conceptually similar to Apache Spark [zaharia2012spark] but tailored for unstructured, multi-modal documents.
- Data Model: Documents are represented as hierarchical "semantic trees." Each node has content, ordered children, and JSON properties. Leaf nodes are "elements" corresponding to chunks like paragraphs, tables, or images, with type-specific properties (e.g., table rows/columns). DocSets are reliable, distributed collections of these hierarchical documents (a minimal illustrative sketch of this data model follows this list).
- Programming Model: Users write Python scripts using functional transformations on DocSets (Table 1). Transformations are categorized:
- Core: Standard operations like `map`, `filter`, `flat_map`.
- Structural: Modify document hierarchy (`partition`, `explode`).
- Analytic: Operate on document properties (`reduce_by_key`, `sort`).
- LLM-powered: Leverage LLMs for enrichment and analysis (`LLM_query`, `extract_properties`, `summarize`, `embed`). These transforms handle details like prompting, retries, and parsing LLM output; a sketch of this wrapping appears after the example script below.
- Execution: Sycamore uses a lazy evaluation model and is built on the Ray compute framework [moritz2018ray] for distributed processing, leveraging Ray's Python integration.
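The hierarchical data model above can be pictured with a few plain Python dataclasses. This is a minimal sketch for intuition only; the class and field names are assumptions, not Sycamore's actual API:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class Element:
    """Leaf chunk of a document: a paragraph, table, image, etc."""
    type: str                                   # e.g. "paragraph", "table", "image"
    text: Optional[str] = None                  # textual content, if any
    properties: Dict[str, Any] = field(default_factory=dict)  # type-specific metadata

@dataclass
class DocNode:
    """Interior node of the semantic tree (e.g., a section), with ordered children."""
    properties: Dict[str, Any] = field(default_factory=dict)
    children: List["DocNode | Element"] = field(default_factory=list)

@dataclass
class Document:
    """Root of one document's semantic tree; a DocSet is a distributed collection of these."""
    doc_id: str
    properties: Dict[str, Any] = field(default_factory=dict)
    children: List["DocNode | Element"] = field(default_factory=list)

# Example: a report with one section holding a paragraph and a table element.
doc = Document(
    doc_id="ntsb-0001",
    properties={"us_state": "WA"},
    children=[DocNode(
        properties={"title": "Probable Cause"},
        children=[
            Element(type="paragraph", text="The pilot encountered severe wind shear..."),
            Element(type="table", properties={"rows": 3, "columns": 2}),
        ],
    )],
)
```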
An example Sycamore script demonstrates reading NTSB data, partitioning it, extracting properties using an LLM (`extract_properties` with GPT-4o), exploding documents into chunks, and embedding chunks (`embed` with text-embedding-3-small) for vector indexing (Figure 6):
```python
schema = {
    "us_state": "string", "probable_cause": "string", "weather_related": "bool"
}
ds = (
    context.read.binary("/path/to/ntsb_data")
    .partition(ArynPartitioner())
    .extract_properties(
        OpenAIPropertyExtractor("gpt-4o", schema=schema))
    .explode()
    .embed(OpenAIEmbedder("text-embedding-3-small"))
)
```
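The LLM-powered transforms in this script (`extract_properties`, `embed`) hide prompting, retry, and output-parsing details from the user. The following self-contained sketch shows how that kind of wrapping might look for property extraction; the function names and the `call_llm` stub are hypothetical, not Sycamore's implementation:

```python
import json
import time
from typing import Any, Callable, Dict

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    # A real system would send `prompt` to a model and return its reply.
    return '{"us_state": "WA", "weather_related": true}'

def extract_properties_from_text(text: str, schema: Dict[str, str],
                                 llm: Callable[[str], str] = call_llm,
                                 max_retries: int = 3) -> Dict[str, Any]:
    """Prompt an LLM to pull schema-conforming properties out of raw text,
    retrying when the reply is not valid JSON."""
    prompt = (
        "Extract the following fields as a JSON object.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Text:\n{text}\n"
        "Reply with JSON only."
    )
    for attempt in range(max_retries):
        reply = llm(prompt)
        try:
            properties = json.loads(reply)
            # Keep only the keys the schema asked for, dropping anything extra.
            return {k: v for k, v in properties.items() if k in schema}
        except json.JSONDecodeError:
            time.sleep(2 ** attempt)   # simple backoff before retrying
    raise RuntimeError("LLM did not return parseable JSON after retries")

# Example usage with the stub model above.
props = extract_properties_from_text(
    "The accident occurred near Seattle, WA during a thunderstorm.",
    schema={"us_state": "string", "weather_related": "bool"},
)
print(props)  # {'us_state': 'WA', 'weather_related': True}
```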
Luna
Luna is the natural language query layer for unstructured analytics (Section 6). It aims to provide a declarative interface over complex data, analogous to SQL for relational databases.
- Goal: Allow users to ask complex analytical questions in natural language without needing to understand the underlying data structure or Sycamore primitives.
- Human-in-the-Loop: Unlike fully automated systems, Luna exposes the query plan and execution trace, allowing users to inspect, validate, and modify the process, addressing potential misinterpretations of ambiguous queries.
- Architecture:
- Query Operators: Combines traditional operators (count, aggregate) with semantic, LLM-powered operators (`LLMFilter`, `LLMExtract`).
- Data Schema: Uses the structured data and extracted properties from Sycamore to inform query planning.
- Natural Language Interface: Accepts user queries in NL and presents plans/results in NL.
- Query Planning: An LLM translates the NL query into a JSON representation of a DAG of operations using a fixed set of operators.
- Plan Optimization: Optimizes the generated plan based on cost (latency, computation, monetary) and efficiency, choosing appropriate techniques and models.
- Execution: Translates the optimized plan into Sycamore Python code, which is executed on Ray.
- Traceability and Debugging: Provides detailed logs, execution history, and APIs for users to understand and modify the query execution process.
- A sample query ("What percent of environmentally caused incidents were due to wind?") is translated into a plan involving filtering, counting, and a mathematical operation (Figure 9), resulting in Sycamore code:
```python
out_0 = context.read.opensearch(index_name="ntsb")
out_1 = out_0.filter("caused by environmental factors")
out_2 = out_1.count()
out_3 = out_0.filter("caused by wind")
out_4 = out_3.count()
result = math_operation(expr="100 * {out_4}/{out_2}")
```
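For intuition, the JSON plan that Luna's planner produces for this query might resemble the structure below. The operator names and field layout are assumptions for illustration; the paper describes the plan only as a JSON-encoded DAG over a fixed operator set:

```python
# Hypothetical JSON representation of the plan DAG for the wind-percentage query.
# Each node names an operator, its parameters, and the upstream nodes it consumes.
luna_plan = {
    "nodes": [
        {"id": "out_0", "op": "QueryDatabase", "index": "ntsb", "inputs": []},
        {"id": "out_1", "op": "LLMFilter",
         "question": "Was the incident caused by environmental factors?",
         "inputs": ["out_0"]},
        {"id": "out_2", "op": "Count", "inputs": ["out_1"]},
        {"id": "out_3", "op": "LLMFilter",
         "question": "Was the incident caused by wind?",
         "inputs": ["out_0"]},
        {"id": "out_4", "op": "Count", "inputs": ["out_3"]},
        {"id": "result", "op": "Math",
         "expression": "100 * {out_4} / {out_2}",
         "inputs": ["out_2", "out_4"]},
    ]
}
```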
Related Work
The paper contrasts Aryn with related efforts, including NL-to-SQL systems, LLMs for ETL tasks, knowledge graph construction, and other end-to-end LLM-powered data systems (ZenDB, LOTUS, etc.). Aryn is differentiated by its focus on hierarchical documents, deep integration of vision models, a comprehensive set of LLM-powered operations for both ETL and analytics, and its explicit design for a human-in-the-loop workflow. The Partitioner specifically builds upon document segmentation research using DETR variants but emphasizes higher fidelity for enterprise documents.
Conclusions and Future Work
Aryn is presented as a system unifying unstructured and structured analytics by leveraging LLMs' potential while acknowledging their current limitations through a human-in-the-loop design. The authors see challenges remaining in developing more robust data abstractions (potentially Knowledge Graphs), building scalable systems for large multi-modal data, and seamlessly integrating with existing structured data sources. The system is open source and available on GitHub.