LatteReview Framework: Automated Review
- LatteReview is a Python-based multi-agent framework that automates literature screening, inclusion/exclusion decisions, and data extraction using LLMs.
- Its layered architecture features provider abstraction, dedicated reviewer agents, and a workflow engine that efficiently coordinates data processing and consensus.
- The framework ensures scalability and reliability through asynchronous execution, Pydantic-based validation, and support for multimodal inputs and retrieval-augmented generation.
LatteReview is a Python-based, multi-agent framework designed to automate and streamline the systematic review and meta-analysis process using LLMs. It addresses the labor-intensive tasks of literature screening, inclusion/exclusion decisions, relevance scoring, and structured data extraction by orchestrating modular reviewer agents and supporting both cloud-based and local LLM backends. The framework achieves extensibility, scalability, and rigorous validation through a combination of modular agents, flexible workflow orchestration, retrieval-augmented generation (RAG), multimodal input handling, and Pydantic-based structured I/O.
1. Architectural Overview and Foundations
LatteReview is centered on a layered architecture comprising a provider abstraction for LLM backends, a suite of reviewer agents for specialized tasks, and a workflow engine that coordinates reviews on tabular data sources. The core entities are:
- Providers Layer: Abstracts API access to both commercial and open-source LLMs (OpenAI, Gemini, Anthropic, Ollama, LiteLLM).
- Reviewer Agents: Encapsulate discrete review tasks (e.g., title/abstract screening, relevance scoring, key-value data abstraction). Agents are decoupled, modular, and orchestrated in sequential or parallel rounds.
- Workflow Engine: Orchestrates multi-round review pipelines, managing execution order, dynamic filtering, and aggregation of reviewer outputs on a Pandas DataFrame. Supports integration of both sequential and parallel review strategies.
The architecture realizes comprehensive modularity, allowing extensible orchestration of both built-in and user-defined reviewer agents, as well as compatibility with various model providers.
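The provider layer can be pictured as a thin interface that every backend adapter implements. The following self-contained sketch uses illustrative names (not LatteReview's actual classes) to show why reviewer agents stay decoupled from any particular LLM vendor:

```python
from typing import Protocol


class Provider(Protocol):
    """Minimal interface every LLM backend adapter exposes."""

    def generate(self, prompt: str) -> str:
        ...


class EchoProvider:
    """Stand-in backend used here instead of a real OpenAI/Ollama client."""

    def generate(self, prompt: str) -> str:
        # A real provider would call its API; we just echo for illustration.
        return f"response to: {prompt}"


def run_review(provider: Provider, prompt: str) -> str:
    # Reviewer agents depend only on the Provider interface, so any
    # backend (cloud or local) can be swapped in without code changes.
    return provider.generate(prompt)
```

Because agents program against the interface rather than a concrete client, switching from a commercial API to a local model is a one-line configuration change.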
2. Agent Taxonomy and Modularity
LatteReview defines a distinct ontology of reviewer agent classes, each designed to perform a specialized function within systematic review automation:
- BaseReviewer: Abstract class responsible for prompt construction, I/O validation (typically via Pydantic), and provider interfacing.
- ScoringReviewer: Implements numerical scoring and returns structured outputs comprising score, reasoning, and certainty.
- TitleAbstractReviewer: Specialized for N-point (typically 5-level) include/exclude decisions against domain-specific inclusion/exclusion criteria; output is both a categorical recommendation and associated reasoning.
- AbstractionReviewer: Extracts structured key-value fields from free text (title, abstract, full-text, or multimodal) with user-specified abstraction keys.
- CustomReviewer: Allows users to subclass the basic reviewer interface to inject custom logic, prompts, validation, or new review modalities.
Reviewer agents are invoked per input item in a DataFrame, yielding additional columns with validated (often JSON-structured) outputs. Agents may operate on text alone, or jointly on text and images. Workflows can route outputs from parallel reviewers to consensus rounds or expert adjudication.
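The agent taxonomy above can be sketched with plain Python classes. This is a simplified stand-in for LatteReview's Pydantic-backed implementations; class and field names follow the conventions described in the text but are illustrative:

```python
from abc import ABC, abstractmethod


class BaseReviewer(ABC):
    """Builds the prompt, calls the provider, validates the output."""

    def __init__(self, name, provider):
        self.name = name
        self.provider = provider  # any callable returning a parsed dict

    @abstractmethod
    def build_prompt(self, item: dict) -> str: ...

    @abstractmethod
    def validate(self, raw: dict) -> dict: ...

    def review(self, item: dict) -> dict:
        prompt = self.build_prompt(item)
        raw = self.provider(prompt)
        return self.validate(raw)


class ScoringReviewer(BaseReviewer):
    """Returns a structured score/reasoning/certainty record."""

    def build_prompt(self, item):
        return f"Score this abstract from 1 to 5: {item['abstract']}"

    def validate(self, raw):
        # Coerce and check types before the result enters the DataFrame.
        return {
            "score": int(raw["score"]),
            "reasoning": str(raw["reasoning"]),
            "certainty": float(raw["certainty"]),
        }
```

A fake provider (e.g. `lambda prompt: {"score": 4, "reasoning": "relevant", "certainty": 0.9}`) is enough to exercise the pattern without any API key.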
3. Workflow Engine and Process Orchestration
Workflows in LatteReview are defined as ordered schemas, with each round described by the following fields: round label, list of reviewers, input DataFrame columns, and optional filtering lambda. The engine processes each step as follows:
- Extracts inputs for the round.
- Dispatches all reviewer agents in parallel per input item, typically as concurrent API calls.
- Aggregates outputs and appends new columns to the DataFrame.
- Applies optional filters (e.g., flagging disagreements or low-certainty reviews for further rounds).
- Repeats for subsequent rounds.
The following illustrates the process (pseudocode):
```python
def workflow(dataframe, workflow_schema):
    for step in workflow_schema:
        inputs = dataframe[step["text_inputs"]]
        # Parallel review: every reviewer evaluates each input item concurrently
        outputs = parallel_map(
            lambda row: {
                reviewer.name: reviewer.review(row)
                for reviewer in step["reviewers"]
            },
            inputs,
        )
        dataframe = dataframe.join(outputs)
        if "filter" in step and step["filter"]:
            dataframe = dataframe[dataframe.apply(step["filter"], axis=1)]
    return dataframe
```
This procedural design supports consensus models, expert review of disagreements, and subsetting data for downstream abstraction rounds or quality control.
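A round filter of the kind described, routing disagreements between two parallel reviewers to an expert adjudication round, might look like this (the column names are illustrative, not LatteReview's fixed schema):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "junior1_decision": ["include", "exclude", "include"],
    "junior2_decision": ["include", "include", "include"],
})

# Round filter: keep only items where the two junior reviewers disagree,
# so an expert reviewer adjudicates them in the next round.
disagreement = lambda row: row["junior1_decision"] != row["junior2_decision"]
expert_queue = df[df.apply(disagreement, axis=1)]
```

Only `Paper B` survives the filter here, so the expert round runs on a small fraction of the corpus, which is what keeps multi-round workflows cheap.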
4. LLM Integration, Retrieval-Augmented Generation, and Multimodal Inputs
LatteReview facilitates integration with a spectrum of LLM backends by supporting the following providers:
- OpenAIProvider (including vision-capable GPT-4 variants)
- OllamaProvider (locally hosted open-weight models such as Llama)
- LiteLLMProvider (unified interface for Gemini, Anthropic, etc.)
API credentials are provided via environment variables. Additionally, the framework enables retrieval-augmented generation (RAG): reviewer agents accept an additional_context parameter that may be a static string or a callable fetching dynamic, item-specific external knowledge (e.g., matching full-text sources, domain KBs). For multimodal reviews, agents can consume image bytes or paths alongside text, with appropriate routing to vision-capable models and validated file handling upstream; outputs from text and multimodal reviewers are unified into the final DataFrame schema.
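The additional_context mechanism accepts either a fixed string or a per-item callable. A self-contained sketch (the in-memory knowledge base and function signatures are illustrative, not LatteReview's exact API):

```python
# Tiny in-memory knowledge base standing in for a real retrieval backend.
KNOWLEDGE_BASE = {
    "cardiac MRI": "Prior reviews report AUC gains from domain context.",
    "default": "No additional context available.",
}


def fetch_context(item: dict) -> str:
    """Callable additional_context: looks up item-specific external knowledge."""
    return KNOWLEDGE_BASE.get(item.get("topic", ""), KNOWLEDGE_BASE["default"])


def build_prompt(item: dict, additional_context) -> str:
    # Accept either a static string or a callable, mirroring the two
    # forms of additional_context the framework supports.
    if callable(additional_context):
        context = additional_context(item)
    else:
        context = additional_context
    return f"Context: {context}\nAbstract: {item['abstract']}"
```

The callable form is what makes the RAG pathway item-specific: each row of the DataFrame can pull in its own matching full text or domain snippet.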
5. I/O Validation, Type Safety, and Structured Output
All reviewers leverage Pydantic schemas for both their task inputs and outputs. This approach guarantees:
- Type safety: structured fields (e.g., `score: int`, `reasoning: str`, `certainty: float`) are enforced at every interface boundary.
- Downstream interoperability: LLM outputs arrive as JSON and are converted to validated formats accessible via Pandas DataFrame columns and serialized outputs.
- Parsing reliability: agents validate the model output before it enters the DataFrame, e.g.:

```python
response = llm.parse_json()               # parse the raw LLM reply as JSON
validated = ScoringResponse(**response)   # enforce the Pydantic schema
```
This ensures that all intermediate and final outputs conform to explicit schemas suitable for use in downstream synthesis or meta-analytic workflows.
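A minimal Pydantic schema of the kind described, following the score/reasoning/certainty convention used throughout this section (the exact model in LatteReview may differ):

```python
from pydantic import BaseModel, ValidationError


class ScoringResponse(BaseModel):
    score: int
    reasoning: str
    certainty: float


# A well-formed LLM reply parses into a typed, validated object.
ok = ScoringResponse(score=4, reasoning="clearly on-topic", certainty=0.85)

# A malformed reply fails loudly instead of silently corrupting the DataFrame.
try:
    ScoringResponse(score="not a number", reasoning="?", certainty=0.5)
    parse_failed = False
except ValidationError:
    parse_failed = True
```

Failing at the validation boundary, rather than at synthesis time, is what makes large multi-round runs auditable.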
6. Scalability, Asynchronous Execution, and Performance
LatteReview leverages Python's asyncio and semaphore primitives for scalable concurrency, enabling high-throughput review of large datasets. Each agent implements asynchronous review of item sets, with global concurrency capped to comply with provider rate limits, so throughput scales roughly linearly with the configured concurrency limit up to those rate caps.
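The concurrency pattern can be sketched as follows; the simulated call stands in for a real provider request, and the cap value is illustrative:

```python
import asyncio

MAX_CONCURRENT = 5  # cap chosen to respect provider rate limits


async def review_item(item: str, semaphore: asyncio.Semaphore) -> str:
    async with semaphore:        # at most MAX_CONCURRENT calls in flight
        await asyncio.sleep(0)   # stands in for the actual API round-trip
        return f"reviewed:{item}"


async def review_all(items):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(review_item(i, semaphore) for i in items))


results = asyncio.run(review_all([f"item{i}" for i in range(12)]))
```

`asyncio.gather` preserves input order, so reviewer outputs can be joined back onto the DataFrame positionally even though the underlying calls complete out of order.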
Empirically, the framework processes 1,000 title/abstract pairs with two parallel junior + one expert reviewer in approximately 1 minute at a cost of $1.20 (OpenAI Tier 4 pricing). Balanced strategies offer AUCs ranging from 0.77 to 0.95 on the SYNERGY Collection and up to 0.94 on a custom cardiothoracic imaging dataset, depending on inclusion/exclusion complexity.
7. Installation, Extensibility, and Applications
LatteReview is distributed as a pip-installable Python package with optional developer and comprehensive extras. Installation and basic usage involve minimal setup (API keys, workflow schema definitions), with clear extensibility pathways:
- Custom Agents: Subclass and register via the agent interface.
- Workflow Modularity: Configure arbitrarily complex pipelines combining parallel review, expert adjudication, RAG, image analysis, or custom extraction logic.
- Benchmarks and Case Studies: Demonstrated robust generalization across domains, inclusion/exclusion criteria, and multimodal data.
Applications span biomedical systematic reviews, technology horizon scanning, structured data abstraction, consensus scoring, and automated evidence synthesis at scale.
LatteReview systematically advances the automation of evidence synthesis, providing a modular, verifiable, and high-throughput platform for integrating LLM-based agents into rigorous review pipelines while maintaining extensibility and transparency for a range of research applications (Rouzrokh et al., 5 Jan 2025).