CORGI Benchmark for BI Tasks

Updated 9 October 2025
  • CORGI Benchmark is a comprehensive text-to-SQL evaluation framework that simulates complex business intelligence tasks with causal reasoning, forecasting, and prescriptive queries.
  • It employs synthetic databases modeled on real-world platforms like Doordash, Airbnb, and Lululemon to recreate authentic BI scenarios with high schema complexity.
  • Empirical results show a roughly 21% drop in execution success rate relative to the BIRD benchmark, underscoring current LLM limitations in advanced business reasoning.

The CORGI Benchmark is a comprehensive text-to-SQL evaluation framework developed to assess the capabilities of LLMs on complex business intelligence (BI) tasks. Unlike prior benchmarks that emphasize factual retrieval, CORGI poses multifaceted challenges inspired by real-world enterprises, requiring not only data retrieval but also causal reasoning, temporal forecasting, and prescriptive recommendation, with responses judged by a multi-agent evaluation protocol. These requirements reflect the operational and decision-making demands of modern business analytics and management consulting.

1. Benchmark Design and Objectives

CORGI is structured to probe the upper bounds of LLMs in the context of nuanced business queries. The primary objective is to evaluate whether models can progress beyond simple historical data access to tackle explanatory, predictive, and strategic queries. This shift aims to reflect authentic BI workflows, where practitioners must diagnose trends, forecast outcomes, and design actionable plans based on SQL-driven data access.

CORGI distinguishes itself from existing benchmarks such as BIRD by both its domain focus and question complexity. Whereas BIRD largely covers generic entity and record retrieval, CORGI leverages synthetic databases populated to mimic enterprises such as Doordash, Airbnb, or Lululemon. Each enterprise simulation incorporates latent business logic, including user demographics and seasonal changes, resulting in more realistic data distributions and higher information density per schema (average of 26 tables per database versus 7.3 in BIRD). Benchmark analysis shows an execution success rate (SER) reduction of approximately 21% compared to BIRD, quantifying the increased reasoning difficulty.

2. Construction of Synthetic Databases and Query Taxonomy

The benchmark provides a variety of synthetic databases, each configured to capture the structure and dynamics of major real-world platforms. Schema design draws on archetypes from diverse verticals: food delivery (Doordash), accommodation (Airbnb), e-commerce (Lululemon, Shopify), freelancing (Upwork), app distribution (Google Play), vehicle rental (Turo), luxury consignment (The RealReal), analytics platforms, personalized products (Persona Nutrition), and others.

Database population is rule-based, simulating business events according to operational protocols, latent feature distributions, and temporal trends. This ensures queries can meaningfully require reasoning about causality, periodicity, and strategic alternatives.
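
To make this concrete, the Python sketch below shows how a rule-based generator with latent weekly and holiday seasonality might populate a single daily fact table. The column layout, base rate, and boost factors are illustrative assumptions, not CORGI's actual generation rules.

```python
import random
from datetime import date, timedelta

random.seed(0)

def expected_orders(day: date, base_rate: float = 200.0) -> float:
    """Expected daily order volume with latent weekly and holiday patterns."""
    weekend_boost = 1.3 if day.weekday() >= 5 else 1.0    # Sat/Sun demand spike
    holiday_boost = 1.2 if day.month in (11, 12) else 1.0  # Q4 seasonal uplift
    return base_rate * weekend_boost * holiday_boost

# Populate one year of a simple daily fact table with noisy realizations.
rows = []
start = date(2025, 1, 1)
for offset in range(365):
    day = start + timedelta(days=offset)
    count = max(int(random.gauss(expected_orders(day), 15.0)), 0)
    rows.append((day.isoformat(), count))

print(rows[:3])
```

Because the seasonal and weekly structure is planted deliberately, benchmark queries about trends and periodicity have a recoverable ground truth rather than arbitrary noise.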

CORGI divides queries into four principal types:

| Query Type | Business Reasoning Requirement | Example |
|---|---|---|
| Descriptive | Record aggregation, basic retrieval | What were the total Labubu sales on September 1, 2025? |
| Explanatory | Trend analysis, causal inference | Why did Pop Mart offline store revenue in NYC decline over the past 90 days? |
| Predictive | Temporal forecasting, model-based inference | What are the expected Labubu sales at the NA online flagship store next month? |
| Recommendational | Strategic planning, actionable synthesis | How can the Labubu market in Europe be expanded next year? |

Descriptive queries are typically straightforward aggregations. Explanatory queries require the model to propose plausible factors (competition, supply chain issues, etc.). Predictive queries necessitate statistical or regression-based modeling to forecast future metrics. Recommendational queries demand multi-stage plans and explicit operational logic.
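
As an illustration of the simplest tier, a descriptive query reduces to a filtered aggregation. The sketch below runs one against an assumed single-table SQLite schema; CORGI's actual schemas are far larger (26 tables on average) and the table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, sale_date TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('Labubu', '2025-09-01', 129.0)")

# Descriptive tier: a filtered aggregation answering
# "What were the total Labubu sales on September 1, 2025?"
descriptive_sql = """
    SELECT SUM(amount) AS total_sales
    FROM sales
    WHERE product = 'Labubu' AND sale_date = '2025-09-01'
"""
print(conn.execute(descriptive_sql).fetchone())  # -> (129.0,)
```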

3. Evaluation Mechanisms and Relative Difficulty

CORGI adopts a multi-agent evaluation protocol. Model-generated SQL and the corresponding business answers are assessed by a discriminator and a panel of seven scoring agents, which judge responses along multiple dimensions, including structure, insightfulness, operational implementability, and compliance. The final score for an answer is calculated as:

\mathrm{FinalScore} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Dimension\_Score}_i

where n is the number of evaluation dimensions. The evaluation framework is released publicly, with submissions managed via dedicated websites and repositories (e.g., txt2sql.com).
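
A minimal Python sketch of this averaging step, with hypothetical dimension names standing in for the rubric applied by the seven scoring agents:

```python
def final_score(dimension_scores: dict[str, float]) -> float:
    """Average the per-dimension scores, as in the FinalScore formula above."""
    return sum(dimension_scores.values()) / len(dimension_scores)

# Hypothetical scores on four of the judged dimensions.
scores = {
    "structure": 4.0,
    "insightfulness": 3.5,
    "operational_implementability": 3.0,
    "compliance": 4.5,
}
print(final_score(scores))  # 3.75
```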

Empirical results indicate that CORGI’s complexity significantly lowers LLM success rates, especially for explanatory, predictive, and recommendational queries. This is quantified by the roughly 21% lower SER relative to BIRD, with the largest deficits in high-level reasoning tasks such as generating accurate predictions and strategic recommendations. Even models that excel at code generation and factual retrieval show marked performance drop-offs, highlighting insufficient causal and agentic reasoning capabilities.

4. Causal Reasoning and Strategic Recommendation Capabilities

CORGI deliberately incorporates query types that demand higher-order reasoning. For explanatory questions, models must identify and justify causal drivers of metric changes (using evidence from data features, time windows, or cross-entity interactions). Predictive questions call on statistical modeling or time series approaches, e.g., regression-based forecasting:

y_{t+1} = \alpha y_t + \beta_1 x_1 + \ldots + \beta_k x_k + \varepsilon

where y_t is the target metric, the x_i are explanatory features, and ε is modeled noise.
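
The NumPy sketch below fits a model of this form by least squares on synthetic data. The single lag, two made-up features, and added intercept are assumptions for illustration, not CORGI's prescribed forecasting method.

```python
import numpy as np

rng = np.random.default_rng(0)
y = 100 + np.cumsum(rng.normal(1.0, 0.5, size=60))       # synthetic monthly metric
X = np.column_stack([y[:-1], rng.normal(size=(59, 2))])  # lagged y plus two features

# Least-squares fit of y_{t+1} = alpha * y_t + beta_1 x_1 + beta_2 x_2 + c
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y[1:], rcond=None)

# Forecast the next period from the latest lag and feature values.
next_inputs = np.concatenate([[y[-1]], rng.normal(size=2), [1.0]])
print(f"next-month forecast: {next_inputs @ coef:.2f}")
```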

Recommendational queries require that models construct multi-step, often stage-organized operational plans. Answers are scored not just on factual accuracy but also on structure, feasibility, and compliance with business logic. Models must synthesize evidence, enumerate alternative strategies, and provide reasoned justification for the preferred course of action, reflecting management consulting processes.

5. Implications for LLM Development and Research Directions

The documented performance gap on higher-level CORGI queries identifies a critical area in current LLM capabilities: insufficient agentic and business reasoning, even in advanced architectures. CORGI illustrates that improvements in code synthesis and record-oriented retrieval do not guarantee progress in analytic, explanatory, or prescriptive BI tasks.

The open dataset and evaluation framework furnish a platform for rapid benchmarking and comparative research. Researchers and developers can leverage CORGI to stress-test new LLM architectures, prompt engineering, or fine-tuning regimes with a focus on operational analytics. The benchmark’s documentation elucidates both scenario design and evaluation protocol, and its open submission process invites community engagement and extension.

A plausible implication is that future LLM systems aspiring to business intelligence applications will need to incorporate explicit causal modeling, strategic synthesis modules, and domain-specific plan generation capabilities beyond conventional text-to-SQL translation.

6. Community Resources and Benchmark Extension

CORGI’s public release includes the complete suite of synthetic databases, query sets, documentation on business scenario simulation, and the multi-agent evaluation system. The organizing team provides tooling to reproduce all experiments and invites submissions for leaderboard tracking and analysis.

This benchmark is positioned to advance both the empirical rigor and methodological innovation in applied text-to-SQL systems for business intelligence. Its structure invites further extensions to additional verticals, query modalities, and evaluation dimensions, creating a robust foundation for future BI-oriented NLP research.
