Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

Published 8 Oct 2025 in cs.CL | (2510.07309v2)

Abstract: In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions, struggling to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between popular LLMs and the need for real-world business intelligence. We release a public dataset and evaluation framework, and a website for public submissions.

Abstract PDF Upgrade to Chat

Summary

The paper presents CORGI, a novel benchmark tailored for business intelligence that evaluates LLMs on descriptive, explanatory, predictive, and recommendational queries.
It implements a multi-agent evaluation framework with a discriminator and seven scoring agents to assess SQL query performance across multiple dimensions.
The study reveals that current LLMs struggle with complex BI tasks, with SQL execution success rates significantly lower than those on simpler text-to-SQL benchmarks.

Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

Introduction and Motivation

Text-to-SQL systems, which translate natural language inputs into SQL queries, have emerged as critical tools for bridging the gap between human language and structured data access, especially in data-driven industries like business intelligence (BI). Despite the rapid advancements in LLMs, existing benchmarks predominantly focus on data retrieval capabilities, overlooking complex real-world BI scenarios that demand causal reasoning, temporal forecasting, and strategic recommendations. The CORGI benchmark addresses this gap by introducing a realistic evaluation framework for business applications, highlighting the challenges and opportunities for leveraging LLMs in these contexts.

CORGI Benchmark Design

The CORGI benchmark is a meticulously crafted evaluation framework comprising synthetic databases inspired by real-world enterprises such as Doordash, Airbnb, and Lululemon. It focuses on four key categories of business queries: descriptive, explanatory, predictive, and recommendational. Each category is designed to incrementally challenge LLM capabilities, moving from straightforward data extraction to intricate strategic planning and forecasting, thus simulating multi-step decision-making processes typical in business scenarios.

Figure 1: We incorporate business logic into our synthesized database population process, and propose a multi-agent evaluation framework consisting of a discriminator agent and seven scoring agents.

Evaluation Framework

The proposed evaluation framework utilizes a multi-agent, business literature-inspired approach. A discriminator agent first selects relevant scoring metrics, which are then processed by seven specialized scoring agents. These agents evaluate responses on dimensions such as structure, SQL execution success rate (SER), data sense, insightfulness, operational implementability, purpose alignment, and compliance, thereby ensuring a comprehensive assessment of LLM-generated outputs.

Experimental Findings

Preliminary experiments with leading LLMs reveal their struggle on complex explanatory, predictive, and recommendational tasks. The average SQL SER for high-level queries like type 2, type 3, and type 4 is reported at 67.8%, 57.4%, and 74.5%, respectively, significantly lower than simpler text-to-SQL benchmarks such as BIRD with an 88.0% success rate.

Figure 2: Comparison between our proposed atomized multi-agent evaluation mechanism and single LLM evaluation. S: Structure, D: Data Sense, I: Insightfulness, O: Operation Implementability, P: Purpose Alignment, C: Compliance. The dashed lines indicate overall average scores.

Implications and Future Directions

The CORGI benchmark underscores the gap between current LLM capabilities and the requirements of profound BI applications. LLMs exhibit substantial declines in performance when tasked with higher complexity business queries, highlighting open challenges for research into models that can incorporate strategic alignment and judgment under uncertainty.

As business applications increasingly depend on automating decision-making and strategy formulation, future research should explore extending LLM capabilities to better simulate real-world consultancy tasks. Enhancements might include developing models that integrate external knowledge bases, refine causal reasoning abilities, and apply domain-specific fine-tuning to address the nuances in complex BI scenarios.

Conclusion

The introduction of the CORGI benchmark represents a significant step towards evaluating and improving LLMs in the dynamic and demanding field of business intelligence. By aligning benchmark design with real-world business objectives and challenges, CORGI seeks to encourage advancements that will enable AI systems to deliver actionable insights, strategic recommendations, and foresight, thus bridging the gap between academic advances in text-to-SQL research and practical applications in business consulting.