
PARROT-Diverse SQL Benchmark

Updated 5 October 2025
  • PARROT-Diverse is a large-scale, dialect-rich evaluation dataset featuring 28,003 SQL query translation pairs from 22 production-grade databases.
  • It assesses LLM performance using metrics for syntax/dialect compatibility and semantic result consistency to capture nuanced translation challenges.
  • The dataset highlights real-world issues in cross-system SQL translation, prompting research in dialect-specific model tuning and robust query segmentation.

PARROT-Diverse is a large-scale, dialect-rich evaluation dataset introduced as part of the PARROT benchmark suite for assessing the capabilities of LLMs and translation tools in cross-system SQL translation, that is, the transformation of queries across heterogeneous database systems. Unlike canonical Text-to-SQL benchmarks, PARROT-Diverse focuses specifically on the subtleties and breadth of SQL dialects encountered in real production and open-source systems. It comprises 28,003 annotated SQL query translation pairs spanning 22 production-grade databases, facilitating extensive syntax-level stress testing and the evaluation of semantic equivalence across system boundaries.

1. Scope and Purpose of PARROT-Diverse

PARROT-Diverse is designed to provide comprehensive coverage across a wide spectrum of SQL dialects, moving beyond the limitations of previous benchmarks that are largely confined to SQLite or generic SQL. Its primary objective is to enable rigorous testing of LLMs’ ability to correctly translate queries between real-world systems (e.g., PostgreSQL, MySQL, Oracle, ClickHouse, DuckDB, SQL Server), each of which exhibits unique syntax rules, built-in functions, error-handling constructs, and type management idiosyncrasies. Coverage is achieved through a combination of mined queries from open-source benchmarks and production services, systematically annotated for dialect and system-specific requirements.

2. Dataset Structure and Content

With 28,003 automatically annotated translation samples, PARROT-Diverse encapsulates queries sourced from a range of real-world and reference benchmarks. Each sample consists of a source query written in the dialect of one database system and its corresponding target query—the translation—in another system’s dialect. Annotation includes explicit dialect markers and metadata to identify system-specific features such as custom functions, keyword usage, error handling, data type conversions, and scoping rules. This construction ensures that participants are evaluated on nuanced syntactic and semantic equivalence, not mere string-level similarity.
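The pairing described above can be sketched as a simple record; the field names here are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical structure of one PARROT-Diverse translation sample.
# Field names and values are illustrative; the real schema may differ.
sample = {
    "source_dialect": "mysql",
    "target_dialect": "postgresql",
    "source_query": "SELECT IFNULL(name, 'n/a') FROM users LIMIT 5",
    "target_query": "SELECT COALESCE(name, 'n/a') FROM users LIMIT 5",
    "metadata": {
        "features": ["null-handling function", "keyword compatibility"],
    },
}

# Evaluation compares executed results rather than strings, so surface
# differences (IFNULL vs. COALESCE) are fine when the outputs agree.
assert sample["source_dialect"] != sample["target_dialect"]
```

Because equivalence is judged at the result level, a translator is free to choose any target-dialect surface form that preserves the source query's semantics.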

Variant          Number of Translations   Number of Systems
PARROT-Diverse   28,003                   22
PARROT-Simple    5,306                    22

The larger scale of PARROT-Diverse makes it suitable for the extensive syntax-robustness testing and edge-case exploration that are critical in practical deployment settings.

3. Evaluation Protocols and Metrics

PARROT-Diverse utilizes two principal accuracy metrics reflecting executable correctness and semantic fidelity:

  • Syntax/Dialect Compatibility Accuracy (Acc_EX):

$$\text{Acc}_{\text{EX}} = \frac{\text{Number of executable translations in the target system}}{\text{Total number of queries}}$$

  • Result Consistency Accuracy (Acc_RES):

$$\text{Acc}_{\text{RES}} = \frac{\text{Number of translations producing identical results to the source query}}{\text{Total number of queries}}$$

These metrics are computed using reference executors and schema normalizers, which validate both the syntactic executability and semantic output equivalence across systems. This approach rewards models not simply for generating well-formed SQL, but for achieving reliable translation of meaning and function.
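The two ratios above are straightforward to compute once each translation has been executed; a minimal sketch, assuming a per-query result structure of our own invention (not the benchmark's actual harness API):

```python
def acc_ex(results):
    """Acc_EX: fraction of translations that execute in the target system.

    `results` is a list of dicts with boolean `executable` and
    `matches_source` flags (an assumed structure for illustration).
    """
    return sum(r["executable"] for r in results) / len(results)


def acc_res(results):
    """Acc_RES: fraction of translations whose executed output is
    identical to the source query's output."""
    return sum(r["matches_source"] for r in results) / len(results)


# Toy run: 3 of 4 translations execute, and 2 of those also
# reproduce the source query's results.
runs = [
    {"executable": True,  "matches_source": True},
    {"executable": True,  "matches_source": True},
    {"executable": True,  "matches_source": False},
    {"executable": False, "matches_source": False},
]
print(acc_ex(runs))   # 0.75
print(acc_res(runs))  # 0.5
```

Note that Acc_RES is bounded above by Acc_EX: a translation that fails to execute cannot produce matching results.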

4. Benchmark Challenges and Dialect-Specific Testing

PARROT-Diverse is specifically constructed to expose major challenges inherent to cross-system SQL translation:

  • Dialect-specific Syntax: Queries often require non-trivial transformation for compatibility. For example, expressions like 1 / col may necessitate rewriting to 1 / NULLIF(col, 0) in systems with stricter type enforcement or division-by-zero handling.
  • Function and Feature Differences: System-specific built-ins (string functions, aggregations, date/time utilities) demand precise translation strategies.
  • Type and Keyword Variations: Even common operations may involve differing type names or reserved keywords, requiring context-aware translation approaches.
  • Nested and Alias Scoping: Complex queries with nested subselects, aliases, and CTEs test model understanding of system rule differences.
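The division-by-zero rewrite mentioned above can be demonstrated directly; this sketch uses SQLite (via Python's standard sqlite3 module) as a stand-in target system, not the benchmark's own executor:

```python
import sqlite3

# NULLIF(col, 0) turns a zero divisor into NULL, so the division
# yields NULL for that row instead of an error or an
# engine-specific value. This is the kind of guarded rewrite a
# translator may need when the target system is stricter about
# division by zero than the source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(4,), (0,)])

rows = conn.execute("SELECT 1.0 / NULLIF(col, 0) FROM t").fetchall()
print(rows)  # [(0.25,), (None,)]
conn.close()
```

A naive string-level translation would miss this: the source expression `1 / col` is syntactically valid in the target dialect too, so only result-level evaluation (Acc_RES) exposes the behavioral divergence.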

The diversity of PARROT-Diverse in query structure, length, and complexity directly motivates the evaluation of models not only on syntax parsing but also on deeper SQL semantics and idiosyncratic feature handling.

5. Performance Analyses and Empirical Findings

Experimental evaluations reported in the paper highlight significant variation in LLM performance across dialects and query complexity. State-of-the-art LLMs such as GPT-4o attained higher accuracy when translating to systems like PostgreSQL (≈58.62%) but lower accuracy for MySQL (≈50%), with overall accuracy on the diverse set averaging below 38.53%. Additionally, model performance tends to degrade on longer, more intricate queries, signaling limitations in both prompt length handling and dialect generalization. This variation across dialects, along with regression on complex structures, demonstrates that current LLMs are far from "solving" cross-system SQL translation and underscores dialect as a principal source of translation errors.

6. Directions for Research and Tool Development

Analysis of results from PARROT-Diverse suggests several priorities for future research:

  • Dialect-Specific Augmentation: Targeted augmentation or fine-tuning for system-specific syntax and features promises to improve uniformity of translation quality.
  • Segment-Based Translation: For lengthy queries, segmenting and translating in parts may mitigate performance losses, although effective reassembly remains an open challenge.
  • Error Detection and Robustness: Improved detection of non-executable or semantically divergent translations is required to further enhance reliability.
  • Leaderboard and Community Resources: The paper includes a public leaderboard and source code to promote benchmarking transparency and stimulate methodological innovations.

A plausible implication is that translation strategies must be robust to both the breadth (number of supported dialects) and the depth (complexity of queries) required in practical cross-system SQL tasks.

7. Significance in the SQL-to-SQL Translation Landscape

PARROT-Diverse represents a material advance in evaluating SQL translation across systems, providing both breadth and nuance that addresses key limitations of prior benchmarks. Its use of executable and result consistency metrics, coupled with large-scale, system-specific coverage, serves as both a stress test and a diagnostic tool for LLM-based systems. The benchmark not only identifies persistent challenges in dialect compatibility and semantic fidelity but also fosters methodological progress by supplying a detailed, publicly accessible evaluation protocol and dataset.
