Valid Efficiency Score (VES) Evaluation

Updated 14 July 2025

VES is a composite metric that measures both the correctness and efficiency of SQL queries generated from natural language inputs.
It evaluates queries using syntactic similarity and execution performance metrics, focusing on speed and resource utilization.
VES informs feedback loops in LLM fine-tuning, ensuring outputs are both factually accurate and computationally efficient.

Valid Efficiency Score (VES) is a metric used to evaluate both the correctness and efficiency of machine-generated outputs, particularly in the context of Text-to-SQL systems where natural language queries are mapped to executable SQL statements. In contemporary research, VES has gained prominence as a practical measurement for the performance of LLMs in synthesizing SQL queries, complementing traditional metrics such as execution accuracy by incorporating considerations of execution speed and resource utilization.

1. Definition and Context

The Valid Efficiency Score (VES) is defined as a composite metric that measures not only the factual or semantic correctness of a generated SQL query but also its efficiency when executed. In Text-to-SQL evaluation, VES quantifies whether a model-generated SQL statement both retrieves the correct data (matching ground truth or reference answers) and does so in an efficient manner, typically regarding execution time and database resource consumption. VES is distinct from execution accuracy (EX), which solely measures correctness, by its explicit consideration of efficiency-related criteria (2410.01869).

2. VES in Text-to-SQL Model Evaluation

In advanced Text-to-SQL experiments, VES is used alongside EX to provide a more holistic picture of model performance. After an LLM generates a SQL query in response to a natural language prompt, two principal evaluation steps occur:

- Correctness Assessment: The generated SQL is compared to the ground truth SQL using a combination of syntactic similarity (such as normalized Levenshtein distance) and semantic answer matching (comparing result types or content).
- Efficiency Measurement: The computational performance of the generated SQL is evaluated, with particular attention to:
  - Actual execution time on the target database.
  - Resource consumption (such as number of operations, memory usage, or planner cost).
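The two evaluation steps above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the procedure of the cited work: the helper names `run_timed` and `evaluate`, the order-insensitive set comparison of result rows, and the choice of the best wall-clock time over a few repeats are all assumptions made for the example.

```python
import sqlite3
import time


def run_timed(conn, sql, repeats=5):
    """Run `sql` several times; return (rows, best wall-clock seconds).

    Hypothetical helper: taking the best of several runs is one simple way
    to reduce timing noise, not a prescribed method.
    """
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return rows, best


def evaluate(conn, predicted_sql, gold_sql):
    """Return (is_correct, predicted_seconds, gold_seconds)."""
    gold_rows, gold_time = run_timed(conn, gold_sql)
    try:
        pred_rows, pred_time = run_timed(conn, predicted_sql)
    except sqlite3.Error:
        # A query that fails to execute is counted as incorrect, with no timing.
        return False, None, gold_time
    # Assumption: results match if they contain the same set of rows,
    # regardless of row order or duplicates.
    is_correct = set(pred_rows) == set(gold_rows)
    return is_correct, pred_time, gold_time


# Toy database for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE t(id INTEGER, name TEXT);"
    "INSERT INTO t VALUES (1, 'a'), (2, 'b');"
)
gold = "SELECT name FROM t WHERE id = 1"
predicted = "SELECT name FROM t WHERE id IN (SELECT id FROM t WHERE id = 1)"
is_correct, pred_seconds, gold_seconds = evaluate(conn, predicted, gold)
```

Here the predicted query returns the same row as the reference but takes a more roundabout route, which is the kind of case an efficiency-aware metric is meant to distinguish from a direct query.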

A high VES reflects not only that the generated SQL yields correct results but also that it does so efficiently. This joint perspective is crucial in real-world database applications, where unnecessarily complex or slow queries may be impractical or even infeasible to deploy in production environments (2410.01869).

3. Calculation Methodologies

While no single VES formula is universally standardized, recent approaches, such as those validated on the BIRD benchmark, suggest that VES incorporates the following principles:

Correctness component: Quantified using similarity metrics between the generated SQL ($\hat{Y}$) and the ground-truth SQL ($Y$), such as normalized Levenshtein distance:

$f_{nc}(\hat{Y}, Y) = 1 - \frac{L(f_{norm}(\hat{Y}), f_{norm}(Y))}{\max(|f_{norm}(\hat{Y})|, |f_{norm}(Y)|)}$

with $L$ being the Levenshtein distance and $f_{norm}$ a string normalization function.
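A minimal Python sketch of this correctness component follows. The normalization function is not specified in the text, so `normalize` here (lowercasing, collapsing whitespace, dropping a trailing semicolon) is an assumed stand-in for $f_{norm}$; the function names are illustrative.

```python
def normalize(sql: str) -> str:
    """Assumed stand-in for f_norm: lowercase, collapse whitespace,
    and drop a trailing semicolon."""
    return " ".join(sql.lower().split()).rstrip(";").strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance using two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(predicted: str, gold: str) -> float:
    """f_nc = 1 - L(norm(pred), norm(gold)) / max(len(norm(pred)), len(norm(gold)))."""
    p, g = normalize(predicted), normalize(gold)
    longest = max(len(p), len(g))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(p, g) / longest
```

The score is 1.0 for queries that are identical after normalization and falls toward 0.0 as more character edits are needed. Because it is purely textual, two semantically equivalent queries written differently can still score low, which is why it is paired with answer matching.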

Efficiency component: Although not always detailed with an explicit formula, efficiency is generally understood to reflect query execution time and optimality in resource usage.

VES may be computed through a multi-step pipeline where only those queries that are both valid (correct) and efficiently executable contribute positively to the metric. Empirical studies show that prompt engineering aimed solely at EX can sometimes improve factuality while reducing VES, due to longer or more complex SQL outputs (2410.01869).
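One published formulation, from the BIRD benchmark, averages over all $N$ examples an indicator of result correctness multiplied by a relative-efficiency term, the square root of the ratio of reference to predicted execution time. The sketch below follows that shape; whether the cited work uses exactly this form is not stated here, and the record layout `(is_correct, predicted_seconds, gold_seconds)` is an assumption for the example.

```python
import math


def ves(records):
    """Aggregate a BIRD-style VES over evaluation records.

    records: iterable of (is_correct, predicted_seconds, gold_seconds).
    Incorrect or non-executable predictions contribute zero; correct ones
    contribute sqrt(gold_seconds / predicted_seconds).
    """
    records = list(records)
    if not records:
        return 0.0
    total = 0.0
    for is_correct, pred_time, gold_time in records:
        if is_correct and pred_time is not None and pred_time > 0:
            total += math.sqrt(gold_time / pred_time)
    return total / len(records)
```

Under this form, a correct query that runs as fast as the reference contributes 1, a correct query four times slower contributes 0.5, and an incorrect query contributes 0 regardless of speed. This makes concrete how a prompt change can raise EX while lowering VES: more queries become correct, but each runs slower relative to its reference.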

4. VES in Prompt Feedback Loops and Model Fine-Tuning

Recent research integrates VES-optimized feedback into LLM fine-tuning pipelines for Text-to-SQL. Here, after each query generation and evaluation, the VES (alongside EX) informs an automated feedback mechanism. If a query receives a low VES because of slow execution or high complexity, additional steps are inserted into future prompt instructions, focusing on simplification, optimization, or review. This adaptive strategy enables large models like GPT-4 or T5 to learn not only to produce correct outputs but also to prioritize efficiency through repeated, data-driven prompt refinement cycles (2410.01869).
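The feedback step described above can be illustrated with a small rule-based sketch. Everything in it is hypothetical: the wording of the two prompt steps, the function name `refine_prompt`, and the efficiency threshold of 0.8 are placeholders chosen for the example, not values from the cited work.

```python
# Hypothetical instruction texts that a feedback loop might add to a prompt.
SIMPLIFY_STEP = (
    "Prefer the simplest query that answers the question; "
    "avoid unnecessary subqueries and joins."
)
REVIEW_STEP = (
    "Re-check the query against the schema and the question before answering."
)


def refine_prompt(steps, is_correct, efficiency, efficiency_threshold=0.8):
    """Return an updated list of prompt steps after one evaluation.

    steps: current list of extra instruction strings.
    is_correct: whether the generated SQL returned the expected result.
    efficiency: per-query efficiency score (higher is better).
    efficiency_threshold: assumed cutoff below which a correct query
        is considered too slow.
    """
    updated = list(steps)  # do not mutate the caller's list
    if not is_correct:
        if REVIEW_STEP not in updated:
            updated.append(REVIEW_STEP)
    elif efficiency < efficiency_threshold:
        if SIMPLIFY_STEP not in updated:
            updated.append(SIMPLIFY_STEP)
    return updated
```

A correct but slow query adds the simplification step, an incorrect query adds the review step, and a correct, efficient query leaves the prompt unchanged. A real pipeline would likely choose or generate these steps in a more data-driven way.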

Metric	Full Name	Key Factors	Typical Impact
EX	Execution Accuracy	Syntactic/semantic correctness	Higher with richer prompts; may increase query complexity
VES	Valid Efficiency Score	Correctness + execution efficiency	May decrease with prompt complexity; balances correctness against efficiency

5. Comparative Results and Practical Implications

Empirical findings on standard benchmarks show that models optimized with feedback based on SQL quality measurement, including VES, achieve competitive or superior performance compared with state-of-the-art baselines. For example, GPT-4 models using integrated knowledge graphs and human-designed step-by-step prompts have demonstrated exceptionally high EX rates. However, more complex prompt refinements, while boosting correctness, can sometimes lead to a modest reduction in VES, reflecting the additional execution time or resource usage incurred by more elaborate SQL (2410.01869).

A plausible implication is that optimizing LLMs for both correctness and efficiency requires careful balancing of prompt complexity, model capacity, and feedback mechanisms. The VES metric enables researchers and practitioners to make informed trade-offs between the robustness (factuality) and deployability (efficiency) of AI-generated database interactions.

Although currently established in Text-to-SQL and program synthesis, the concept underlying VES is broadly applicable wherever automated agents generate code or queries that must be both correct and efficient. Potential extensions include API call synthesis, automated software engineering, or workflow generation for data processing pipelines.

VES should be considered alongside complementary measures such as execution accuracy and robustness: high EX ensures factual alignment with specifications, while VES is necessary for ensuring practical, operational viability of generated outputs in production databases and enterprise environments.

6. Summary and Outlook

The Valid Efficiency Score (VES) is a multidimensional metric increasingly employed in the evaluation of natural language-to-SQL and related AI-based code generation tasks. Its adoption reflects the maturing expectations of real-world applications, where not only accuracy but also runtime efficiency and system resource consumption are critical. The continued development and refinement of VES-aware feedback loops and prompt strategies is expected to further augment the reliability and applicability of next-generation LLM-powered systems, ensuring that model outputs remain both semantically valid and operationally efficient (2410.01869).


References

Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement (2024). arXiv:2410.01869.