
Valid Efficiency Score (VES) Evaluation

Updated 14 July 2025
  • VES is a composite metric that measures both the correctness and efficiency of SQL queries generated from natural language inputs.
  • It evaluates queries using syntactic similarity and execution performance metrics, focusing on speed and resource utilization.
  • VES informs feedback loops in LLM fine-tuning, ensuring outputs are both factually accurate and computationally efficient.

Valid Efficiency Score (VES) is a metric used to evaluate both the correctness and efficiency of machine-generated outputs, particularly in the context of Text-to-SQL systems where natural language queries are mapped to executable SQL statements. In contemporary research, VES has gained prominence as a practical measurement for the performance of LLMs in synthesizing SQL queries, complementing traditional metrics such as execution accuracy by incorporating considerations of execution speed and resource utilization.

1. Definition and Context

The Valid Efficiency Score (VES) is defined as a composite metric that measures not only the factual or semantic correctness of a generated SQL query but also its efficiency when executed. In Text-to-SQL evaluation, VES quantifies whether a model-generated SQL statement both retrieves the correct data (matching ground truth or reference answers) and does so in an efficient manner, typically regarding execution time and database resource consumption. VES is distinct from execution accuracy (EX), which solely measures correctness, by its explicit consideration of efficiency-related criteria (Sarker et al., 2 Oct 2024).

2. VES in Text-to-SQL Model Evaluation

In advanced Text-to-SQL experiments, VES is used alongside EX to provide a more holistic picture of model performance. After an LLM generates a SQL query in response to a natural language prompt, two principal evaluation steps occur:

  1. Correctness Assessment: The generated SQL is compared to the ground truth SQL using a combination of syntactic similarity (such as normalized Levenshtein distance) and semantic answer matching (comparing result types or content).

  2. Efficiency Measurement: The computational performance of the generated SQL is evaluated, with particular attention to:
    • Actual execution time on the target database.
    • Resource consumption (such as number of operations, memory usage, or planner cost).

A high VES reflects not only that the generated SQL yields correct results but also that it does so efficiently. This joint perspective is crucial in real-world database applications, where unnecessarily complex or slow queries may be impractical or even infeasible to deploy in production environments (Sarker et al., 2 Oct 2024).
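The two evaluation steps above can be sketched as follows. This is a minimal illustration, not the procedure of the cited paper: it assumes an in-memory SQLite database, treats two queries as equivalent when they return the same multiset of rows, and uses wall-clock time as the only efficiency signal. The names `run_timed` and `evaluate_query` and the `orders` table are hypothetical.

```python
import sqlite3
import time
from collections import Counter


def run_timed(conn, sql):
    """Execute sql and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


def evaluate_query(conn, predicted_sql, reference_sql):
    """Return (is_correct, predicted_seconds, reference_seconds).

    Assumption: correctness means the predicted query returns the same
    rows as the reference query, ignoring row order.
    """
    ref_rows, ref_time = run_timed(conn, reference_sql)
    try:
        pred_rows, pred_time = run_timed(conn, predicted_sql)
    except sqlite3.Error:
        # An invalid query counts as incorrect and has no timing.
        return False, None, ref_time
    is_correct = Counter(pred_rows) == Counter(ref_rows)
    return is_correct, pred_time, ref_time


# Hypothetical toy database used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,), (7.25,)]
)

ok, pred_t, ref_t = evaluate_query(
    conn,
    "SELECT id FROM orders WHERE amount > 8 ORDER BY id",
    "SELECT id FROM orders WHERE amount > 8",
)
```

A single timed run is noisy; in practice one would likely repeat each query several times and aggregate the timings before comparing them.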

3. Calculation Methodologies

While the formula for VES is not fully standardized across studies, recent approaches, such as those validated on the BIRD benchmark, suggest that VES incorporates the following principles:

  • Correctness component: Quantified using similarity metrics between the generated SQL (𝑌̂) and the ground-truth SQL (Y), such as normalized Levenshtein distance:

$$f_{nc}(\hat{Y}, Y) = 1 - \frac{L(f_{norm}(\hat{Y}), f_{norm}(Y))}{\max(|f_{norm}(\hat{Y})|, |f_{norm}(Y)|)}$$

with $L$ being the Levenshtein distance and $f_{norm}$ a string normalization function.
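The similarity above can be computed directly. The sketch below is one possible reading: the source does not specify the normalization function, so `normalize_sql` (lowercasing, collapsing whitespace, dropping a trailing semicolon) is an assumption made for illustration.

```python
import re


def normalize_sql(sql):
    """Hypothetical f_norm: trim, drop a trailing ';', collapse spaces, lowercase."""
    return re.sub(r"\s+", " ", sql.strip().rstrip(";")).strip().lower()


def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_similarity(pred, ref):
    """1 - L(f_norm(pred), f_norm(ref)) / max(|f_norm(pred)|, |f_norm(ref)|)."""
    p, r = normalize_sql(pred), normalize_sql(ref)
    longest = max(len(p), len(r))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(p, r) / longest
```

Under this normalization, `normalized_similarity("SELECT  a FROM t;", "select a from t")` evaluates to 1.0, since the two strings differ only in case, spacing, and the trailing semicolon.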

  • Efficiency component: Although not always detailed with an explicit formula, efficiency is generally understood to reflect query execution time and optimality in resource usage.

VES may be computed through a multi-step pipeline where only those queries that are both valid (correct) and efficiently executable contribute positively to the metric. Empirical studies show that prompt engineering aimed solely at EX can sometimes improve factuality while reducing VES, due to longer or more complex SQL outputs (Sarker et al., 2 Oct 2024).
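The pipeline described above, in which only correct queries contribute and faster queries contribute more, might be aggregated as in the sketch below. The weighting is an assumption: it rewards each correct query by the square root of the reference-to-predicted execution time ratio, a relative-time form associated with the BIRD benchmark, and it is not taken from the cited paper. The function name and input layout are hypothetical.

```python
import math


def valid_efficiency_score(results):
    """Average efficiency reward over all examples.

    results: list of (is_correct, predicted_seconds, reference_seconds).
    Assumption: incorrect or untimed queries contribute 0; correct queries
    contribute sqrt(reference_seconds / predicted_seconds).
    """
    if not results:
        return 0.0
    total = 0.0
    for is_correct, pred_t, ref_t in results:
        if is_correct and pred_t and ref_t:
            total += math.sqrt(ref_t / pred_t)
    return total / len(results)


# One correct query as fast as the reference, one incorrect query.
score = valid_efficiency_score([(True, 1.0, 1.0), (False, 0.5, 1.0)])
```

With this weighting, a correct query that runs four times slower than the reference contributes 0.5 rather than 1, which matches the observation that longer or more complex SQL can lower VES even when EX is unchanged.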

4. VES in Prompt Feedback Loops and Model Fine-Tuning

Recent research integrates VES-optimized feedback into LLM fine-tuning pipelines for Text-to-SQL. Here, after each query generation and evaluation, the VES (alongside EX) informs an automated feedback mechanism. If a query receives a low VES—due to slow execution or high complexity—additional steps are inserted into future prompt instructions, focusing on simplification, optimization, or review. This adaptive strategy enables large models like GPT-4 or T5 to learn not only to produce correct outputs but also to prioritize efficiency through repeated, data-driven prompt refinement cycles (Sarker et al., 2 Oct 2024).

| Metric | Assesses | Key Factors | Typical Impact |
|---|---|---|---|
| EX | Execution Accuracy | Syntactic/semantic correctness | Higher with richer prompts, may increase code complexity |
| VES | Valid Efficiency Score | Correctness + Execution efficiency | May decrease with prompt complexity; balances correctness |
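The feedback mechanism described in this section can be illustrated with a small prompt-refinement step. Everything here is hypothetical: the hint texts, the `ves_threshold` default of 0.8, and the function name are assumptions chosen for the example, and the cited work may structure its feedback differently.

```python
# Hypothetical instructions appended to the prompt after a poor evaluation.
EFFICIENCY_HINTS = (
    "Prefer a single query over nested subqueries where possible.",
    "Select only the columns the question asks for.",
)
CORRECTNESS_HINT = "Re-check table and column names against the schema."


def refine_prompt(prompt, is_correct, ves, ves_threshold=0.8):
    """Return the prompt, extended with hints chosen from EX and VES feedback.

    Incorrect output gets a correctness hint; correct but slow output
    (ves below ves_threshold) gets efficiency hints. Hints already present
    in the prompt are not added again.
    """
    extra = []
    if not is_correct:
        extra.append(CORRECTNESS_HINT)
    elif ves < ves_threshold:
        extra.extend(EFFICIENCY_HINTS)
    new_lines = [hint for hint in extra if hint not in prompt]
    if not new_lines:
        return prompt
    return prompt + "\n" + "\n".join(new_lines)


base = "Translate the question into one SQL query."
slow_but_correct = refine_prompt(base, is_correct=True, ves=0.5)
```

Skipping hints that are already present keeps the prompt from growing on every cycle, which matters given that prompt complexity itself may lower VES.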

5. Comparative Results and Practical Implications

Empirical findings on standard benchmarks show that models optimized with feedback on SQL quality, as measured by VES, achieve competitive or superior performance when compared to state-of-the-art baselines. For example, GPT-4 models using integrated knowledge graphs and human-designed step-by-step prompts have demonstrated exceptionally high EX rates. However, inclusion of more complex prompt refinements, while boosting correctness, can sometimes lead to a modest reduction in VES, reflecting the additional execution time or resource usage incurred by more elaborate SQL (Sarker et al., 2 Oct 2024).

A plausible implication is that optimizing LLMs for both correctness and efficiency requires careful balancing of prompt complexity, model capacity, and feedback mechanisms. The VES metric enables researchers and practitioners to make informed trade-offs between the robustness (factuality) and deployability (efficiency) of AI-generated database interactions.

Although currently established in Text-to-SQL and program synthesis, the concept underlying VES is broadly applicable wherever automated agents generate code or queries that must be both correct and efficient. Potential extensions include API call synthesis, automated software engineering, or workflow generation for data processing pipelines.

VES should be considered alongside complementary measures such as execution accuracy and robustness: high EX ensures factual alignment with specifications, while VES is necessary for ensuring practical, operational viability of generated outputs in production databases and enterprise environments.

7. Summary and Outlook

The Valid Efficiency Score (VES) is a multidimensional metric increasingly employed in the evaluation of natural language-to-SQL and related AI-based code generation tasks. Its adoption reflects the maturing expectations of real-world applications, where not only accuracy but also runtime efficiency and system resource consumption are critical. The continued development and refinement of VES-aware feedback loops and prompt strategies is expected to further augment the reliability and applicability of next-generation LLM-powered systems, ensuring that model outputs remain both semantically valid and operationally efficient (Sarker et al., 2 Oct 2024).
