
DBGorilla: Benchmarking LLM Function Calling for DB Queries

Updated 14 July 2025
  • DBGorilla is a synthetic benchmark that evaluates LLM performance in converting natural language into structured JSON database queries with operations like search, filtering, aggregation, and groupby.
  • It employs synthetic schema generation across multiple business domains to systematically test various query operator combinations and response accuracy.
  • The dataset’s evaluation metrics, including Exact Match and AST scoring, deliver actionable insights on LLM strengths and challenges in natural language database integration.

The DBGorilla dataset is a synthetic benchmark specifically developed to evaluate the capabilities of LLMs in querying databases through Function Calling. It provides a controlled environment for analyzing how LLMs convert natural language inputs into structured, JSON-based database query APIs that feature search, filtering, aggregation, and grouping operations. Introduced as part of a broader paper on tool-based integrations between LLMs and external systems, DBGorilla leverages synthetic data generation to enable fine-grained, systematic evaluation across a diverse set of business domains and query operators.

1. Motivation and Context

Function Calling has emerged as a principal method for integrating LLMs with external tools, allowing language-driven agents to invoke structured APIs rather than generating raw, executable code. While this approach is gaining traction in areas like web automation and tool-based reasoning, its application to database querying—especially in a manner that circumvents raw SQL generation—remains underexplored. DBGorilla addresses this need by providing a benchmark that reflects both the structural complexity and operational diversity encountered when parsing natural language database requests into formal query parameters (2502.00032).

2. Dataset Construction and Schema

The DBGorilla dataset is constructed by adapting the Gorilla LLM framework’s synthetic data generation capabilities for the unique demands of database querying. Its construction process involves:

  • Schema Generation: Synthetic database schemas are generated using GPT-4o and structured generation techniques. Each use case is mapped to a business domain (such as a restaurant management system), and comprises three interrelated collections (e.g., Restaurants, Menus, Reservations).
  • Collection Properties: Every collection contains four properties: two text (one as the searchable field), one numeric, and one boolean property.
  • Domain and Query Combinations: Five business domains are provided, each with its own schema. For each use case, 63 unique combinations of API operators are instantiated across the axes of search, filtering, aggregation, and groupby, yielding 315 natural language queries in total.

This synthetic but realistic design enables comprehensive coverage of the principal operations expected in real-world database APIs.
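The stated totals (63 combinations per domain, 315 queries overall) are consistent with one plausible decomposition: six independent on/off query options, with every non-empty assignment counted. This decomposition is an assumption for illustration; the source states only the totals.

```python
from itertools import product

# Hypothetical reading of the combinatorics: six independent query options.
# The specific option names below are illustrative, not taken from the paper.
options = ["search", "text_filter", "numeric_filter", "boolean_filter",
           "aggregation", "groupby"]

# Every non-empty on/off assignment of the six options.
combos = [c for c in product([0, 1], repeat=len(options)) if any(c)]

domains = 5
print(len(combos))            # 63 operator combinations per domain
print(domains * len(combos))  # 315 natural language queries in total
```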

3. Unified Query API and Tool Definition

Central to DBGorilla’s methodology is a unified, JSON-based tool definition for database queries. Rather than requiring LLMs to output dialect-specific SQL, the tool expects the following arguments:

  • collection_name (required): The collection to query
  • search_query (optional): Free-text search index query
  • filters (optional): On numeric, boolean, or text properties
  • aggregations (optional): Standard functions such as COUNT and MEAN
  • groupby (optional): Properties for result grouping
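To make the argument shapes concrete, the following is a hedged sketch of the structured call an LLM might emit for a sample restaurant-domain query. All specifics here (the query text, property names, and the exact shape of the aggregation entry) are illustrative assumptions, not the benchmark's exact schema.

```python
# Hypothetical arguments an LLM might produce for the query
# "What is the average rating of Italian restaurants?"
# Collection, property, and aggregation names are illustrative only.
predicted_call = {
    "collection_name": "Restaurants",                             # required
    "search_query": "Italian",                                    # free-text search
    "aggregations": [{"property": "rating", "function": "MEAN"}], # optional
}

# Only collection_name is required; an absent key means the operator is unused.
assert "filters" not in predicted_call
```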

The schema is illustrated in the following Python code listing:

query_database_tool = {
    "type": "function",
    "function": {
        "name": "query_database",
        "description": (
            "Query a database with an optional search query or optional "
            "filters or aggregations on the results. IMPORTANT! Please be "
            "mindful of the available query APIs you can use such as search "
            "queries, filters, aggregations, and groupby! Available "
            "collections in this database:\n{collections_description}"
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "collection_name": {
                    "type": "string",
                    "description": "The collection to query.",
                    "enum": collections_list,
                },
                "search_query": {
                    "type": "string",
                    "description": "A search query to return objects from a search index.",
                },
                # ... additional optional parameters for filters, aggregations, and groupby
            },
            "required": ["collection_name"],
        },
    },
}

This flexible definition enables modular evaluation of LLM performance across different combinations of API elements and isolates the challenge of mapping user intent to structured parameters.
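To illustrate the semantics of such a call, the following is a minimal toy sketch of how a receiving backend might interpret the structured arguments against an in-memory collection. This is not the benchmark's implementation (DBGorilla scores the call itself rather than executing it); the data, filter shape, and aggregation shape are assumptions for illustration.

```python
from statistics import mean

# Toy in-memory "database"; records and properties are invented examples.
DB = {
    "Restaurants": [
        {"name": "Trattoria Roma", "rating": 4.5, "open_now": True},
        {"name": "Sushi Place", "rating": 4.0, "open_now": False},
    ]
}

def query_database(collection_name, search_query=None, filters=None,
                   aggregations=None, groupby=None):
    rows = DB[collection_name]
    if search_query:
        # Toy stand-in for a search index: case-insensitive substring match.
        rows = [r for r in rows if search_query.lower() in r["name"].lower()]
    for f in filters or []:
        # Assumed filter shape: {"property": ..., "operator": "=", "value": ...}
        rows = [r for r in rows if r[f["property"]] == f["value"]]
    if aggregations:
        # Assumed aggregation shape: [{"property": ..., "function": ...}]
        agg = aggregations[0]
        fns = {"COUNT": len,
               "MEAN": lambda rs: mean(r[agg["property"]] for r in rs)}
        return fns[agg["function"]](rows)
    return rows

# Mean rating of restaurants that are currently open.
print(query_database(
    "Restaurants",
    filters=[{"property": "open_now", "operator": "=", "value": True}],
    aggregations=[{"property": "rating", "function": "MEAN"}],
))  # 4.5
```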

4. Evaluation Protocols and Metrics

DBGorilla employs a multi-faceted evaluation framework to objectively assess LLM performance:

  • Exact Match (EM): A Boolean measure of whether the full predicted query matches the annotated API call.
  • Abstract Syntax Tree (AST) Scoring: Structured comparison with component-level granularity, weighted as 40% for correct collection, 15% each for search, filters, aggregations, and groupby.
  • LLM-as-Judge Preference Ranking: An independent LLM evaluates outputs for overall response quality, encompassing clarity and technical correctness.
  • Collection Routing Accuracy: The percentage of queries successfully routed to the correct collection.
  • Tool Selection Rate: The rate at which the LLM calls the function when, and only when, the input warrants it.
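The weighted AST comparison can be sketched as follows, using the component weights stated above (40% collection, 15% each for search, filters, aggregations, and groupby). The strict per-component equality used here is a simplifying assumption; the benchmark's actual component comparison is likely more fine-grained.

```python
# Component weights as described for AST scoring.
WEIGHTS = {"collection_name": 0.40, "search_query": 0.15,
           "filters": 0.15, "aggregations": 0.15, "groupby": 0.15}

def ast_score(predicted: dict, gold: dict) -> float:
    """Sum the weights of components that match (both absent counts as a match)."""
    return sum(w for key, w in WEIGHTS.items()
               if predicted.get(key) == gold.get(key))

gold = {"collection_name": "Restaurants", "search_query": "Italian"}
pred = {"collection_name": "Restaurants", "search_query": "pizza"}

# Correct collection (0.40) plus three components absent in both (3 x 0.15);
# the search query is wrong, so its 0.15 is lost.
print(round(ast_score(pred, gold), 2))  # 0.85
```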

The following table summarizes the Exact Match results for the top-performing LLMs:

Model               Exact Match (%)
Claude 3.5 Sonnet   74.3
GPT-4o mini         73.7
GPT-4o              71.8
Gemini 1.5 Pro      70.2

5. Principal Findings

Experiments with DBGorilla reveal several key phenomena in LLM database querying via Function Calling:

  • High Structural Fidelity: The best-performing models not only demonstrate high Exact Match scores but also consistently achieve AST scores above 0.95, indicating robust structural understanding.
  • Operator-specific Performance: LLMs excel at boolean property filters (∼87.5% accuracy), but manifest consistent challenges disambiguating text property filters from search queries.
  • Routing Proficiency: Top-tier models consistently achieve collection routing accuracies above 96%, whereas lower-ranked models sometimes display lower EM scores despite reasonable routing, showing that correct collection identification and full API formulation are separable competencies.
  • Ablation Study Insights: Changes to the function calling pipeline—such as adding rationale arguments, permitting parallel tool calls, splitting tool definitions by collection, or requiring alternative structured outputs—had limited impact on overall performance (variation within 1–2% EM), indicating robustness of the design to engineering iteration.

6. Implications for LLM-Database Integration

DBGorilla provides a uniquely controlled foundation for analyzing LLM behavior under varying schema, operator, and function interface conditions. The results demonstrate:

  • Effectiveness of Function Calling: Function Calling is shown to be a highly effective mechanism for translating natural language commands into executable query APIs without dependence on SQL syntax.
  • Guidance for System Builders: The benchmark provides actionable data on the strengths and weaknesses of leading LLMs for practical database query tasks, particularly with respect to schema adaptability and cost/performance trade-offs.
  • Potential for Future AI Systems: The observed robustness to interface design changes suggests that Function Calling pipelines may be readily adapted to more complex compound AI architectures or non-relational query modalities (such as SPARQL or emerging research query languages), supporting broader integration efforts.

7. Conclusion and Future Directions

DBGorilla sets a new standard for the evaluation of natural-language-to-database function interfaces, offering comprehensive structured data, a robust pipeline for generating new schemas and queries, and an array of rigorous, standardized metrics. The findings from experiments with the dataset catalyze further research on natural language interfaces to databases and inform the engineering of robust LLM-based database agents. The open-sourced code and detailed reporting of experimental results provide a strong foundation for reproducible benchmarking and continued innovation in the space of LLM-enhanced database tooling (2502.00032).
