- The paper introduces DBGorilla, a benchmark for assessing function calling that translates natural language queries into structured database operations.
- It compares eight LLMs using Exact Match, AST alignment, and preference rankings across simple to complex query tasks.
- Ablation studies show the tool definition is robust across configurations, while a component-level breakdown reveals that LLMs struggle with text property filters, highlighting areas for improvement.
Querying Databases with Function Calling
This paper (arXiv 2502.00032) introduces DBGorilla, a benchmark for evaluating how well LLMs query databases using Function Calling. It presents a tool definition that unifies data access with search queries and result transformations, and evaluates the effectiveness of this approach across multiple LLMs. The study also includes ablation studies assessing the impact of various experimental factors on performance.
DBGorilla Benchmark
The DBGorilla benchmark adapts the Gorilla LLM framework to create synthetic search database schemas and queries. It consists of five use cases, each with three related collections and four properties per collection. This benchmark tests the ability of LLMs to translate natural language commands into query APIs, covering search queries, property filters, aggregations, and grouping operations.
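The paper's exact schema is not reproduced here, but a unified query tool of this kind is typically expressed as a single function definition whose optional arguments cover search, filtering, aggregation, and grouping. Below is a minimal sketch in the common function-calling schema format; all parameter names are hypothetical illustrations, not the paper's actual definition.

```python
# Sketch of a unified "query database" tool definition.
# Parameter names (collection_name, search_query, filter_*, etc.)
# are assumptions for illustration; the paper's schema may differ.
query_tool = {
    "type": "function",
    "function": {
        "name": "query_database",
        "description": "Query a database collection with optional search, "
                       "filters, aggregation, and grouping.",
        "parameters": {
            "type": "object",
            "properties": {
                "collection_name": {
                    "type": "string",
                    "description": "Target collection to query.",
                },
                "search_query": {
                    "type": "string",
                    "description": "Optional semantic search string.",
                },
                "filter_property": {
                    "type": "string",
                    "description": "Property to filter on (text, int, or boolean).",
                },
                "filter_operator": {
                    "type": "string",
                    "enum": ["=", "!=", "<", "<=", ">", ">="],
                },
                "filter_value": {
                    "type": ["string", "number", "boolean"],
                },
                "aggregation": {
                    "type": "string",
                    "enum": ["COUNT", "SUM", "AVG", "MIN", "MAX"],
                    "description": "Optional result aggregation.",
                },
                "group_by": {
                    "type": "string",
                    "description": "Optional property to group results by.",
                },
            },
            "required": ["collection_name"],
        },
    },
}
```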
Experimental Setup
The experimental setup compares eight LLMs from five model families: Claude 3.5 Sonnet, GPT-4o, GPT-4o mini, Gemini 1.5 Pro, Gemini 2.0 Flash experimental, Command R+, Command R7B, and Llama 3.1 8B Instruct. The primary evaluation metric is Exact Match, which assesses whether the predicted query is identical to the ground-truth query. The study also reports Abstract Syntax Tree (AST) alignment scores and LLM-as-Judge preference rankings.
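The distinction between the two automatic metrics is that Exact Match is all-or-nothing while AST alignment gives partial credit for partially correct calls. A minimal sketch of how the two might be computed over predicted and gold calls follows; the dictionary layout and the uniform per-argument weighting are assumptions, and the paper's scoring may weight components differently.

```python
def exact_match(pred: dict, gold: dict) -> bool:
    # Strict comparison: the predicted call must reproduce the gold
    # function name and every argument exactly.
    return pred["name"] == gold["name"] and pred["args"] == gold["args"]

def ast_score(pred: dict, gold: dict) -> float:
    # Partial-credit variant: the fraction of gold arguments the
    # prediction reproduces, zero if the wrong function was called.
    if pred["name"] != gold["name"]:
        return 0.0
    matched = sum(1 for k, v in gold["args"].items()
                  if pred["args"].get(k) == v)
    return matched / max(len(gold["args"]), 1)
```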
Figure 1: DBGorilla Leaderboard results (last updated January 1st, 2025). The Exact Match and AST Score columns report averages across all tested queries. Query scores are further separated into "Simple", "Moderate", and "Complex" categories according to how many arguments the ground-truth function call uses: 1, 2, and 3 or more, respectively. Collection Routing reports the percentage of predicted queries routed to the correct database collection.
Results and Analysis
Claude 3.5 Sonnet achieves the highest performance with an Exact Match score of 74.3%, followed by GPT-4o mini at 73.7%, GPT-4o at 71.8%, and Gemini 1.5 Pro at 70.2%. Breaking the results down by API component reveals that LLMs are highly effective at applying operators to boolean-valued properties but struggle with text property filters. The study also visualizes performance across the synthetic use cases, showing robust results for higher-performing models like GPT-4o but significant performance variance for lower-performing models.
Figure 2: An illustration of a natural language command, "How many menu items are priced under 20?", translated to Function Calling arguments for database querying.
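In a function-calling format, that command would translate to something like the call below. The argument names follow the hypothetical schema sketched earlier, and the collection name is assumed for illustration.

```python
# Hypothetical translation of "How many menu items are priced under 20?"
# into arguments for the query_database tool sketched above.
call = {
    "name": "query_database",
    "args": {
        "collection_name": "MenuItems",  # assumed collection name
        "filter_property": "price",
        "filter_operator": "<",
        "filter_value": 20,
        "aggregation": "COUNT",
    },
}
```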
Ablation Studies
The paper includes ablation studies exploring the impact of parallel tool calling, adding a rationale argument to the tool call, using a separate tool per database collection, and tool calling with structured outputs. The results show minimal performance variance across these ablation experiments with GPT-4o, indicating that the tool definition is robust across different configurations.
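For instance, the rationale ablation amounts to one extra schema field that the model must populate before the operational arguments. A minimal sketch, extending the hypothetical query_tool definition from earlier (the field name and placement are assumptions, not the paper's exact schema):

```python
# Sketch of the "rationale as an argument" ablation: the tool schema
# gains a free-text reasoning field, required ahead of the query args.
query_tool["function"]["parameters"]["properties"]["rationale"] = {
    "type": "string",
    "description": "Step-by-step reasoning for why this query answers "
                   "the user's request.",
}
query_tool["function"]["parameters"]["required"].insert(0, "rationale")
```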
Figure 3: An illustration of the Function Calling loop. Beginning with the user's input prompt, the LLM enters a loop in which it either calls one or more functions or returns a response to the user. If a function is called, it is executed, its response is sent back to the LLM, and the Function Calling loop continues.
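The loop in Figure 3 can be sketched generically in a few lines of Python. The client interface and the execute_tool helper below are assumptions for illustration, not a specific vendor's API.

```python
def run_function_calling_loop(client, messages, tools, execute_tool):
    # Generic Function Calling loop: the model either returns tool calls,
    # which we execute and feed back, or a final text answer for the user.
    while True:
        response = client.chat(messages=messages, tools=tools)  # assumed client API
        if not response.tool_calls:
            return response.content  # no tool call: final answer
        messages.append(response.message)
        for call in response.tool_calls:
            result = execute_tool(call.name, call.arguments)  # e.g. run the DB query
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
```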
Implications and Future Directions
The findings suggest that Function Calling is an effective interface for enabling LLMs to query databases. The DBGorilla benchmark provides a valuable resource for future research in this area, with potential expansions to include more complex schemas, diverse property distributions, and explicit relationships between collections. Future work could also explore the integration of other tools, such as web search and data visualization, to create more sophisticated Compound AI Systems. Techniques such as Reflexion prompting [15] and DSPy Assertions could be used to refine queries based on validation or user feedback.
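As a rough illustration of that refinement idea, a generated query can be retried with the validator's error message appended as feedback, in the spirit of Reflexion. The generate and validate_query helpers below are assumptions, not APIs from the paper or those libraries.

```python
def refine_query(generate, validate_query, prompt, max_retries=3):
    # Reflexion-style refinement sketch: regenerate the query with the
    # validator's feedback until it passes or retries are exhausted.
    feedback = ""
    query = generate(prompt)
    for _ in range(max_retries):
        ok, error = validate_query(query)
        if ok:
            return query
        feedback = f"\nPrevious attempt failed validation: {error}"
        query = generate(prompt + feedback)
    return query
```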
Conclusion
This paper demonstrates the effectiveness of Function Calling for enabling natural language database access. The DBGorilla benchmark and the comprehensive evaluation of multiple LLMs provide valuable insights into the capabilities and limitations of current models in translating natural language to structured database operations. The research contributes to the development of more intuitive and efficient interfaces for querying databases with LLMs.