BlendSQL: Hybrid Query Language for LM-DB Integration
- BlendSQL is a SQL-like declarative language that integrates structured database queries with language model functions to perform hybrid reasoning over tables and documents.
- It employs type inference, constrained decoding, and rule-based optimization for well-typed, efficient execution on backends like SQLite, DuckDB, and PostgreSQL.
- BlendSQL advances hybrid QA by reducing prompt tokens and achieving significant improvements in latency and cost compared to end-to-end LM systems.
Searching arXiv for papers on BlendSQL and closely related work. BlendSQL is a SQL-like declarative language and execution framework for hybrid reasoning over structured database data and unstructured or parametric language-model knowledge. It was introduced as a superset of SQLite for hybrid question answering, with the central premise that multi-hop reasoning over tables and documents can be encoded as a single executable query rather than delegated to opaque end-to-end prompting (Glenn et al., 2024). Subsequent work reframed BlendSQL as a system that compiles to SQL, executes through native database engines, and uses type inference, constrained decoding, and rule-based optimization to make language-model functions well-typed, executable, and economical at database scale (Glenn et al., 24 Sep 2025). In its v0.1.0 form, BlendSQL is presented as an LM-DB system that combines a query optimizer with LM function implementations, including prompts, constrained decoding, and post-processing, and supports SQLite, DuckDB, and PostgreSQL backends (Glenn et al., 30 Jun 2026).
1. Origins and conceptual framing
BlendSQL emerged from work on hybrid question answering over heterogeneous evidence sources such as tables and passages. The 2024 paper argues that many end-to-end systems for HybridQA, OTT-QA, and FEVEROUS reduce to a “prompt-and-pray” pattern in which decomposition is hidden, intermediate reasoning is not inspectable, and the full structured and unstructured context often cannot fit into a prompt because of context-window limitations (Glenn et al., 2024). BlendSQL addresses this by expressing the decomposed reasoning roadmap explicitly as a query in a SQL-like intermediate language.
In that formulation, BlendSQL is not merely a prompting template. It is a relational-algebraic control layer in which ordinary SQL performs deterministic filtering, joining, aggregation, and ordering, while LM-backed operators are invoked only for semantic steps that are difficult to express with native SQL. The system thus combines interpretability and compositionality from database query languages with the semantic flexibility of LLMs (Glenn et al., 2024).
Later work sharpened this DB-centric perspective. The 2025 paper presents BlendSQL as a SQL-like declarative language for hybrid reasoning over structured database data and unstructured or parametric LLM knowledge, but with explicit emphasis on type safety and efficient execution (Glenn et al., 24 Sep 2025). The 2026 systems paper further recasts BlendSQL as an LM-DB framework whose practical viability depends not only on query syntax, but also on optimizer behavior, constrained decoding, and local deployment of quantized open-weight models (Glenn et al., 30 Jun 2026).
A common misconception is to conflate BlendSQL with the OLAP operator BLEND introduced for multigradual analysis in multidimensional databases. The two are unrelated in scope and formalism: BLEND is an OLAP algebra operator for combining hierarchy levels during query-time analysis (Hubert et al., 2010), whereas BlendSQL is a language and execution framework for combining SQL with language-model functions (Glenn et al., 2024).
2. Language design and execution model
The original BlendSQL paper defines the language as a superset of SQLite that orchestrates and optimizes hybrid reasoning across compositional SQL logic and LLMs (Glenn et al., 2024). LM-backed functions, termed “ingredients,” are wrapped in double curly braces, , and may appear recursively inside SQL expressions. This makes BlendSQL fully recursive like vanilla SQL, while allowing semantic operators to be embedded directly inside relational queries.
The 2026 paper describes BlendSQL v0.1.0 more generally as “a query language that compiles to SQL”. Function outputs are produced by prompt templates, optionally type-constrained, integrated into SQL via sqlglot, and the final SQL query is then executed on the backend DBMS (Glenn et al., 30 Jun 2026). This formulation generalizes the earlier SQLite-centric implementation to multiple backends: SQLite, DuckDB, and PostgreSQL (Glenn et al., 30 Jun 2026).
The 2025 paper gives the clearest operational description of execution. A BlendSQL query is parsed into an AST, traversed depth-first in standard SQL order of operations (FROM/JOIN → WHERE → GROUP BY → ...), and LM-based functions are deferred until their required inputs are known (Glenn et al., 24 Sep 2025). Native SQL operators are executed first, while LLM functions are treated as high-cost or “infinite cost” operations in a heuristic planner. When an LLM function is executed, the system collects referenced tables, materializes CTEs if needed, reads or reuses temp tables, invokes the model, transforms the AST with the returned value, syncs the transformed AST back to SQL string form, and executes the final SQL on the DBMS (Glenn et al., 24 Sep 2025).
This execution strategy makes BlendSQL a reasoning language in which the program is also the runtime plan. The 2024 paper operationalizes this with the metaphor of a Parser—the model that generates the BlendSQL query from the question and database context—and a Blender—the model used during execution of ingredient functions (Glenn et al., 2024). A plausible implication is that BlendSQL separates semantic parsing from semantic execution more explicitly than many end-to-end QA pipelines.
3. Core operators and the role of LM functions
Across the literature, BlendSQL emphasizes a small number of generic LM-backed operators that can be composed with ordinary SQL. The 2024 paper presents three built-in ingredients: LLMMap, LLMQA, and LLMJoin (Glenn et al., 2024). The later systems papers shift emphasis toward llmqa and llmmap as a small set of general-purpose polymorphic functions rather than a large catalog of specialized operators (Glenn et al., 30 Jun 2026).
| Operator | Role in the system | Source |
|---|---|---|
LLMMap / llmmap |
Row-wise mapping from one or more column values to new values | (Glenn et al., 2024, Glenn et al., 30 Jun 2026) |
LLMQA / llmqa |
Aggregate-like transformation from a subset of data to a single-cell output | (Glenn et al., 2024, Glenn et al., 30 Jun 2026) |
LLMJoin |
Semantic entity linking or fuzzy alignment between value sets | (Glenn et al., 2024) |
LLMMap is introduced as a unary scalar function analogous to LENGTH or ABS, but powered by an LLM; it maps each value in a target column to a new value (Glenn et al., 2024). In v0.1.0, llmmap is described as a row-wise function that takes a natural-language question and one or more columns, inserts column values into the prompt for each row, and passes the resulting prompt to the model (Glenn et al., 30 Jun 2026). The 2026 paper notes an important change in prompting strategy: pre-v0.1.0 defaulted to batching multiple rows per prompt, whereas v0.1.0 uses single-in, single-out prompting with batch_size=1 by default because the earlier batching strategy hurt accuracy (Glenn et al., 30 Jun 2026).
LLMQA is described in 2024 as an aggregate function that turns a table subset into a single value and can optionally be restricted so that the output must be one of the values from a given column (Glenn et al., 2024). In v0.1.0, llmqa remains an aggregate function, but the paper emphasizes that it can return either a scalar or a collection, enabling use cases such as top- ranking and unstructured-to-structured linking (Glenn et al., 30 Jun 2026).
LLMJoin appears in the original system as an operator for semantic entity linking when proper foreign keys do not exist, such as aligning "joshua fields" with "josh fields (pitcher)" (Glenn et al., 2024). Later descriptions focus more heavily on llmmap and llmqa, which suggests a consolidation toward a smaller operator core, though the papers do not state that semantic joins cease to be relevant.
The examples in the papers show how these operators support entity lookup, nested reasoning over table and text, ranking, fact verification, and unit conversion inside executable SQL-like programs (Glenn et al., 2024, Glenn et al., 24 Sep 2025). This establishes BlendSQL as a hybrid between a declarative query language and an LM orchestration substrate.
4. Type inference, constrained decoding, and well-typedness
A defining development in BlendSQL is the move from loosely prompted LM calls toward type-constrained execution. The 2025 paper identifies the central obstacle as the mismatch between typed declarative query languages and free-form LLM outputs. A semantically correct answer may still fail if it does not match the exact values or type expectations required by SQL execution—for example, "Washington D.C." versus "Washington DC" in an equality predicate (Glenn et al., 24 Sep 2025).
BlendSQL addresses this by inferring the return type of an LM function from its expression context and enforcing that type at decoding time. The paper describes three modes: no type hints, prompt-level type hints, and Type Hints + Constrained Decoding, the last being the main proposal (Glenn et al., 24 Sep 2025). Its procedure is:
- infer the return type of the LLM UDF from the SQL context;
- map that type to a regular expression or grammar used during decoding;
- cast the decoded string to the native type and rewrite the SQL AST accordingly.
The paper explicitly states that, barring user syntax errors, this guarantees that the LLM output results in a query accepted by the SQL type checker (Glenn et al., 24 Sep 2025). It also introduces database-driven constraints: in equality or membership contexts, BlendSQL may infer Literal[...] or List[Literal[...]] types from the actual distinct values in the relevant column, thereby constraining generation directly to database values rather than aligning them post hoc (Glenn et al., 24 Sep 2025).
The sample type rules make the mechanism concrete. Examples include f() = TRUE implying f() \to bool, f() > 40 implying f() \to int, f() BETWEEN 60.1 AND 80.3 implying f() \to float, city = f() implying f() \to Literal[{'Washington DC', 'San Jose'}], and team IN f() implying f() \to List[Literal[{'Red Sox', 'Mets'}]] (Glenn et al., 24 Sep 2025). The mapping from return type to regex or grammar is explicit, for example int → \d+ (Glenn et al., 24 Sep 2025).
The 2026 paper extends this machinery in v0.1.0. All LM functions accept a return_type, which is used to constrain decoding via a context-free grammar; if options are provided, generation can be restricted to that set; and if the return type is a collection, quantifier can impose regex-like cardinality constraints (Glenn et al., 30 Jun 2026). It also introduces an improved type inference system and type-aligned one-shot prompting, in which one example is selected from a pre-defined pool to match the expected return type (Glenn et al., 30 Jun 2026).
This shift is central to how BlendSQL differs from orchestration-heavy alternatives. Rather than generating free-form output and then issuing further LLM calls for alignment or semantic equivalence checking, BlendSQL moves the alignment problem into decoding and typing itself (Glenn et al., 24 Sep 2025). The papers consistently present this as a latency and executability improvement rather than a purely formal refinement.
5. Query optimization and systems engineering
BlendSQL’s later work places strong emphasis on optimization of LM-enhanced relational operators. The 2026 paper describes BlendSQL v0.1.0 as combining two layers: a query optimizer that rewrites and schedules LM calls, and the LM function implementation itself, including prompts, constrained decoding, and post-processing (Glenn et al., 30 Jun 2026).
The optimizer is rule-based and built on the heuristic cost model
which encodes the assumption that vanilla SQL operators are much cheaper than llmqa, which is itself cheaper than llmmap (Glenn et al., 30 Jun 2026). The execution pipeline has three stages: Pre-filtering, Materialization, and LM function execution (Glenn et al., 30 Jun 2026). During pre-filtering, the system removes blocking operators such as ORDER BY and GROUP BY, replaces LM nodes with constant TRUE, and projects only columns needed by LM functions. The TRUE substitution is specifically motivated by disjunctive logic, where expressions such as WHERE f_LM() OR f() would otherwise risk unsafe over-filtering (Glenn et al., 30 Jun 2026).
Three additional optimizations are explicitly defined and evaluated in v0.1.0:
- Cascade filtering: for conjunctive predicates with multiple LM functions, later LM calls only process rows that passed earlier ones, and the optimizer prioritizes
llmqabeforellmmapunder the cost heuristic (Glenn et al., 30 Jun 2026). - Early exiting: with
LIMIT,llmmapcan stop once enough rows are produced, provided no blocking operators require a full scan (Glenn et al., 30 Jun 2026). - Early deduplication of database values: inputs are deduplicated before
llmmap, and outputs are aligned back with aLEFT JOIN, reducing LM calls by the duplication factor and, explicitly, from to (Glenn et al., 30 Jun 2026).
System engineering choices also matter. The 2025 paper emphasizes prefix caching for repeated prompt prefixes in map-style execution (Glenn et al., 24 Sep 2025). The 2026 paper adds support for image and audio datatypes and reports faster aggregation of in-memory LM outputs via polars (Glenn et al., 30 Jun 2026). Open-weight experiments are run locally on an NVIDIA RTX 5080 GPU with 16GB VRAM, 64GB RAM, and AMD Ryzen 5 7600X, using quantized HuggingFace checkpoints and an inference stack including vLLM==0.21.0, llguidance, and blendsql==0.1.26 (Glenn et al., 30 Jun 2026).
A major systems claim in the 2026 paper is that proprietary API-based models are not necessary for effective LM-database integration, because token-based billing, network/API overhead, rate limits, and changing API behavior impose costs, latency, and reproducibility problems (Glenn et al., 30 Jun 2026). Local deployment on a single 16GB GPU changes the economic model from token-based billing to hardware-bounded, time-based cost, so latency improvements translate directly into cost savings (Glenn et al., 30 Jun 2026).
6. Benchmarks, empirical results, limitations, and significance
BlendSQL has been evaluated in several phases, with different benchmarks emphasizing different aspects of the system.
The 2024 paper focuses on hybrid QA benchmarks. On the HybridQA dev set, reported values are 56.68 for end-to-end with retriever, 49.13 for end-to-end without retriever, 52.89 for BlendSQL, and 57.76 for BlendSQL + End-to-End (Glenn et al., 2024). A pure BlendSQL script failed to produce a prediction on 33% of dev examples—1,173 out of 3,466—but on the subset where scripts executed successfully, accuracy reached 64.43% (Glenn et al., 2024). The paper also reports that BlendSQL uses 35% fewer prompt tokens on average for HybridQA context filtering, and the conclusion summarizes the HybridQA result as 45% fewer prompt tokens when improving a naive end-to-end system by 8.63% (Glenn et al., 2024). On OTT-QA dev, BlendSQL reports 34.15, compared with 28.1 for FR+CBR, 33.2 for CARP, and 49.0 for CORE (Glenn et al., 2024). On FEVEROUS dev, reported values are 83.21 for End-to-End, 65.7 for BlendSQL, and 68.03 for BlendSQL (Entire table) (Glenn et al., 2024).
The 2025 paper evaluates efficiency and expressivity on 60 questions from TAG-Bench, whose tables have an average size of 53,631 rows, and evaluates question answering on the first 1,000 examples of the HybridQA validation set (Glenn et al., 24 Sep 2025). Its headline claims are a 7% accuracy improvement and 53% latency improvement (Glenn et al., 24 Sep 2025). The strongest reported accuracy lift comes when using the 3B execution model with a 70B parsing model, where denotation accuracy improves by 6.6 points after applying type constraints (Glenn et al., 24 Sep 2025). On TAG-Bench, latency drops from 1.7 seconds for LOTUS with Llama-3.1-8b-Instruct.Q4 on 1 RTX 5080 to 0.76 seconds for BlendSQL on the same model and hardware, corresponding to a 53% reduction in latency (Glenn et al., 24 Sep 2025).
The 2026 paper evaluates BlendSQL v0.1.0 on SemBench, an extension of TPC-H and TPC-DS with five scenarios—Movie, Wildlife, E-Commerce, Cars, and MMQA—covering text, image, and audio modalities (Glenn et al., 30 Jun 2026). It reports F1 score for retrieval queries, Spearman’s rank correlation for ranking queries, and absolute error for aggregation queries, all normalized to and reported as quality (Glenn et al., 30 Jun 2026). In a comparison of system coverage and supported features, BlendSQL is the only system reported as supporting EE, CF, CD, and ED, with 100% SemBench coverage across 55 questions (Glenn et al., 30 Jun 2026).
The headline result is that quantized open-weight models running locally on 16GB of VRAM can match or exceed closed-source counterparts at lower latency and a fraction of the price (Glenn et al., 30 Jun 2026). Specifically, BlendSQL with Gemma 4 E4B achieves a query-level win-or-tie rate of 57% across all scenarios relative to the closed Gemini 2.5 Flash setting (Glenn et al., 30 Jun 2026). It outperforms the average SemBench system in mmqa: 0.85 vs. 0.70 and wildlife: 0.54 vs. 0.51, and nearly matches the average system on movie: 0.72 vs. 0.74 (Glenn et al., 30 Jun 2026). The paper also reports a modality gap: open-weight Gemma 4 E4B slightly outperforms Gemini 2.5 Flash on text, but trails on image and especially audio, with a gap of 0.03 for images and 0.10 for audio (Glenn et al., 30 Jun 2026).
Latency and cost results are similarly concrete. The abstract and conclusion report a 3.8x reduction in latency and 390x reduction in overall costs compared to a proprietary LM API (Glenn et al., 30 Jun 2026). The main comparison specifies a cost reduction from 00.58 (Glenn et al., 30 Jun 2026). On the text-only movie scenario, latency drops from 17.2s to 8.3s when concurrency increases from 8 to 128; early exiting reduces average latency in the movie scenario from 24.3s to 8.8s; cascade filtering saves 142 seconds on Q14 in the ecomm scenario; and early deduplication saves 40 seconds in the Movie scenario (Glenn et al., 30 Jun 2026). At scale factor 4,000 in ecomm, ThalamusDB peaks at 268GB, Palimpzest at 252GB, LOTUS at 161GB, and BlendSQL at 18GB (Glenn et al., 30 Jun 2026).
These results are explicitly conditional. The 2026 paper states that the 390x cost reduction and 3.8x latency reduction depend on local deployment on a single 16GB GPU, Gemma 4 E4B as the main model, max concurrency = 64 for the overall results table, five runs averaged, and SemBench scenarios and scale factors matching the benchmark setup (Glenn et al., 30 Jun 2026). They are therefore not generic claims about every model or every database workload.
Limitations remain a recurring theme across the literature. The 2024 paper notes parser errors, blender-centric errors, and especially difficulty in doc → table reasoning, while also acknowledging reliance on GPT-4 and the absence of open-source LLM experiments (Glenn et al., 2024). The 2025 paper reports that generated programs from Llama-70B had errors on 102 out of 1000 samples, including empty LLMQA context, generic SQLite syntax errors, BlendSQL column reference errors, hallucinated columns, tokenization errors, hallucinated tables, and f-string syntax issues (Glenn et al., 24 Sep 2025). It further observes that CFG guidance can reduce syntax errors, but syntax reduction does not always correlate with better downstream accuracy, and for the largest model may even hurt performance (Glenn et al., 24 Sep 2025). The 2026 paper identifies remaining modality gaps for image and audio despite strong text performance (Glenn et al., 30 Jun 2026).
Taken together, the papers position BlendSQL as a declarative LM-DB interface whose distinctive contributions are explicit reasoning programs, selective LM invocation inside SQL execution, context-sensitive type inference, constrained decoding against database values, and increasingly sophisticated optimizer support. This suggests a broader systems lesson: many expensive post-processing and orchestration steps in hybrid LLM+DB workflows can be replaced by typed, query-planned, and locally deployable execution without abandoning the compositional structure of relational queries (Glenn et al., 24 Sep 2025, Glenn et al., 30 Jun 2026).