LMQL: Constrained Language Model Querying
- LMQL is a programming paradigm that blends natural language, imperative scripting, and declarative output constraints to orchestrate reliable, semantically controlled LLM responses.
- It enforces syntactic and semantic constraints via logical predicates and optimized control flow, thereby reducing invalid outputs and redundant model invocations.
- LMQL supports multi-step workflows, interactive tool use, and hybrid database querying, yielding significant inference cost savings and improved accuracy.
The LLM Query Language (LMQL) is a high-level programming paradigm designed to orchestrate and constrain the interaction with LLMs. LMQL combines natural language prompts, imperative scripting, and declarative output constraints into a unified, backend-agnostic query language. This design enables efficient, expressive, and semantically controlled workflows over LMs, supporting advanced prompt engineering, structured information extraction, interactive tool use, and hybrid database–LLM querying. LMQL delivers substantial reductions in inference costs and enhances accuracy on downstream tasks by minimizing invalid or redundant model invocations through eager decoding constraints and optimized control flow (Beurer-Kellner et al., 2022).
1. Motivation and Conceptual Framework
Traditional prompt engineering involves crafting natural language instructions or in-context examples, but it lacks explicit output control, programmatic task orchestration, and cost-effective execution. Token-by-token decoding is typically opaque and offers little granularity for constraining outputs (e.g., valid JSON or dates), and complex workflows (few-shot demonstrations, external tool calls, chain-of-thought) remain brittle and ad hoc, relying on model-specific scripting or manual intervention.
LMQL is grounded in the concept of LLM Programming (LMP). LMP generalizes prompting from static text to a programming paradigm that interleaves textual prompts, host-language (Python) control flow, and declarative output constraints. This abstraction enables:
- Enforced semantic and syntactic constraints on generated text via logical predicates.
- Model-agnostic, reusable programmatic workflows.
- Efficient multi-step reasoning, tool use, and interaction with deterministic external APIs.
- Reduction in total LLM API calls (up to 80%), often yielding 26–85% cost savings without accuracy loss.
The scripting model bridges traditional text prompting and the rigor of programmatic data workflows, akin to relational databases or functional query languages, but for LLM outputs.
2. Language and Execution Model
An LMQL program consists of five main elements: a decoder clause (sampling strategy), a prompt script (with embedded “holes”), an explicit model designation (FROM), optional logical constraints (WHERE), and an optional output distribution clause (DISTRIBUTION).
Syntax and Core Constructs
- Prompt blocks: Free-form Python code interleaved with string literals. String literals may contain holes, denoted by [VAR] (to be filled by the LM) and {VAR} (text substitution of a bound variable).
- Control flow: Host-language constructs (loops, conditionals, function calls) are natively supported.
- Queries: Each [VAR] in a string triggers an LLM call scoped by the cumulative input and output constraints.
- Decoders: Support for argmax, beam search (beam(n=...)), or sampling (sample(n=...)).
- Model selection: Explicit “from” clause identifying the backend (e.g., "EleutherAI/gpt-j-6B").
Example: Joint constrained generation
```
beam(n=3)
   "Generate a travel packing list:\n"
   items = []
   for i in range(2):
      "- [THING]\n"
      items.append(THING)
   "Main item: [MAIN]."
from
   "EleutherAI/gpt-j-6B"
where
   THING in ["passport", "phone", "keys", …] and
   len(words(THING)) <= 2 and
   stops_at(MAIN, ".")
```
Constraint Specification
LMQL supports constraints in the WHERE clause expressed as logical predicates:
- Membership: VAR in ["A", "B", ...]
- Length constraints: len(words(VAR)) < k, len(VAR) < m
- Output delimiters: stops_at(VAR, ".")
- Calls to pure Python functions (e.g., regular expressions)
Built-in predicates support eager, token-level enforcement; user-defined predicates may require falling back to backtracking.
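A minimal sketch of how these predicates compose in a query (hypothetical prompt and variable names; only constructs listed above are used):

```
argmax
   "Name a European capital: [CITY]\n"
   "Why visit it? [REASON]"
from
   "EleutherAI/gpt-j-6B"
where
   CITY in ["Paris", "Berlin", "Madrid", "Rome"] and
   len(words(REASON)) < 20 and
   stops_at(REASON, ".")
```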
3. Constraint Semantics and Decoding
LMQL employs a constraint-based decoding architecture that integrates output predicates into the token generation loop, applying them at each step to determine the validity of continuations.
- Eager masking: For each decoded prefix, LMQL computes a binary mask $m \in \{0,1\}^{|V|}$ over the vocabulary, setting $m_t = 0$ for any token $t$ that would induce a constraint violation, based on the WHERE clause.
- Masked distribution: The adjusted logits yield a masked probability distribution $\tilde{p}(t \mid x) = \frac{m_t \, p(t \mid x)}{\sum_{t'} m_{t'} \, p(t' \mid x)}$, from which the decoder selects the next token.
- Final vs. follow semantics: The runtime annotates predicates with finality status and computes follow maps to detect when a token will force all future continuations to fail the constraints, permitting early termination (Brzozowski-style soundness).
This formalism ensures a rigorous, tractable method for producing only valid outputs under specified constraints, reducing computational waste.
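To make the eager-masking idea concrete, here is a minimal Python sketch for a single membership constraint (illustrative only; names are assumptions, not the LMQL runtime API):

```
# Sketch: keep only tokens that can still extend the partial value into a
# member of the allowed set; all other vocabulary entries are masked out.
def membership_mask(vocab, partial_value, allowed_values):
    mask = []
    for token in vocab:
        candidate = partial_value + token
        valid = any(v.startswith(candidate) for v in allowed_values)
        mask.append(1 if valid else 0)
    return mask

vocab = ["pass", "port", "phone", "keys", "zebra"]
print(membership_mask(vocab, "pass", ["passport", "phone", "keys"]))
# -> [0, 1, 0, 0, 0]: only "port" can still complete "pass" into "passport"
```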
4. Inference Architecture and Optimization
LMQL’s runtime parses the query into a custom AST and constructs a computation graph covering both prompt execution and constraints. At each step, the decoding loop augments backend LLM generation with three operations (sketched after the list):
- Constraint mask computation.
- Logit adjustment via mask application.
- Token selection via argmax, sampling, or beam logic.
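A condensed Python sketch of one such decoding step, assuming a compute_mask helper like the one above and a logits(prefix) callable from some backend (both are placeholders, not the actual LMQL interface):

```
import math
import random

def constrained_step(prefix, vocab, logits, compute_mask, decoder="argmax"):
    """One decoding step: mask invalid tokens, renormalize, then select."""
    mask = compute_mask(vocab, prefix)                  # 1 = still satisfiable
    scores = logits(prefix)                             # raw backend logits
    masked = [s if m else -math.inf for s, m in zip(scores, mask)]
    # Softmax over the masked logits gives the renormalized distribution.
    z = sum(math.exp(s) for s in masked if s != -math.inf)
    probs = [math.exp(s) / z if s != -math.inf else 0.0 for s in masked]
    if decoder == "argmax":
        idx = max(range(len(vocab)), key=lambda i: probs[i])
    else:  # "sample"
        idx = random.choices(range(len(vocab)), weights=probs, k=1)[0]
    return vocab[idx]
```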
Key optimizations include:
- Batching: Multiple samples or beams are decoded in lockstep, batched per generation step.
- Caching: Pure deterministic computations of constraints or external tool results are cached between beams and data points.
- Speculative decoding: Mask calculation may be pipelined or interleaved with GPU compute to minimize host-side latency.
- Cost savings: Let $c_{\text{base}}$ denote the baseline (chunk-wise) token count and $c_{\text{LMQL}}$ the token count under LMQL. The empirical cost savings are $1 - c_{\text{LMQL}} / c_{\text{base}}$.
Observed savings are 26–85% across diverse tasks.
5. Workflows and Empirical Performance
LMQL enables a variety of advanced LLM workflows:
- Interactive QA with Tools (ReAct style):
```
import wiki_utils

sample(n=1)
   "Q: {QUESTION}\n"
   while True:
      "Thought: [THOUGHT]\n"
      "Action: [ACT]\n"
      if ACT == "Search":
         result = wiki_utils.search(THOUGHT)
         "Observation: {result}\n"
         continue
      elif ACT == "Finish":
         break
from
   "gpt2-xl"
where
   THOUGHT.endswith(".") and ACT in ["Search", "Finish"]
```
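The wiki_utils module above is ordinary Python resolved outside the model; a hypothetical stand-in (name and behavior assumed here, not part of LMQL) could wrap the public MediaWiki search API:

```
# Hypothetical implementation of wiki_utils.search, assumed for this sketch:
# a thin wrapper around the public MediaWiki search API. Any deterministic
# external tool could be substituted.
import requests

def search(query: str) -> str:
    """Return the text snippet of the top Wikipedia search hit for `query`."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search",
                "srsearch": query, "format": "json"},
        timeout=10,
    )
    hits = resp.json().get("query", {}).get("search", [])
    return hits[0]["snippet"] if hits else "No result found."
```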
- Structured extraction with hard constraints: forcing the LLM output to match a strict format, e.g., an ISO date regex (see the sketch below).
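A sketch of such a query (hypothetical prompt; assumes the WHERE clause may call plain Python functions such as re.fullmatch, as noted in Section 2):

```
import re

argmax
   "Extract the publication date as an ISO date (YYYY-MM-DD): [DATE]\n"
from
   "EleutherAI/gpt-j-6B"
where
   stops_at(DATE, "\n") and
   re.fullmatch(r"\d{4}-\d{2}-\d{2}", DATE.strip()) is not None
```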
Empirical results, excerpted from three case studies (Beurer-Kellner et al., 2022):
| Task | Model/API Calls Reduction | Token Reduction | Accuracy Change |
|---|---|---|---|
| Chain-of-Thought (Odd/Date) | –25% to –30% | –27% to –31% | 0 to +1 pp (selective) |
| ReAct QA (HotpotQA subset) | –80% decoder, –36.7% queries | –76% | Maintained |
| Arithmetic (GSM8K+Calc util) | –85.7% decoder, –66% queries | –85% | Maintained |
This demonstrates that LMQL reduces computation and API billing, while output constraints can raise or maintain accuracy via template conformance.
6. LMQL in Database-Oriented and Hybrid Querying
The use of LMQL-inspired paradigms extends to database querying over LLMs. Systems such as Galois (Saeed et al., 2023) describe a “DB-first” approach, turning SQL query plans into physical operator sequences that map directly to LLM-prompt invocations:
- LLMScan: Retrieve keys (e.g., entity identifiers) from the LLM by prompt-based enumeration.
- LLMFilter: Apply predicate filtering on attributes via prompts per key.
- LLMFetch: On-demand attribute fetching for missing fields, aligning LLM outputs to relation schemas.
This pipeline turns physical database operators into small, schema-constrained prompt templates, preserving SQL semantics while leveraging LLM “knowledge.” Early results show Galois reaching up to 80% cell-level accuracy on selection-only queries, paired with a cost model for prompt-based execution. A plausible implication is that as LLM factuality and context length improve, SQL–LLM hybrids will become powerful general-purpose query interfaces.
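A minimal Python sketch of an LLMFilter-style operator (function and parameter names are assumptions for illustration, not the Galois implementation):

```
# Sketch: a physical filter operator that asks the LLM one schema-constrained
# yes/no question per key and keeps the keys for which the predicate holds.
def llm_filter(keys, attribute, predicate, llm_call):
    kept = []
    for key in keys:
        prompt = (
            "Answer strictly Yes or No.\n"
            f"Does the {attribute} of {key} satisfy the condition: {predicate}?\n"
            "Answer:"
        )
        answer = llm_call(prompt).strip().lower()
        if answer.startswith("yes"):
            kept.append(key)
    return kept

# Example (hypothetical backend): keep countries whose population exceeds 10 million.
# llm_filter(["France", "Monaco"], "population", "greater than 10 million", llm_call)
```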
7. Scope, Limitations, and Research Directions
LMQL is most effective for:
- Multi-step workflows, including chain-of-thought, structured extraction, and tool invocation.
- Tasks with strict output format requirements.
- Scenarios with pay-per-token or high-inference-cost LMs.
Plain prompting suffices for one-shot or simple completions without downstream constraints.
Outstanding research challenges include:
- Enrichment of constraint languages (e.g., grammar-based or automata-based constraints).
- Extension to further model backends, standardizing LMQL as a “next-token logits + mask” interface.
- Integration with probabilistic programming for sampling over constrained priors.
- Optimized prompt planning, cost-aware decomposition, and automated provenance tracking.
- Schema-less, composable querying, and techniques for robust, cross-model answer stability.
Ongoing empirical and architectural improvements target verification, coverage, and fairness in hybrid database–LMQL systems (Beurer-Kellner et al., 2022, Saeed et al., 2023).
LMQL operationalizes “prompt programming”: elevating the practice from manual string editing to high-level, semantically controlled, and optimized programmatic querying of LLMs. Across both natural language and structured data domains, it enables expressive, efficient, and safer harnessing of the capabilities of large-scale LMs.