
LMQL: Efficient Language Model Querying

  • LMQL is a domain-specific language that transforms prompt engineering into a programmable paradigm using static templates, holes, and scripting constructs.
  • It integrates components like static prompt fragments, variable placeholders, and declarative constraints to efficiently manage interactive flows with reduced API calls.
  • Empirical evaluations show LMQL significantly reduces computational cost and token usage (by up to 85%) while maintaining or improving output accuracy.

LMQL (Language Model Query Language) is a domain-specific programming language and execution framework that generalizes conventional prompt engineering into a programmable paradigm, termed Language Model Programming (LMP). By interleaving static prompt templates, variable placeholders (holes), scripting constructs, and declarative output constraints, LMQL enables the specification and efficient execution of complex interactive flows and constrained generation over large LMs, yielding substantial reductions in computational cost and in the number of model API calls with minimal or no loss in output accuracy (Beurer-Kellner et al., 2022).

1. Foundations and Formal Specification

LMP frames prompting as writing programs rather than mere strings, where both task logic and LM constraints are handled compositionally. Traditional prompt usage abstracts LMs as “black-box” next-token predictors with string-sequence I/O; LMP instead integrates:

  • Static prompt fragments
  • Holes (placeholders for LM-generated strings)
  • Imperative scripting (variables, loops, branches, function calls)
  • Declarative constraints over outputs

An LMQL query is formally defined as a 5-tuple:

$$Q = (D, B, M, W, G)$$

where:

  • $D$ is the decoding strategy: $\mathtt{argmax}$, $\mathtt{sample}(n)$, or $\mathtt{beam}(n)$
  • $B$ is the prompt body, expressed in a restricted Python-like syntax
  • $M$ is the model identifier string (e.g., "gpt-j-6B")
  • $W$ is a boolean constraint (the "where-clause") over hole variables
  • $G$ is an optional distribution clause

Execution constructs an interaction trace $u \in V^*$ (the concatenation of static prompts and LM outputs) together with a scope $\sigma$ mapping holes to strings.
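
For concreteness, the following hedged example query instantiates the 5-tuple; the prompt text, variable names, and bound are illustrative rather than drawn from the paper. Here $D = \mathtt{argmax}$, $B$ is the three-statement body, $M$ is "EleutherAI/gpt-j-6B", $W$ is the conjunctive where-clause (built-ins are described in Section 3), and $G$ is omitted:

argmax
   "Q: What is the capital of France?\n"
   "A: [ANSWER]\n"
   "Is {ANSWER} a coastal city? [VERDICT]"
from
   "EleutherAI/gpt-j-6B"
where
   len(words(ANSWER)) < 5 and VERDICT in ["yes", "no"]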

2. Syntax, Grammar, and Semantics

The core LMQL grammar can be summarized as:

<query> ::= <decoder> <body> from <model> [ where <cond> ] [ distribute <var> over <expr> ]
<decoder> ::= "argmax" | "sample(n=<int>)" | "beam(n=<int>)"
<body> ::= (<stmt>)+
<stmt> ::= <text> | for <py-var> in <py-expr>: | if <cond>: ... [ elif ... ] [ else ... ] | <py-stmt>
<text> ::= STRING_LITERAL
Within string literals, [X] denotes a hole, and {X} interpolates a previously assigned variable.
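
For instance, an illustrative body fragment (prompt text hypothetical) first fills a hole and then reuses its value:

"Name a primary color: [COLOR]\n"
"Explain briefly why {COLOR} is popular: [REASON]"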

Semantically, LMQL eagerly executes code in imperative order. Upon encountering a hole [X], the interpreter splits the prompt at the hole, sends the prefix to the LM, and invokes a constrained decode (see Section 3 below), populating $\sigma[X]$ and appending the result to the interaction trace. Variable interpolation {X} simply inserts the current value.

A sketch of this process is:

def eval_string(s, u, σ):
    # Evaluate one top-level string literal s against the current
    # interaction trace u and variable scope σ.
    if "[X]" in s:                      # hole: query the LM
        s_pre, X, s_post = split_at_hole(s)
        u += s_pre                      # append static prefix to the trace
        v = decode(u, σ)                # constrained decoding (Section 3)
        σ[X] = v                        # bind the hole variable
        u += v + s_post
    elif "{X}" in s:                    # interpolation: reuse a bound value
        u += s.replace("{X}", σ[X])
    else:                               # plain static text
        u += s
    return u, σ

3. Constraints and Eager Enforcement

Constraint enforcement in LMQL is realized via the where-clause, a boolean formula over hole variables. Built-in functions include the following (an illustrative combination follows the list):

  • words(X): splits $\sigma[X]$ into tokens
  • stops_at(X, s): forces the LM to end generation of $X$ when $s$ is emitted
  • $X \in L$: constrains $\sigma[X]$ to a set $L$
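
A hedged sketch of a full where-clause combining these built-ins (the variable names and length bound are assumptions for the example):

where
   len(words(ANSWER)) < 10 and stops_at(ANSWER, ".") and MODE in ["brief", "detailed"]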

For each constraint expression $e$ in $W$, an annotation $[e] \in \{\langle \mathrm{fin}, v \rangle, \langle \mathrm{var} \rangle, \langle \mathrm{inc} \rangle, \langle \mathrm{dec} \rangle\}$ classifies its evaluation state: final with value $v$, variable, or monotonically increasing or decreasing. If any conjunct of $W$ is fixed to $\langle \mathrm{fin}, \mathrm{False} \rangle$, the decode branch aborts immediately (“short-circuiting”).

Token-level validation is accomplished via “FollowMaps,” which determine, for each partial trace $u$ and candidate token $t$, whether extending $u$ with $t$ would make $W$ definitely false. If so, the corresponding mask entry is set to $m_t \leftarrow 0$, excluding that token. This mechanism is related to the Brzozowski derivative over the language of valid outputs.

A simplified sketch of constrained decoding:

def decode(u, σ):
    # Constrained decoding of one hole, given trace u and scope σ.
    v = ""
    while True:
        m = compute_mask(u, σ, v)       # token mask derived from W via FollowMaps
        if all(mi == 0 for mi in m):    # no token can keep W satisfiable
            break
        p = softmax(f(u + v)) * m       # model scores with mask applied
        t = pick_token(p, D)            # select token per decoding strategy D
        if t == eos:
            break
        v += t
    return detokenize(v)

4. Compilation Pipeline and Execution Strategies

LMQL source code is compiled into a Python abstract syntax tree plus metadata. The prompt body ($B$) is mapped to a generator yielding top-level string literals. A dedicated runtime iterates over these yields, invoking the above evaluation logic.

For multi-path decoding ($D = \mathtt{sample}(n)$ or $\mathtt{beam}(n)$), LMQL executes $n$ parallel Python interpreter contexts in lock-step, enabling efficient batching of model inference calls. This significantly reduces the number of expensive LM API invocations compared to naïve per-iteration calls.
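
For illustration, a hypothetical multi-path query (prompt, parameter, and interpolated variable are assumptions) whose four beam branches the runtime advances in lock-step:

beam(n=4)
   "An antonym of {word} is [ANTONYM]"
from
   "gpt2-xl"
where
   len(words(ANTONYM)) < 3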

Empirical measurements, using the metrics $C_{gen}$ (model calls) and $T_{gen}$ (tokens processed) for the baseline and $C_{lmql}$, $T_{lmql}$ for LMQL, demonstrate $C_{lmql}, T_{lmql} \approx 15$–$75\%$ of the baseline figures, corresponding to a $25$–$85\%$ reduction in both call overhead and billable token usage.

5. Illustrative Workflows

LMQL’s expressive power is evident in its ability to specify and efficiently execute complex prompt programs. Key workflow types include:

| Example | LMQL Features Used | Model |
|---|---|---|
| Constrained QA (Odd-One-Out) | Static prompt, holes, output constraints | EleutherAI/gpt-j-6B |
| Interactive ReAct | Scripting, external tool calls, loops | gpt2-xl |
| Calculator Integration | Hooks to Python functions, loops | gpt-j-6B |

  • Constrained QA: Using argmax decoding and a where-clause to constrain intermediate reasoning and answers to valid tokens.
  • Interactive ReAct: Implements search/action-observation paradigms with dynamic branching, tool API calls, and eagerly validated dialogue turns.
  • Arithmetic Tool-Augmentation: interleaves token-level reasoning steps with Python-side arithmetic, feeding external results back into the generation stream, all executed in a single end-to-end decode (sketched below).
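
A condensed sketch in the spirit of the calculator workflow; the Python-side helper calc, the loop bound, and the prompt text are assumptions rather than the paper’s exact program:

argmax
   "Q: {question}\n"
   "A: Let us think step by step.\n"
   for i in range(8):
      "[STEP]"
      if STEP.endswith("="):
         # hypothetical hook: evaluate the generated expression in Python
         "{calc(STEP)}"
      elif "answer is" in STEP:
         break
from
   "EleutherAI/gpt-j-6B"
where
   stops_at(STEP, "=") and stops_at(STEP, "\n")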

6. Empirical Evaluation and Baseline Comparison

Benchmarks on tasks such as chain-of-thought “Odd One Out,” BIG-bench date understanding, HotpotQA (ReAct), and GSM8K arithmetic, run across various LMs (including GPT-J-6B, OPT-30B, gpt2-xl, and GPT-3.5 Davinci), demonstrate:

  • Equal or improved accuracy ($+0$ to $+1.2$ percentage points absolute in QA tasks)
  • Decoder/API call reduction (up to $86\%$ fewer calls)
  • Token consumption reduction (up to $85\%$ fewer tokens)
  • More concise programs: $9$–$22$ LOC in LMQL vs. $30$–$80$ LOC in baseline Python

| Task | Accuracy (%) | Decoder Calls | Billable Tokens | Cost Savings |
|---|---|---|---|---|
| Odd One Out | 33.3 → 34.5 | 7.9 → 6.0 (–25%) | 1179 → 861 (–27%) | 0.63¢/query |
| Date Understanding | 22.9 → 22.9 | 9.8 → 6.8 (–30%) | 4131 → 2845 (–31%) | 2.57¢/query |
| HotpotQA (ReAct) | — | 5 → 1 (–80%) | 3404 → 807 (–76%) | 5.2¢/query |
| Arithmetic (GSM8K) | — | 7 → 1 (–86%) | 3649 → 550 (–85%) | 6.2¢/query |

A plausible implication is that for a broad class of structured generation, interactive reasoning, and tool-augmented pipelines, LMQL provides both resource efficiency and higher-level programmability over conventional prompt APIs.

7. Limitations and Future Directions

Current LMQL features primarily support “hard” constraints; “soft” constraints (such as gradient-based logit biases) require additional research and integration. Some advanced constraints, depending on their complexity, may trigger runtime backtracking. Broad deployment also depends on support for token-level masking hooks in underlying LM APIs, which is not yet standardized across all vendors. Further extensibility and optimizations may yield enhanced expressiveness and more fine-grained control over both semantic and surface-level generation behaviors.


LMQL systematically raises prompt engineering to a programmable, compositional abstraction. By compiling high-level prompt programs to efficient, token-level constrained LM inference, workflows that formerly required substantial ad hoc coding and multiple LM invocations can be realized in a declarative, cost-efficient manner, often with improved semantic precision and tractable end-to-end latency (Beurer-Kellner et al., 2022).

References

Beurer-Kellner, L., Fischer, M., and Vechev, M. (2022). Prompting Is Programming: A Query Language for Large Language Models. arXiv:2212.06094.
