LMQL: Efficient Language Model Querying
- LMQL is a domain-specific language that transforms prompt engineering into a programmable paradigm using static templates, holes, and scripting constructs.
- It integrates components like static prompt fragments, variable placeholders, and declarative constraints to efficiently manage interactive flows with reduced API calls.
- Empirical evaluations show LMQL significantly reduces computational cost and token usage (by up to 85%) while maintaining or improving output accuracy.
LMQL (Language Model Query Language) is a domain-specific programming language and execution framework designed to generalize conventional prompt engineering into a programmable paradigm, termed Language Model Programming (LMP). By interleaving static prompt templates, variable placeholders (holes), scripting constructs, and declarative output constraints, LMQL enables the specification and efficient execution of complex interactive flows and constrained generation over large LMs, yielding substantial reductions in computational cost and number of model API calls with minimal or no loss in output accuracy (Beurer-Kellner et al., 2022).
1. Foundations and Formal Specification
Language Model Programming frames prompting as writing programs rather than mere strings, with both task logic and LM constraints handled compositionally. Traditional prompt usage abstracts LMs as “black-box” next-token predictors with string-sequence I/O; LMP instead integrates:
- Static prompt fragments
- Holes (placeholders for LM-generated strings)
- Imperative scripting (variables, loops, branches, function calls)
- Declarative constraints over outputs
An LMQL query is formally defined as a 5-tuple $(D, Q, m, W, E)$, where:
- $D$ is the decoding strategy: `argmax`, `sample`, or `beam`
- $Q$ is the prompt body, expressed in a restricted Python-like syntax
- $m$ is the model identifier string (e.g., "gpt-j-6B")
- $W$ is a boolean constraint (“where-clause”) over hole variables
- $E$ is an optional distribution clause
Execution constructs an interaction trace $u$ (the concatenation of static prompt fragments and LM outputs) together with a scope $\sigma$ mapping hole variables to the strings generated for them.
2. Syntax, Grammar, and Semantics
The core LMQL grammar can be summarized as:
```
<query>   ::= <decoder> <body> from <model>
              [ where <cond> ]
              [ distribute <var> over <expr> ]
<decoder> ::= "argmax" | "sample(n=<int>)" | "beam(n=<int>)"
<body>    ::= (<stmt>)+
<stmt>    ::= <text>
            | for <py-var> in <py-expr>: ...
            | if <cond>: ... [ elif ... ] [ else ... ]
            | <py-stmt>
<text>    ::= STRING_LITERAL
```
`[X]` denotes a hole, and `{X}` interpolates a previously assigned variable.
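For illustration, a minimal query conforming to this grammar might read as follows (the prompt text and constraint are chosen for exposition and are not taken verbatim from the paper):

```
argmax
   "Q: What is the capital of France?\n"
   "A: [ANSWER]"
from
   "EleutherAI/gpt-j-6B"
where
   stops_at(ANSWER, ".")
```

Here `[ANSWER]` is a hole to be filled by the model, and the where-clause bounds how its value may be generated.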
Semantically, LMQL executes the prompt body eagerly, in imperative order. On encountering a hole `[X]`, the interpreter splits the prompt at the hole, sends the prefix to the LM, and invokes a constrained decode (see Section 3 below), binding the result to `X` in the scope and appending it to the interaction trace. Variable interpolation `{X}` simply inserts the current value.
A sketch of this process is:
```python
def eval_string(s, u, σ):
    if "[X]" in s:
        s_p, X, s_s = split(s)        # prefix, hole name, suffix
        u += s_p
        v = decode(u, σ)              # constrained decoding (Section 3)
        σ[X] = v
        u += v + s_s
    elif "{X}" in s:
        v = σ[X]
        u += s.replace("{X}", v)
    else:
        u += s
    return u, σ
```
3. Constraints and Eager Enforcement
Constraint enforcement in LMQL is realized via the where-clause, a boolean formula over hole variables. Built-in functions include:
- `words(X)`: splits the generated value of `X` into tokens/words, e.g. for length constraints
- `stops_at(X, s)`: forces the LM to end generation of `X` once the string `s` is emitted
- `in` (set membership): constrains `X` to a fixed set of admissible values
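These primitives compose into ordinary boolean formulas. As an illustrative (not paper-verbatim) example, a where-clause over two holes might read:

```
where
   len(words(THOUGHT)) < 20 and stops_at(THOUGHT, ".") and ANSWER in ["yes", "no"]
```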
For each constraint expression in $W$, an annotation classifies its state as fixed, variable, or monotonically increasing/decreasing. If any conjunct in $W$ is fixed to false, the decode branch aborts immediately (“short-circuiting”).
Token-level validation is accomplished via “FollowMaps,” which determine, for each partial trace $u$ and candidate token $t$, whether extending $u$ with $t$ would make $W$ definitely false. If so, the decoder masks out that token. This mechanism is related to the Brzozowski derivative over the language of valid outputs.
A simplified sketch of constrained decoding:
```python
def decode(u, σ):
    v = ""
    while True:
        m = compute_mask(u, σ, v)          # derived from W and the FollowMap
        if all(mi == 0 for mi in m):
            break                          # no admissible continuation
        p = softmax(f(u + v)) * m          # model scores, masked
        t = pick_token(p, D)               # according to decoding strategy D
        if t == eos:
            break
        v += t
    return detokenize(v)
```
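The mask computation itself is constraint-specific. The following is a minimal sketch, not the paper's implementation, of how a single `stops_at(X, s)`-style condition could contribute a vocabulary mask, assuming `v` is the text generated for the hole so far and `vocab` is a list of token strings:

```python
def stops_at_mask(v, s, vocab):
    # Sketch only: one possible FollowMap-style contribution of a
    # stops_at(X, s) constraint. Returns a 0/1 mask over the vocabulary.
    if s in v:
        return [0] * len(vocab)            # stop phrase already emitted: halt
    mask = []
    for tok in vocab:
        cand = v + tok
        # a token remains admissible unless it would generate *past* the
        # stopping phrase, i.e. s occurs in the candidate but not at its end
        ok = (s not in cand) or cand.endswith(s)
        mask.append(1 if ok else 0)
    return mask
```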
4. Compilation Pipeline and Execution Strategies
LMQL source code is compiled into a Python abstract syntax tree plus metadata. The prompt body $Q$ is mapped to a generator yielding its top-level string literals. A dedicated runtime iterates over these yields, invoking the evaluation logic above.
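A minimal sketch of this runtime loop, assuming the compiled body is exposed as a Python generator of its top-level string literals and reusing the `eval_string` routine sketched in Section 2:

```python
def run_compiled_body(body_gen):
    # body_gen yields fragments such as "Q: ...\n" or "A: [ANSWER]" in order.
    u, σ = "", {}                            # interaction trace and hole scope
    for fragment in body_gen:
        u, σ = eval_string(fragment, u, σ)   # from the Section 2 sketch
    return u, σ
```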
For multi-path decoding (`sample` or `beam`), LMQL executes parallel Python interpreter contexts in lock-step, enabling efficient batching of model inference calls. This approach substantially reduces the number of expensive LM API invocations compared to naïve per-iteration calls.
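A hedged sketch of this lock-step batching, where `batched_logits_fn` is an assumed helper that scores several prefixes in one model invocation:

```python
def lockstep_step(branches, batched_logits_fn):
    # Advance all live decoding branches together: their next-token requests
    # are served by ONE batched LM call rather than one call per branch.
    prefixes = [b["trace"] for b in branches]
    all_logits = batched_logits_fn(prefixes)      # single batched invocation
    for branch, logits in zip(branches, all_logits):
        branch["next_logits"] = logits            # branch-local decoding resumes
    return branches
```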
Empirical measurements of the number of model (decoder) calls and of tokens processed, relative to the corresponding baselines, show that LMQL requires only a fraction of the baseline figures, corresponding to roughly $25$–$85\%$ reductions in both call overhead and billable token usage.
5. Illustrative Workflows
LMQL’s expressive power is evident in its ability to specify and efficiently execute complex prompt programs. Key workflow types include:
| Example | LMQL Features Used | Model |
|---|---|---|
| Constrained QA (Odd-One-Out) | Static prompt, holes, output constraints | EleutherAI/gpt-j-6B |
| Interactive ReAct | Scripting, external tool calls, loops | gpt2-xl |
| Calculator Integration | Hooks to Python functions, loops | gpt-j-6B |
- Constrained QA: Uses `argmax` decoding and a `where`-clause to constrain intermediate reasoning and answers to valid tokens.
- Interactive ReAct: Implements search/action-observation paradigms with dynamic branching, tool API calls, and eagerly validated dialogue turns.
- Arithmetic Tool-Augmentation: Interleaves token-level reasoning steps with Python-side arithmetic, feeding external results back into the generation stream, all executed in a single end-to-end decode (see the sketch below).
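As a sketch only (the helper `calc()` and the prompt text are illustrative assumptions, not the paper's verbatim program), such a tool-augmented query could be written in the grammar of Section 2 as:

```
argmax
   "Q: What is 37 * 12 + 5?\n"
   "Reasoning: [STEPS]\n"
   result = calc(STEPS)              # Python-side arithmetic hook (assumed)
   "Therefore, the answer is {result}."
from
   "EleutherAI/gpt-j-6B"
where
   stops_at(STEPS, "\n")
```

The Python statement runs during decoding, and `{result}` interpolates its value back into the prompt stream, so the whole flow executes as one decode.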
6. Empirical Evaluation and Baseline Comparison
Benchmarked on tasks such as Chain-of-Thought “Odd One Out,” BIG-bench date understanding, HotpotQA (ReAct), and GSM8K arithmetic, various LMs (including GPT-J-6B, OPT-30B, gpt2-xl, and GPT-3.5 Davinci) demonstrate:
- Equal or improved accuracy (up to roughly $1.2$ percentage points absolute gain in QA tasks)
- Decoder/API call reduction (up to $86\%$ fewer calls)
- Token consumption reduction (up to $85\%$ fewer tokens)
- More concise programs: $9$–$22$ LOC for LMQL vs. $30$–$80$ LOC in baseline Python
| Task | Accuracy (%, baseline→LMQL) | Decoder Calls (baseline→LMQL) | Billable Tokens (baseline→LMQL) | Cost Savings per Query |
|---|---|---|---|---|
| Odd One Out | 33.3→34.5 | 7.9→6.0 (–25%) | 1179→861 (–27%) | 0.63¢/query |
| Date Understanding | 22.9→22.9 | 9.8→6.8 (–30%) | 4131→2845 (–31%) | 2.57¢/query |
| HotpotQA (ReAct) | — | 5→1 (–80%) | 3404→807 (–76%) | 5.2¢/query |
| Arithmetic (GSM8K) | — | 7→1 (–86%) | 3649→550 (–85%) | 6.2¢/query |
A plausible implication is that for a broad class of structured generation, interactive reasoning, and tool-augmented pipelines, LMQL provides both resource efficiency and higher-level programmability over conventional prompt APIs.
7. Limitations and Future Directions
Current LMQL features primarily support “hard” constraints; “soft” constraints (such as gradient-based logit biases) require additional research and integration. Some advanced constraints, depending on their complexity, may trigger runtime backtracking. Broad deployment also depends on support for token-level masking hooks in underlying LM APIs, which is not yet standardized across all vendors. Further extensibility and optimizations may yield enhanced expressiveness and more fine-grained control over both semantic and surface-level generation behaviors.
LMQL systematically raises prompt engineering to a programmable, compositional abstraction. By compiling high-level prompt programs to efficient, token-level constrained LM inference, workflows that formerly required substantial ad hoc coding and multiple LM invocations can be realized in a declarative, cost-efficient manner, often with improved semantic precision and tractable end-to-end latency (Beurer-Kellner et al., 2022).