
LMQL: Efficient Language Model Querying

  • LMQL is a domain-specific language that transforms prompt engineering into a programmable paradigm using static templates, holes, and scripting constructs.
  • It integrates components like static prompt fragments, variable placeholders, and declarative constraints to efficiently manage interactive flows with reduced API calls.
  • Empirical evaluations show LMQL significantly reduces computational cost and token usage (by up to 85%) while maintaining or improving output accuracy.

LMQL (Language Model Query Language) is a domain-specific programming language and execution framework that generalizes conventional prompt engineering into a programmable paradigm, termed Language Model Programming (LMP). By interleaving static prompt templates, variable placeholders (holes), scripting constructs, and declarative output constraints, LMQL enables the specification and efficient execution of complex interactive flows and constrained generation over large LMs, yielding substantial reductions in computational cost and in the number of model API calls with minimal or no loss in output accuracy (Beurer-Kellner et al., 2022).

1. Foundations and Formal Specification

LMP frames prompting as writing programs rather than mere strings, where both task logic and LM constraints are handled compositionally. Traditional prompt usage abstracts LMs as “black-box” next-token predictors with string-sequence I/O; LMP instead integrates:

  • Static prompt fragments
  • Holes (placeholders for LM-generated strings)
  • Imperative scripting (variables, loops, branches, function calls)
  • Declarative constraints over outputs

An LMQL query is formally defined as a 5-tuple:

$$Q = (D, B, M, W, G)$$

where:

  • $D$ is the decoding strategy: $\mathtt{argmax}$, $\mathtt{sample}(n)$, or $\mathtt{beam}(n)$
  • $B$ is the prompt body, expressed in a restricted Python-like syntax
  • $M$ is the model identifier string (e.g., "gpt-j-6B")
  • $W$ is a boolean constraint (the "where-clause") over hole variables
  • $G$ is an optional distribution clause

Execution constructs an interaction trace $u \in V^*$ (the concatenation of static prompts and LM outputs) together with a scope $\sigma$ mapping holes to strings.
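
For concreteness, the following hedged example query instantiates the 5-tuple; the prompt text, variable names, and bound are illustrative rather than drawn from the paper. Here $D = \mathtt{argmax}$, $B$ is the three-statement body, $M$ is "EleutherAI/gpt-j-6B", $W$ is the conjunctive where-clause (built-ins are described in Section 3), and $G$ is omitted:

argmax
   "Q: What is the capital of France?\n"
   "A: [ANSWER]\n"
   "Is {ANSWER} a coastal city? [VERDICT]"
from
   "EleutherAI/gpt-j-6B"
where
   len(words(ANSWER)) < 5 and VERDICT in ["yes", "no"]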

2. Syntax, Grammar, and Semantics

The core LMQL grammar can be summarized as:

<query> ::= <decoder> <body> from <model> [ where <cond> ] [ distribute <var> over <expr> ]
<decoder> ::= "argmax" | "sample(n=<int>)" | "beam(n=<int>)"
<body> ::= (<stmt>)+
<stmt> ::= <text> | for <py-var> in <py-expr>: | if <cond>: ... [ elif ... ] [ else ... ] | <py-stmt>
<text> ::= STRING_LITERAL
Within string literals, [X] denotes a hole, and {X} interpolates a previously assigned variable.
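
For instance, an illustrative body fragment (prompt text hypothetical) first fills a hole and then reuses its value:

"Name a primary color: [COLOR]\n"
"Explain briefly why {COLOR} is popular: [REASON]"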

Semantically, LMQL eagerly executes code in imperative order. Upon encountering a hole [X], the interpreter splits the prompt at the hole, sends the prefix to the LM, and invokes a constrained decode (see Section 3 below), populating $\sigma[X]$ and appending the result to the interaction trace. Variable interpolation {X} simply inserts the current value.

A sketch of this process is:

def eval_string(s, u, σ):
    # Evaluate one top-level string literal s against the current
    # interaction trace u and variable scope σ.
    if "[X]" in s:                      # hole: query the LM
        s_pre, X, s_post = split_at_hole(s)
        u += s_pre                      # append static prefix to the trace
        v = decode(u, σ)                # constrained decoding (Section 3)
        σ[X] = v                        # bind the hole variable
        u += v + s_post
    elif "{X}" in s:                    # interpolation: reuse a bound value
        u += s.replace("{X}", σ[X])
    else:                               # plain static text
        u += s
    return u, σ

3. Constraints and Eager Enforcement

Constraint enforcement in LMQL is realized via the where-clause, a boolean formula over hole variables. Built-in functions include the following (an illustrative combination follows the list):

  • words(X): splits $\sigma[X]$ into tokens
  • stops_at(X, s): forces the LM to end generation of $X$ when $s$ is emitted
  • $X \in L$: constrains $\sigma[X]$ to a set $L$
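
A hedged sketch of a full where-clause combining these built-ins (the variable names and length bound are assumptions for the example):

where
   len(words(ANSWER)) < 10 and stops_at(ANSWER, ".") and MODE in ["brief", "detailed"]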

For each constraint expression $e$ in $W$, an annotation $[e] \in \{\langle \mathrm{fin}, v \rangle, \langle \mathrm{var} \rangle, \langle \mathrm{inc} \rangle, \langle \mathrm{dec} \rangle\}$ classifies its evaluation state: final with value $v$, variable, or monotonically increasing or decreasing. If any conjunct of $W$ is fixed to $\langle \mathrm{fin}, \mathrm{False} \rangle$, the decode branch aborts immediately (“short-circuiting”).

Token-level validation is accomplished via “FollowMaps,” which determine, for each partial trace $u$ and candidate token $t$, whether extending $u$ with $t$ would make $W$ definitely false. If so, the corresponding mask entry is set to $m_t \leftarrow 0$, excluding that token. This mechanism is related to the Brzozowski derivative over the language of valid outputs.

A simplified sketch of constrained decoding:

def decode(u, σ):
    # Constrained decoding of one hole, given trace u and scope σ.
    v = ""
    while True:
        m = compute_mask(u, σ, v)       # token mask derived from W via FollowMaps
        if all(mi == 0 for mi in m):    # no token can keep W satisfiable
            break
        p = softmax(f(u + v)) * m       # model scores with mask applied
        t = pick_token(p, D)            # select token per decoding strategy D
        if t == eos:
            break
        v += t
    return detokenize(v)

4. Compilation Pipeline and Execution Strategies

LMQL source code is compiled into a Python abstract syntax tree plus metadata. The prompt body ($B$) is mapped to a generator yielding top-level string literals. A dedicated runtime iterates over these yields, invoking the above evaluation logic.

For multi-path decoding ($D = \mathtt{sample}(n)$ or $\mathtt{beam}(n)$), LMQL executes $n$ parallel Python interpreter contexts in lock-step, enabling efficient batching of model inference calls. This significantly reduces the number of expensive LM API invocations compared to naïve per-iteration calls.
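
For illustration, a hypothetical multi-path query (prompt, parameter, and interpolated variable are assumptions) whose four beam branches the runtime advances in lock-step:

beam(n=4)
   "An antonym of {word} is [ANTONYM]"
from
   "gpt2-xl"
where
   len(words(ANTONYM)) < 3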

Empirical measurements, using the metrics $C_{gen}$ (model calls) and $T_{gen}$ (tokens processed) for the baseline and $C_{lmql}$, $T_{lmql}$ for LMQL, demonstrate $C_{lmql}, T_{lmql} \approx 15$–$75\%$ of the baseline figures, corresponding to a $25$–$85\%$ reduction in both call overhead and billable token usage.

5. Illustrative Workflows

LMQL’s expressive power is evident in its ability to specify and efficiently execute complex prompt programs. Key workflow types include:

| Example | LMQL Features Used | Model |
|---|---|---|
| Constrained QA (Odd-One-Out) | Static prompt, holes, output constraints | EleutherAI/gpt-j-6B |
| Interactive ReAct | Scripting, external tool calls, loops | gpt2-xl |
| Calculator Integration | Hooks to Python functions, loops | gpt-j-6B |

  • Constrained QA: Using argmax decoding and a where-clause to constrain intermediate reasoning and answers to valid tokens.
  • Interactive ReAct: Implements search/action-observation paradigms with dynamic branching, tool API calls, and eagerly validated dialogue turns.
  • Arithmetic Tool-Augmentation: interleaves token-level reasoning steps with Python-side arithmetic, feeding external results back into the generation stream, all executed in a single end-to-end decode (sketched below).
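
A condensed sketch in the spirit of the calculator workflow; the Python-side helper calc, the loop bound, and the prompt text are assumptions rather than the paper’s exact program:

argmax
   "Q: {question}\n"
   "A: Let us think step by step.\n"
   for i in range(8):
      "[STEP]"
      if STEP.endswith("="):
         # hypothetical hook: evaluate the generated expression in Python
         "{calc(STEP)}"
      elif "answer is" in STEP:
         break
from
   "EleutherAI/gpt-j-6B"
where
   stops_at(STEP, "=") and stops_at(STEP, "\n")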

6. Empirical Evaluation and Baseline Comparison

Benchmarks on tasks such as chain-of-thought “Odd One Out,” BIG-bench date understanding, HotpotQA (ReAct), and GSM8K arithmetic, run across various LMs (including GPT-J-6B, OPT-30B, gpt2-xl, and GPT-3.5 Davinci), demonstrate:

  • Equal or improved accuracy ($+0$ to $+1.2$ percentage points absolute in QA tasks)
  • Decoder/API call reduction (up to $86\%$ fewer calls)
  • Token consumption reduction (up to $85\%$ fewer tokens)
  • More concise programs: $9$–$22$ LOC in LMQL vs. $30$–$80$ LOC in baseline Python

| Task | Accuracy (%) | Decoder Calls | Billable Tokens | Cost Savings |
|---|---|---|---|---|
| Odd One Out | 33.3 → 34.5 | 7.9 → 6.0 (–25%) | 1179 → 861 (–27%) | 0.63¢/query |
| Date Understanding | 22.9 → 22.9 | 9.8 → 6.8 (–30%) | 4131 → 2845 (–31%) | 2.57¢/query |
| HotpotQA (ReAct) | — | 5 → 1 (–80%) | 3404 → 807 (–76%) | 5.2¢/query |
| Arithmetic (GSM8K) | — | 7 → 1 (–86%) | 3649 → 550 (–85%) | 6.2¢/query |

A plausible implication is that for a broad class of structured generation, interactive reasoning, and tool-augmented pipelines, LMQL provides both resource efficiency and higher-level programmability over conventional prompt APIs.

7. Limitations and Future Directions

Current LMQL features primarily support “hard” constraints; “soft” constraints (such as gradient-based logit biases) require additional research and integration. Some advanced constraints, depending on their complexity, may trigger runtime backtracking. Broad deployment also depends on support for token-level masking hooks in underlying LM APIs, which is not yet standardized across all vendors. Further extensibility and optimizations may yield enhanced expressiveness and more fine-grained control over both semantic and surface-level generation behaviors.


LMQL systematically raises prompt engineering to a programmable, compositional abstraction. By compiling high-level prompt programs to efficient, token-level constrained LM inference, workflows that formerly required substantial ad hoc coding and multiple LM invocations can be realized in a declarative, cost-efficient manner, often with improved semantic precision and tractable end-to-end latency (Beurer-Kellner et al., 2022).

References

Beurer-Kellner, L., Fischer, M., and Vechev, M. (2022). Prompting Is Programming: A Query Language for Large Language Models. arXiv:2212.06094.
