AI-Aware Query Optimization
- AI-aware query optimization is a set of techniques that explicitly models LLM inference costs and predicate selectivity to optimize query plans for hybrid workloads.
- The approach leverages dynamic cost modeling, runtime feedback, and cascade architectures to balance CPU, GPU, and latency trade-offs.
- Empirical results demonstrate 2–8× query latency reductions and 30–60% GPU cost savings compared to traditional, LLM-unaware optimizers.
AI-aware query optimization is a family of methodologies and system architectures that explicitly model and optimize the cost, latency, and selectivity characteristics of AI-powered operations—especially LLM inference—during query planning. Unlike classical relational optimizers, which focus on minimizing CPU, I/O, or memory costs for structured data and operators with well-characterized performance, AI-aware approaches incorporate the variable, high-latency, and high-cost nature of semantic operators directly into the query optimization process. The need for these techniques arises in hybrid workloads that blend structured and unstructured data processing, unified by semantic operators (e.g., AI_FILTER, AI_CLASSIFY) and user-facing requirements for declarative analytics over diverse modalities.
1. Modeling LLM Inference Cost and Predicate Selectivity
In AI-aware query optimization, each AI operator is treated as a black-box user-defined function whose cost is not trivially estimated from existing catalog statistics. For a semantic operator with input cardinality $R$ and LLM model $M$, the per-row cost model is parameterized as $c_{\text{row}}(M) = c_{\text{inv}}(M) + c_{\text{tok}}(M)\,\bar{t}$, giving an operator cost of $$C_{\text{LLM}} = R \cdot \big(c_{\text{inv}}(M) + c_{\text{tok}}(M)\,\bar{t}\big),$$ where $c_{\text{inv}}(M)$ denotes per-invocation overhead (e.g., RPC, model setup), $c_{\text{tok}}(M)$ the per-token cost (e.g., GPU credits/token), and $\bar{t}$ the expected number of tokens processed per row.
For cascaded inference, where a cheap proxy model handles most rows and escalates uncertain cases to a more powerful oracle model, the cost model further refines to $$C_{\text{cascade}} = R \cdot \big(c_{\text{proxy}} + \rho \cdot c_{\text{oracle}}\big),$$ where $c_{\text{proxy}}$ and $c_{\text{oracle}}$ are the per-row costs of the two models and $\rho$ is the fraction of rows routed to the oracle. Empirical selectivity for an AI predicate $p$ is initially estimated by lightweight sampling over a small sample $S$ of the input, $$\hat{s}_p = \frac{|\{r \in S : p(r)\}|}{|S|},$$ followed by runtime feedback for dynamic adjustment. This reflects the central uncertainty in AI-augmented predicates, whose selectivity is unknown at compile time and must be bootstrapped from samples and then refined during execution.
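The following Python sketch shows how these estimates might be computed in practice. The function names, credit rates, and token counts are illustrative assumptions made for this sketch, not details of Cortex AISQL's implementation.

```python
from typing import Callable, Sequence

def llm_operator_cost(num_rows: int, c_inv: float, c_tok: float,
                      avg_tokens_per_row: float) -> float:
    """Operator cost: per-row cost = invocation overhead + per-token cost * expected tokens."""
    return num_rows * (c_inv + c_tok * avg_tokens_per_row)

def cascade_cost(num_rows: int, c_proxy: float, c_oracle: float,
                 oracle_fraction: float) -> float:
    """Cascade cost: every row pays the proxy; a fraction rho escalates to the oracle."""
    return num_rows * (c_proxy + oracle_fraction * c_oracle)

def estimate_selectivity(sample: Sequence, predicate: Callable[[object], bool]) -> float:
    """Bootstrap the selectivity of an AI predicate by evaluating it on a small sample."""
    if not sample:
        return 0.5  # uninformative prior when no sample is available (assumption)
    passed = sum(1 for row in sample if predicate(row))
    return passed / len(sample)

# Illustrative numbers: 1,000,000 rows, proxy at 0.002 credits/row,
# oracle at 0.05 credits/row, 20% of rows escalated to the oracle.
print(cascade_cost(1_000_000, c_proxy=0.002, c_oracle=0.05, oracle_fraction=0.2))
```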
2. Integration of AI Cost Models in Query Planning
AI-aware query optimizers, such as the one implemented in Cortex AISQL (Aggarwal et al., 10 Nov 2025), extend Volcano-style cost models to treat LLM costs as intrinsic planning objectives. Every operator $o$ in a plan $P$ carries both a traditional SQL cost $C_{\text{SQL}}(o)$ (CPU, I/O) and an LLM cost $C_{\text{LLM}}(o)$. Total plan cost becomes $$C(P) = \sum_{o \in P} \big(C_{\text{SQL}}(o) + C_{\text{LLM}}(o)\big).$$ Plan enumeration includes AI-specific variants (e.g., push-down vs. pull-up of AI_FILTER across joins and projections). The optimizer's dynamic programming or greedy search evaluates trade-offs such as pushing AI_FILTER below a join (incurring high LLM cost but reducing join input cardinality) versus delaying it to reduce the number of expensive LLM invocations at the cost of joining more rows.
The plan selection criterion then shifts from minimizing classical join I/O or CPU to minimizing an explicit, weighted sum of CPU cost, GPU (LLM) cost, and, optionally, latency: $$\text{score}(P) = w_{\text{cpu}}\,C_{\text{cpu}}(P) + w_{\text{gpu}}\,C_{\text{gpu}}(P) + w_{\text{lat}}\,L(P).$$ These weights are user-tunable, enabling multi-objective optimization that accommodates production deployment concerns.
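As a concrete, hedged illustration of the push-down vs. pull-up trade-off under this scoring, the sketch below compares the two placements of an AI_FILTER around a join. The cardinalities, selectivities, per-row LLM cost, and weights are made-up values chosen only to make the arithmetic tangible.

```python
def weighted_score(cpu_cost: float, gpu_cost: float, latency: float,
                   w_cpu: float = 1.0, w_gpu: float = 1.0, w_lat: float = 0.0) -> float:
    """Weighted plan score combining CPU cost, GPU (LLM) cost, and latency."""
    return w_cpu * cpu_cost + w_gpu * gpu_cost + w_lat * latency

# Assumed (illustrative) workload parameters.
rows_fact = 1_000_000          # rows reaching the AI_FILTER if pushed below the join
join_selectivity = 0.05        # fraction of fact rows surviving the join
filter_selectivity = 0.3       # fraction of rows passing the AI_FILTER
llm_cost_per_row = 0.01        # GPU credits per LLM invocation
cpu_cost_per_joined_row = 1e-6

# Push-down: run AI_FILTER on all fact rows, then join only the rows that pass.
pushdown_gpu = rows_fact * llm_cost_per_row
pushdown_cpu = rows_fact * filter_selectivity * cpu_cost_per_joined_row

# Pull-up: join first (cheap CPU), then run AI_FILTER only on surviving rows.
pullup_cpu = rows_fact * cpu_cost_per_joined_row
pullup_gpu = rows_fact * join_selectivity * llm_cost_per_row

print("push-down score:", weighted_score(pushdown_cpu, pushdown_gpu, 0.0))
print("pull-up  score:", weighted_score(pullup_cpu, pullup_gpu, 0.0))
# With a selective join, pull-up wins despite joining more rows; a cost-based
# optimizer chooses between the two per query instead of applying one fixed heuristic.
```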
3. Optimizer Architecture and Runtime Adaptivity
To support AI-awareness, the optimizer must implement several critical architectural changes:
- Black-box cost interfaces: Each semantic operator exposes a Cost() callback for cost and selectivity estimation.
- Extended memoization: Plan enumeration explicitly considers alternative placements for AI operators (e.g., before/after join), storing separate cost entries in the memo.
- Runtime statistics & feedback: At execution, actual selectivity and LLM latencies are recorded to refine future cost estimates.
- Dynamic predicate reordering: Pipelines with multiple AI_FILTERs are dynamically reordered at runtime based on observed per-predicate cost and selectivity. The system adaptively swaps predicate order after profiling the initial batches of data (see the reordering sketch after the plan-selection code below).
- Plan hints and AI cascades: The optimizer can insert cascade skeletons (proxy/oracle) and autotune routing thresholds based on empirical cost/quality trade-offs (a minimal threshold-tuning sketch follows this list).
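The sketch below shows one way routing-threshold autotuning might work. It assumes a labeled calibration sample and a proxy model that reports a confidence score; both are assumptions made for illustration rather than details taken from the paper.

```python
from typing import Callable, Sequence, Tuple

def tune_cascade_threshold(
    calibration: Sequence[Tuple[object, bool]],      # (row, ground-truth label) pairs
    proxy: Callable[[object], Tuple[bool, float]],   # returns (prediction, confidence)
    target_accuracy: float = 0.95,
    candidate_thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99),
) -> float:
    """Pick the lowest confidence threshold whose proxy-only answers meet the quality
    target, thereby minimizing the fraction of rows escalated to the expensive oracle."""
    for tau in sorted(candidate_thresholds):
        answered, correct = 0, 0
        for row, truth in calibration:
            prediction, confidence = proxy(row)
            if confidence >= tau:            # proxy answers confidently; no oracle call
                answered += 1
                correct += int(prediction == truth)
        # Accept the threshold if the proxy's confident answers are accurate enough.
        if answered > 0 and correct / answered >= target_accuracy:
            return tau
    return 1.0  # no threshold met the target: escalate everything to the oracle
```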
The core plan-selection logic can be compactly encapsulated as follows, rendered here as a runnable Python sketch; the plan abstraction, cardinality estimates, cost-estimation helpers, and plan enumerator are assumed to be supplied by the surrounding optimizer:

```python
def score_plan(plan, w_cpu: float, w_gpu: float, w_lat: float) -> float:
    """Score a plan as a weighted sum of CPU cost, GPU (LLM) cost, and latency."""
    total_cpu_cost = 0.0
    total_gpu_cost = 0.0
    max_latency = 0.0
    for op in plan.operators():
        rows = plan.cardinality(op.input)                    # estimated input cardinality R
        cpu = estimate_cpu_cost(op, rows)                    # classical CPU/I-O cost
        gpu, lat = estimate_llm_cost_and_latency(op, rows)   # AI-aware LLM cost model
        total_cpu_cost += cpu
        total_gpu_cost += gpu
        max_latency = max(max_latency, lat)
    return w_cpu * total_cpu_cost + w_gpu * total_gpu_cost + w_lat * max_latency


def choose_best_plan(query, w_cpu: float, w_gpu: float, w_lat: float):
    """Enumerate candidate plans and return the one with the lowest weighted score."""
    best_score, best_plan = float("inf"), None
    for plan in enumerate_plans(query):
        score = score_plan(plan, w_cpu, w_gpu, w_lat)
        if score < best_score:
            best_score, best_plan = score, plan
    return best_plan
```
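To make the dynamic-reordering bullet above concrete, the following sketch profiles AI predicates on an initial batch and then orders them by the classic cost/selectivity rank. The AiPredicate wrapper and the cost_of callback are hypothetical helpers introduced for illustration, and the rank formula is the standard textbook choice rather than a detail from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class AiPredicate:
    name: str
    func: Callable[[object], bool]      # wraps the LLM call for one row
    observed_cost: float = 0.0          # mean cost per invocation (e.g., GPU credits)
    observed_selectivity: float = 1.0   # fraction of rows that pass

def profile_predicates(preds: List[AiPredicate], batch: Sequence,
                       cost_of: Callable[[AiPredicate, object], float]) -> None:
    """Measure per-predicate cost and selectivity on an initial batch of rows."""
    for p in preds:
        total_cost, passed = 0.0, 0
        for row in batch:
            total_cost += cost_of(p, row)   # e.g., observed credits or latency of the call
            passed += int(p.func(row))
        p.observed_cost = total_cost / max(len(batch), 1)
        p.observed_selectivity = passed / max(len(batch), 1)

def reorder(preds: List[AiPredicate]) -> List[AiPredicate]:
    """Order predicates by rank = cost / (1 - selectivity): cheap, selective filters run
    first so that later, more expensive predicates see as few rows as possible."""
    return sorted(preds,
                  key=lambda p: p.observed_cost / max(1.0 - p.observed_selectivity, 1e-9))
```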
4. Empirical Performance and Case Studies
AI-aware optimization demonstrates substantial empirical improvements in production and benchmark settings:
- Predicate reordering: For a query on NYT_ARTICLES (1,000 rows), AI-aware reordering yields 3.3×–6.7× speedup, with the largest benefit when a highly selective predicate (e.g., s=0.1) is moved to the front of the pipeline, minimizing LLM calls for the remaining predicates.
- Filter placement relative to joins: Benchmarking three strategies (always-pushdown, always-pullup, and cost-based AI-aware placement), the cost-based approach consistently outperforms both alternatives, with speedups of 1.9–2.6× under variable join selectivity.
- General workload impact: Across customer production logs and synthetic workloads, observed gains include 2–8× reductions in end-to-end query latency and 30–60% cuts to GPU-inference cost against traditional, LLM-unaware optimizers.
These figures directly reflect the non-trivial computational and financial weight of semantic operations, and substantiate the necessity of their integration into cost models.
| Scenario | Default plan (normalized) | AI-aware plan (normalized) | Improvement |
|---|---|---|---|
| Predicate reordering, s=0.1 | 1.00 | 0.15 | 6.7× |
| AI_FILTER pull-up, r=2.0 | 1.00 | 0.56 | 1.8× |
| End-to-end latency (typical) | 1.00 | 0.12–0.50 | 2–8× |
| GPU-credit cost | 1.00 | 0.40–0.70 | 30–60% ↓ |
5. Comparison to Classical Query Optimization
AI-aware query optimization introduces fundamental divergences from classical relational optimization:
- Opaque cost and selectivity: LLM operator cost and selectivity cannot be derived from precomputed statistics and must be estimated via runtime feedback and sampling.
- Dominant UDF cost: The cost of individual AI operations can dwarf all other plan components, requiring the optimizer to penalize plans with excessive AI operator invocations.
- Black-box semantics: LLM inference exhibits high variance, data-dependent behavior, and non-linear scaling with input size, precluding the analytical cost formulas used for scans and joins.
- Multi-objective trade-offs: The cost function must balance latency, GPU credits, and computational resources according to user priorities, unlike the unidimensional optimization in classical systems.
- Runtime adaptivity: As selectivity and operator performance drift, runtime plan modification (predicate reordering, threshold adjustment in cascades) is required.
These distinctions necessitate a re-engineered optimizer architecture and cost model, departing from push-down heuristics and favoring flexible, feedback-driven operator scheduling.
6. Outlook and Implications
AI-aware query optimization represents a paradigm shift in database system architecture for hybrid, semantic workloads. By treating LLM inference as a first-class optimization target, these systems enable sustainable, scalable analytics over both structured and unstructured data. Key design challenges include:
- Continual refinement of black-box cost estimation as hardware, models, and workloads evolve.
- Efficient memoization and plan enumeration to handle the combinatorial explosion of AI operator placements.
- Ensuring robust runtime adaptivity under changing data and selectivity distributions.
- Generalizing plan scoring to user-provided objectives spanning cost, quality, and latency.
Operationally, such optimizers chart the path for high-performance, economically feasible deployment of AI-enhanced analytics platforms. Techniques such as dynamic runtime plan reshaping, cascade skeleton insertion, and explicit user-weighted cost functions will likely generalize to future workloads in analytics, search, and content understanding, as demonstrated in production deployments at scale (Aggarwal et al., 10 Nov 2025).