PrediQL: Enhanced Query Expressiveness
- PrediQL is a framework that unifies declarative SQL predicates, semantic search, and adaptive LLM-guided fuzzing to improve query expressiveness and data integration.
- It leverages program synthesis and intermediate representations to automatically translate ORM logic into optimized SQL, achieving significant performance gains.
- Advanced techniques such as predicate pushdown, lineage inference, and multi-armed bandit strategies provide robust, scalable optimization and vulnerability detection.
PrediQL refers to a set of approaches for enhancing query expressiveness, optimization, data quality, integration of structured and semantic predicates, lineage inference, and advanced automated testing—most recently culminating in LLM-augmented tools for complex API fuzzing. PrediQL methodologies encompass declarative similarity predicates for data cleaning, program synthesis from ORM logic, intermediate representations for query portability, integration of semantic search into SQL, predicate-optimized lineage tracing, and adaptive LLM-guided fuzzing. The following sections detail these methodological innovations and their empirical impact.
1. Declarative Similarity Predicates and Data Quality Primitives
PrediQL emphasizes expressing approximate selection and join predicates for data cleaning entirely in SQL, promoting modularity and system-agnostic integration (0907.2471). Central to this framework is the definition and realization of a broad spectrum of similarity predicates:
- Overlap Predicates: Simple set intersection–based measures such as IntersectSize and Jaccard, expressed in SQL via joins and aggregations on tokenized data. These support high-throughput, low-latency matching but disregard token importance.
- Aggregate Weighted Predicates: Incorporate token weighting, e.g., tf–idf cosine similarity and BM25, using tables of normalized weights and SQL summations. BM25, parameterized by k1 and b, captures document-length and term-frequency normalization.
- Language-Model and HMM Predicates: Probabilistic IR-based similarity functions; for instance, the language-modeling predicate scores a candidate pair as
  sim_LM(s, t) = ∏_{w ∈ t} [ λ · P_ML(w | s) + (1 − λ) · P(w | C) ]
Here, the inner term combines a smoothed maximum-likelihood estimate P_ML(w | s) with a collection-level estimate P(w | C) via the smoothing weight λ, and the predicate can be implemented entirely in SQL over precomputed frequency tables.
- Edit-Based and Combination Predicates: Strategies such as edit distance and SoftTFIDF merge token-level similarity with underlying edit operations, with UDFs used for complex operations after SQL-based candidate filtering.
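The SQL-only style of overlap predicate described above can be sketched as follows; the table and column names are illustrative, not taken from the paper, and the tokenization is assumed to have happened upstream:

```python
import sqlite3

# Jaccard similarity expressed purely in SQL over pre-tokenized data,
# as in the overlap-predicate approach: Jaccard(A, B) = |A ∩ B| / |A ∪ B|.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tokens_a (rec_id INTEGER, token TEXT);
    CREATE TABLE tokens_b (rec_id INTEGER, token TEXT);
    INSERT INTO tokens_a VALUES (1,'data'),(1,'quality'),(1,'sql');
    INSERT INTO tokens_b VALUES (7,'data'),(7,'sql'),(7,'join');
""")

query = """
WITH
  inter AS (                      -- |A ∩ B| via a join on tokens
    SELECT a.rec_id AS a_id, b.rec_id AS b_id, COUNT(*) AS isize
    FROM tokens_a a JOIN tokens_b b ON a.token = b.token
    GROUP BY a.rec_id, b.rec_id),
  size_a AS (SELECT rec_id, COUNT(*) AS n FROM tokens_a GROUP BY rec_id),
  size_b AS (SELECT rec_id, COUNT(*) AS n FROM tokens_b GROUP BY rec_id)
SELECT i.a_id, i.b_id,
       CAST(i.isize AS REAL) / (sa.n + sb.n - i.isize) AS jaccard
FROM inter i
JOIN size_a sa ON sa.rec_id = i.a_id
JOIN size_b sb ON sb.rec_id = i.b_id
WHERE CAST(i.isize AS REAL) / (sa.n + sb.n - i.isize) >= 0.4
"""
matches = list(conn.execute(query))
print(matches)  # the single pair (1, 7) with similarity 0.5
```

Because the whole predicate is one SQL statement, the same candidate filtering runs unchanged on any relational backend, which is the system-agnostic integration the framework emphasizes.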
Performance evaluations demonstrate that probabilistic and aggregate weighted predicates achieve superior accuracy (e.g., higher mean average precision and maximum F1), while overlap predicates are computationally more scalable but less robust to token importance and data errors (0907.2471).
2. Program Synthesis and Intermediate Representations
PrediQL approaches include the automatic translation of imperative ORM-based application code into optimized SQL using program synthesis (Cheung et al., 2012). Here the QBS algorithm identifies application fragments involving data selection, projection, joining, and aggregation:
- Theory of Ordered Relations: A formalism close to SQL that models list operations and join/aggregation semantics, with operators defined recursively over ordered lists.
- Synthesis and Verification: SKETCH-based inference produces loop invariants and postconditions, which are validated against the code using Hoare-style axioms and then systematically translated into SQL.
- Practical Impact: In empirical studies, synthesizing SQL from Java Hibernate code led to significant performance improvements, especially for loop-heavy logic not optimally handled by ORMs. However, synthesis complexity may increase for fragments with nontrivial data dependency or external references.
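The kind of rewrite QBS performs can be illustrated with a minimal Python analogue of an ORM-heavy fragment; the schema and function names here are hypothetical stand-ins for the Java/Hibernate code studied in the paper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, role TEXT, logins INTEGER);
    INSERT INTO users VALUES (1,'admin',10),(2,'user',3),(3,'admin',7);
""")

# Imperative, ORM-style fragment: fetch every row, then filter and
# aggregate in application code -- the loop-heavy pattern QBS targets.
def total_admin_logins_imperative():
    total = 0
    for _id, role, logins in conn.execute("SELECT id, role, logins FROM users"):
        if role == 'admin':
            total += logins
    return total

# The declarative equivalent a QBS-like synthesizer would emit: the
# selection and aggregation are pushed into a single SQL query, letting
# the database optimizer do the work.
def total_admin_logins_synthesized():
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(logins), 0) FROM users WHERE role = 'admin'"
    ).fetchone()
    return total

assert total_admin_logins_imperative() == total_admin_logins_synthesized() == 17
```

The synthesizer's job is precisely to prove (via the inferred loop invariant and postcondition) that these two functions compute the same value, which licenses the rewrite.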
Portability and language independence are realized using intermediate representations such as QIR—a lambda-calculus–based intermediate language extended with query operators (filter, join, aggregation) (Vernoux, 2016). QIR allows application- and database-agnostic query portability, supports analytical reduction for optimizing query representations, and provides formal guarantees of transformation optimality via a quantitative measure on query representations.
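A toy version of such an intermediate representation makes the idea of analytical reduction concrete; the operator names and the single rewrite rule below are illustrative, not QIR's actual constructors:

```python
from dataclasses import dataclass
from typing import Callable

# A minimal QIR-like IR: query operators as plain terms, plus one
# rewrite (fusing adjacent Filters) of the kind applied before a term
# is translated into a backend query.

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    pred: Callable[[dict], bool]
    child: object

def fuse_filters(term):
    """Rewrite Filter(p, Filter(q, t)) into Filter(p AND q, t)."""
    if isinstance(term, Filter):
        child = fuse_filters(term.child)
        if isinstance(child, Filter):
            p, q = term.pred, child.pred
            return Filter(lambda row: p(row) and q(row), child.child)
        return Filter(term.pred, child)
    return term

term = Filter(lambda r: r["age"] > 30,
              Filter(lambda r: r["city"] == "Oslo", Scan("users")))
fused = fuse_filters(term)
assert isinstance(fused.child, Scan)               # a single Filter remains
assert fused.pred({"age": 40, "city": "Oslo"})
assert not fused.pred({"age": 20, "city": "Oslo"})
```

A measure on terms (e.g., counting operators a backend cannot evaluate natively) then lets the reducer decide whether a rewrite moved the representation closer to an optimal, fully-pushable form.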
3. Integration of Semantic Predicates in SQL
PrediQL encompasses frameworks that bridge classical SQL with semantic and multimodal queries. The SSQL (Semantic SQL) framework extends the PrediQL paradigm by adding a SEMANTIC keyword, unifying structured queries and semantic search within an SQL syntax (Mittal et al., 5 Apr 2024):
- Architecture: Combines a relational database, an SQL extension layer, a vector-based semantic engine (e.g., CLIP and FAISS), and a human-in-the-loop optimization loop.
- Execution Pipeline: SQL predicates are executed first to filter candidates; semantic similarity is then computed over the filtered results as the Euclidean distance between L2-normalized embeddings, d(q, x) = ‖e(q) − e(x)‖₂.
- Optimization: Interactive binary search algorithms quickly home in on optimal similarity thresholds via user feedback.
- Empirical Evaluation: Using only semantic queries failed catastrophically (over 60% failure) for count and spatial queries, reinforcing the necessity of combined predicate optimization for robustness. Human feedback improved the balance of recall and precision for complex, multi-modal queries, surpassing both pure SQL and pure semantic search baselines.
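The two-stage pipeline above can be sketched as follows, with toy 2-D vectors standing in for CLIP embeddings and an illustrative schema (the real system would hold the vectors in a FAISS index rather than table columns):

```python
import sqlite3
import math

# SSQL-style hybrid execution: the relational predicate filters
# candidates first, then semantic similarity (Euclidean distance
# between L2-normalized embeddings) ranks the survivors.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photos (id INTEGER, year INTEGER, ex REAL, ey REAL);
    INSERT INTO photos VALUES
      (1, 2021, 1.0, 0.0), (2, 2021, 0.6, 0.8), (3, 2015, 1.0, 0.0);
""")

def normalize(x, y):
    n = math.hypot(x, y)
    return (x / n, y / n)

def semantic_select(sql_where, query_vec, threshold):
    qx, qy = normalize(*query_vec)
    hits = []
    # Stage 1: the structured SQL predicate narrows the candidate set.
    for pid, _year, ex, ey in conn.execute(
            f"SELECT id, year, ex, ey FROM photos WHERE {sql_where}"):
        vx, vy = normalize(ex, ey)
        # Stage 2: Euclidean distance over normalized embeddings.
        dist = math.hypot(qx - vx, qy - vy)
        if dist <= threshold:
            hits.append((pid, dist))
    return sorted(hits, key=lambda h: h[1])

# "Photos from 2021 semantically close to the query vector (1, 0)."
results = semantic_select("year = 2021", (2.0, 0.0), threshold=0.5)
assert [pid for pid, _ in results] == [1]  # id 3 fails the SQL filter, id 2 the distance cut
```

The threshold passed to `semantic_select` is exactly the quantity the interactive binary-search loop tunes from user feedback.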
4. Predicate Optimization and Efficient Lineage Inference
PrediQL solutions tackle key optimization challenges in query execution, particularly for column-oriented systems and lineage tracing:
- Predicate Disjunction Optimization: BestD/Update is a polynomial-time, provably optimal algorithm for minimizing the predicate evaluation cost in column-stores with disjunctive predicates (Kim et al., 2020). Its cost model assumes that predicate evaluation far outweighs set operations and satisfies essential monotonicity and triangle-inequality–like properties.
- Algorithmic Composition: When paired with the Hanani ordering, BestD/Update becomes EvalPred, which plans in polynomial time and produces optimal plans for Boolean predicate trees of depth 2. Empirical results show up to 28× speedup over exponential-time algorithms on synthetic and TPC-H/CH-benchmark queries.
- Lineage Tracing via Predicate Pushdown: In data pipelines, PredTrace infers fine-grained row-level lineage by pushing row-selection predicates downward through the operator DAG (Lin et al., 22 Dec 2024). The core idea is to construct a predicate describing the output rows of interest and propagate it through each operator, migrating it precisely or approximately depending on the operator's properties.
- For example, in TPC-H Q4 the lineage is traced by pushing predicates through Sort, GroupBy, and SemiJoin, saving intermediate results or refining predicates as necessary.
- PredTrace covers all 22 TPC-H queries and speeds up lineage queries by 10× or more compared to prior approaches, reducing storage overhead by up to 99% and lineage-query time by up to 270× in sampled real-world pipelines.
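A minimal sketch of predicate-pushdown lineage, with toy stand-ins for the pipeline operators (the real system works over an operator DAG with precise/approximate migration rules per operator):

```python
# PredTrace-style lineage: to find the input rows behind one output row,
# build a predicate on the output and migrate it back through each
# operator of the pipeline.
rows = [
    {"region": "EU", "amount": 10},
    {"region": "EU", "amount": 5},
    {"region": "US", "amount": 7},
]

# Pipeline: Filter(amount > 4) -> GroupBy(region, SUM(amount))
def run_pipeline(data):
    filtered = [r for r in data if r["amount"] > 4]
    out = {}
    for r in filtered:
        out[r["region"]] = out.get(r["region"], 0) + r["amount"]
    return out

def lineage(data, region_key):
    # Output predicate: "the row for this group key". Pushed through
    # GroupBy, it becomes a predicate on the grouping column; pushed
    # through Filter, it is conjoined with the filter condition.
    pushed = lambda r: r["region"] == region_key and r["amount"] > 4
    return [r for r in data if pushed(r)]

assert run_pipeline(rows)["EU"] == 15
assert lineage(rows, "EU") == [{"region": "EU", "amount": 10},
                               {"region": "EU", "amount": 5}]
```

Because the pushed-down predicate is just a filter over the original input, lineage becomes an ordinary selection query rather than a scan over saved intermediate results, which is where the storage and query-time savings come from.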
5. LLM-Augmented Automated Testing and Adaptive Query Exploration
Recent PrediQL research focuses on LLM-guided, retrieval-augmented fuzzing for exhaustive and adaptive GraphQL API exploration (Liu et al., 12 Oct 2025):
- LLM-Guided Fuzzer Architecture: Combines prompts constructed with modular schema fragments, execution traces, and historical error–query pairs—enabling semantically valid and context-aware generation of queries.
- Multi-Armed Bandit Strategy Selection: The choice of prompting strategy is modeled as a multi-armed bandit optimization, maximizing coverage of schema operations and exploiting successful strategies over time. Thompson Sampling balances exploration (novel queries) and exploitation (successful test patterns).
- Context-Aware Vulnerability Detection: The system employs LLMs to interpret not only query responses but also error messages and status codes to detect injection, access-control, and information-disclosure vulnerabilities.
- Retrieval-Augmented Self-Correction: A FAISS index supports efficient retrieval of execution and error traces for prompt enrichment in subsequent iterations, enabling rapid model correction and progressive learning.
- Empirical Results: Evaluated across open-source and benchmark APIs (e.g., UserWallet, Countries, TCGDex), PrediQL achieved up to 100% schema coverage and significantly higher vulnerability discovery rates—average 16% improvement, with up to 50% improvement on complex targets—relative to baseline fuzzers such as ZAP, BurpSuite, EvoMaster, and GraphQLer.
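The bandit-based strategy selection can be sketched as a Bernoulli multi-armed bandit with Thompson Sampling; the strategy names and simulated reward rates below are illustrative, not PrediQL's actual configuration:

```python
import random

# Thompson Sampling over prompting strategies: each strategy keeps a
# Beta posterior over its success rate (e.g., "query added schema
# coverage or surfaced a finding"), and each round plays the strategy
# with the highest sampled rate.
random.seed(0)

strategies = ["schema-fragment", "error-replay", "trace-augmented"]
# Beta(1, 1) priors, stored as [successes + 1, failures + 1].
posterior = {s: [1, 1] for s in strategies}

def pick_strategy():
    draws = {s: random.betavariate(a, b) for s, (a, b) in posterior.items()}
    return max(draws, key=draws.get)

def update(strategy, success):
    if success:
        posterior[strategy][0] += 1
    else:
        posterior[strategy][1] += 1

# Simulated fuzzing loop: "trace-augmented" secretly succeeds 70% of
# the time, the others 20%; the bandit should converge onto it.
true_rate = {"schema-fragment": 0.2, "error-replay": 0.2, "trace-augmented": 0.7}
for _ in range(500):
    s = pick_strategy()
    update(s, random.random() < true_rate[s])

best = max(posterior, key=lambda s: posterior[s][0] / sum(posterior[s]))
assert best == "trace-augmented"
```

The Beta-posterior sampling is what balances exploration and exploitation: an under-tried strategy still has a wide posterior and occasionally draws a high sample, so no prompting strategy is abandoned prematurely.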
6. Future Directions and Implications
PrediQL methodologies illustrate the convergence of declarative data quality primitives, program synthesis, semantic search integration, advanced query optimization, and model-driven adaptive testing:
- The unified querying paradigm—combining SQL predicates with semantic, multi-modal, and programmatically derived predicates—positions PrediQL as a central approach for handling heterogeneous data and complex analytical requirements.
- Predicate pushdown, intermediate representations, and context-aware reasoning underlie scalable solutions for lineage, integration, and security.
- The integration of human-in-the-loop feedback and LLM-augmented automation extends dynamic discoverability and correctness far beyond previous static or heuristics-driven systems.
- Empirical benchmarks reinforce the necessity of hybrid query methodologies and adaptive optimization for reliability, data quality, and exploit coverage in evolving data environments.
A plausible implication is that as data systems grow in complexity—encompassing structured, unstructured, and multimodal content—the techniques embodied in PrediQL will form the basis for next-generation, expressive, and robust data and API management platforms.