
LOTUS Semantic Operators Overview

Updated 10 December 2025
  • LOTUS semantic operators are formal extensions of classical relational operators that leverage LLMs to interpret natural language 'langexes' for expressive data processing.
  • They generalize filtering, joining, mapping, ranking, and classification by replacing strict Boolean predicates with semantic similarity and generative tasks.
  • Advanced optimization strategies, including cascade filtering and embedding prefiltering, ensure efficiency and accuracy at manageable computational costs.

LOTUS semantic operators are formal extensions of classical relational database operators, designed to perform complex, LLM-driven data transformations over structured and multimodal data within a declarative framework. By interpreting natural-language specifications—termed "langexes"—through LLMs, these operators decouple intent from implementation, enabling flexible, expressive, and optimizable data processing analogous to SQL but operating with semantic predicates, similarity, and generative tasks as core primitives. They subsume and generalize algebraic query relaxation operators from earlier cooperative answering systems and are implemented in the LOTUS system, which provides theoretically grounded semantics, optimization, accuracy guarantees, and empirical evaluations on real-world AI tasks (Wiese, 2012, Patel et al., 16 Jul 2024, Lao et al., 3 Nov 2025).

1. Foundational Concepts and Evolution

LOTUS semantic operators advance the tradition of cooperative query answering, originally characterized by algebraic generalization operators—Dropping Condition (DC), Anti-Instantiation (AI), and Goal Replacement (GR)—that systematically relax query predicates to recover informative results from failing or over-constrained queries. In this framework, operators are defined at the level of relational algebra, allowing principled manipulation and ranking of result sets by semantic similarity metrics (Wiese, 2012).

The modern incarnation builds on this by replacing syntactic relaxations with semantic criteria. Semantic operators generalize selection, join, aggregation, and ranking so that their behavior is specified not by Boolean or symbolic expressions but by natural-language instructions evaluated by LLMs (Patel et al., 16 Jul 2024). This paradigm enables complex reasoning, extraction, and matching tasks over diverse data modalities and domains, all accessible via a declarative and optimizable API (Lao et al., 3 Nov 2025).

2. Catalog of LOTUS Semantic Operators

LOTUS implements five semantic operators, each governed by a natural-language specification and executed—either exactly or via efficient, accuracy-guaranteed approximations—through LLMs:

| Operator | Purpose | Formal Semantics |
| --- | --- | --- |
| sem_filter | Row filtering by an LLM-evaluated predicate | $\mathrm{sem\_filter}(\ell)(T) = \{\, t \in T \mid M(\ell)(t) = 1 \,\}$ |
| sem_join | Semantic join by LLM-specified row matching | $\mathrm{sem\_join}(\ell)(T_1, T_2) = \{\, (t_1, t_2) \mid M(\ell)(t_1, t_2) = 1 \,\}$ |
| sem_map | Row transformation via free-form LLM mapping | $\mathrm{sem\_map}(\ell)(T) = \{\, M(\ell)(t) \mid t \in T \,\}$ |
| sem_rank (sem_topk) | Top-$k$ selection/ordering by LLM-based scoring or comparison | $\mathrm{sem\_rank}(\ell, k)(T) = \langle t_1, \dots, t_k \rangle$ ordered by $M(\ell)(\cdot)$ |
| sem_classify | LLM-based classification into user-defined categories | $\mathrm{sem\_classify}(\ell)(T) = \{\, (t, M(\ell)(t)) \mid t \in T \,\}$ |

Each operator supports text and image input, configurable LLM backends, and additional tuning parameters such as temperature and embedding-based approximations (Lao et al., 3 Nov 2025).
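
The table's set semantics can be made concrete with a minimal plain-Python sketch. This is an illustration of the formal definitions, not the LOTUS library API: llm and score stand in for the idealized model call $M(\ell)$, rows are plain dicts, and all names are assumptions chosen for exposition.

    from typing import Any, Callable, List, Tuple

    Row = dict  # a row is a plain mapping from column name to value

    def sem_filter(llm: Callable[[str, Row], bool], langex: str, rows: List[Row]) -> List[Row]:
        # {t in T | M(langex)(t) = 1}
        return [t for t in rows if llm(langex, t)]

    def sem_join(llm: Callable[[str, Row, Row], bool], langex: str,
                 left: List[Row], right: List[Row]) -> List[Tuple[Row, Row]]:
        # {(t1, t2) | M(langex)(t1, t2) = 1}
        return [(t1, t2) for t1 in left for t2 in right if llm(langex, t1, t2)]

    def sem_map(llm: Callable[[str, Row], Any], langex: str, rows: List[Row]) -> List[Any]:
        # {M(langex)(t) | t in T}
        return [llm(langex, t) for t in rows]

    def sem_topk(score: Callable[[str, Row], float], langex: str, k: int,
                 rows: List[Row]) -> List[Row]:
        # the k rows with the highest LLM-derived score, in descending order
        return sorted(rows, key=lambda t: score(langex, t), reverse=True)[:k]

    def sem_classify(llm: Callable[[str, Row], str], langex: str, rows: List[Row]) -> List[Tuple[Row, str]]:
        # {(t, M(langex)(t)) | t in T}, with outputs drawn from user-defined categories
        return [(t, llm(langex, t)) for t in rows]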

3. Logical Semantics and Query Algebra

The behavior of each LOTUS operator is precisely specified by a "gold" algorithm, defining the operator's output as the result of idealized LLM computations:

  • sem_filter: Applies a Boolean natural-language predicate $\ell: X \to \{0,1\}$, returning all rows $t$ such that $M(\ell)(t) = 1$.
  • sem_join: For input tables $T_1, T_2$ and condition $\ell: (X, Y) \to \{0,1\}$, produces all pairs for which $M(\ell)(t_1, t_2) = 1$. Optimized versions allow a vector-embedding similarity pre-filter.
  • sem_map: Maps each input row $t$ to an LLM-generated output via a mapping instruction $\ell: X \to Y$.
  • sem_rank / sem_topk: Ranks or selects the top-$k$ rows according to an LLM-based scoring or pairwise comparison function. Several algorithmic variants (mapping-sort, full pairwise comparison, quickselect) are supported; the quickselect variant is sketched after this list.
  • sem_classify: Assigns each row to a category from a finite, user-defined set, guided by a classification instruction.
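
Of these variants, quickselect is attractive because it needs only an expected linear number of pairwise comparisons rather than a full tournament. Below is a minimal sketch under stated assumptions: prefer(a, b) stands in for an LLM judgment that a should rank above b, and the recursion structure is illustrative rather than the LOTUS implementation.

    import random
    from typing import Any, Callable, List

    def quickselect_topk(items: List[Any], k: int,
                         prefer: Callable[[Any, Any], bool]) -> List[Any]:
        """Return the k highest-ranked items under a pairwise comparator."""
        if k <= 0 or not items:
            return []
        if k >= len(items):
            return list(items)
        i = random.randrange(len(items))
        pivot, rest = items[i], items[:i] + items[i + 1:]
        above, below = [], []
        for x in rest:                      # one comparator (LLM) call per remaining item
            (above if prefer(x, pivot) else below).append(x)
        if len(above) >= k:                 # the answer lies entirely above the pivot
            return quickselect_topk(above, k, prefer)
        kept = above + [pivot]              # everything above the pivot survives
        return kept + quickselect_topk(below, k - len(kept), prefer)

The returned set is unordered; a final pass (or one of the other variants) can order the k survivors when a full ranking rather than a selection is required.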

Chaining these operators produces a directed acyclic graph (DAG) query plan analogous to relational algebra pipelines, but capable of expressing complex, integrated LLM-based tasks (Patel et al., 16 Jul 2024).
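
For instance, reusing the plain-Python sketch functions from Section 2, a filter followed by a top-k selection forms a two-node linear plan; papers, llm, and score are placeholder inputs and the langexes are illustrative.

    # A two-step plan: semantic filter, then semantic top-k over the survivors.
    # Assumes the sketch functions sem_filter and sem_topk defined earlier.
    relevant = sem_filter(llm, "the {abstract} claims to outperform a baseline", papers)
    highlights = sem_topk(score, "how strongly the {abstract} supports its claim", 5, relevant)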

4. Optimization Strategies and Accuracy Guarantees

LOTUS incorporates a declarative optimization framework that minimizes total query cost, balancing LLM invocation overhead, vector search latency, and in-process computation:

  • Cascade Filtering: Applies a fast, low-precision model for initial screening; low-confidence cases are escalated to a more accurate, slower model (sketched after this list). Aggregate error rates are bounded: for threshold $\tau$ and false-negative rate $\alpha$ from the small model, the total is at most $\alpha + \delta$ (with $\delta$ from the large model) (Patel et al., 16 Jul 2024).
  • Embedding-Prefiltered Joins: Semantic joins leverage dense vector embeddings (e.g., e5-base-v2 for text, CLIP-ViT-B-32 for images) for candidate pruning, reducing the number of expensive LLM calls; this pruning step is also sketched after the list.
  • Approximate Top-$k$: Supports quickselect-based selection and batching strategies; the probability of returning the true top-$k$ decays gracefully with the number of comparisons omitted, quantified via Hoeffding-style bounds.
  • Plan Rewriting and Batching: Logical plan DAGs are rewritten and executed in large batches with shared prompts to saturate parallel GPUs, amortizing token-level costs.
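
A minimal sketch of the cascade pattern for a semantic filter follows. It assumes a proxy confidence score cheap_score and an expensive Boolean judgment oracle (both hypothetical stand-ins), and splits the single threshold described above into an accept and a reject threshold for concreteness; the numeric thresholds are arbitrary.

    from typing import Callable, Iterable, List

    def cascade_filter(rows: Iterable[dict],
                       cheap_score: Callable[[dict], float],  # proxy model's confidence that the predicate holds
                       oracle: Callable[[dict], bool],        # large model's Boolean judgment
                       accept_tau: float = 0.9,
                       reject_tau: float = 0.1) -> List[dict]:
        """Resolve confident rows with the cheap proxy; escalate only the uncertain band."""
        kept = []
        for t in rows:
            p = cheap_score(t)
            if p >= accept_tau:        # confidently accepted by the proxy
                kept.append(t)
            elif p <= reject_tau:      # confidently rejected by the proxy
                continue
            elif oracle(t):            # uncertain band: pay for the large model
                kept.append(t)
        return kept

The large model is invoked only in the uncertain band; widening that band costs more LLM calls but reduces the proxy-induced portion of the error bound quoted above.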
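
In the same spirit, a sketch of the embedding pre-filter for a semantic join: embed and llm_match are assumed helpers (an embedding model and an LLM-based matcher), and the 0.7 cosine threshold is arbitrary.

    import numpy as np

    def prefiltered_sem_join(left: list, right: list, embed, llm_match,
                             sim_threshold: float = 0.7) -> list:
        """Prune join candidates by cosine similarity before any LLM verification."""
        L = np.array([embed(t) for t in left], dtype=float)    # (n_left, d)
        R = np.array([embed(t) for t in right], dtype=float)   # (n_right, d)
        L /= np.linalg.norm(L, axis=1, keepdims=True)          # unit-normalize rows
        R /= np.linalg.norm(R, axis=1, keepdims=True)
        sims = L @ R.T                                          # all-pairs cosine similarity
        matches = []
        for i, j in zip(*np.nonzero(sims >= sim_threshold)):    # surviving candidate pairs only
            if llm_match(left[int(i)], right[int(j)]):          # expensive LLM check on survivors
                matches.append((left[int(i)], right[int(j)]))
        return matches

Only pairs that clear the similarity threshold reach the LLM, which is where the reported speedups over a naïve all-pairs join come from.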

5. Practical Usage and Syntax

End users access LOTUS operators through a Python (Pandas-like) or pseudo-SQL interface, supplying natural-language prompts with {column}-style placeholders. Example queries include:

  • Filtering:
    SELECT * FROM Papers WHERE sem_filter(
      'the {abstract} claims to outperform {baseline}',
      Papers.abstract
    );
  • Semantic join:
    SELECT i.city, a.city FROM ImageTable i JOIN AudioTable a
      ON sem_join('the image and audio contain the same elephant', i.image, a.audio);
  • Top-k ranking and aggregation:
    reviews_df.sem_topk(5, "the {text} is most positive", group_by=["product_id"])

The system supports text, image, and audio modalities, and typical workloads (e.g., movie-review sentiment analysis or e-commerce attribute extraction) achieve near-perfect quality at sub-\$0.10 cost and sub-10 s latency for filter/classify/map operations; joins and ranking incur higher cost and latency, especially without approximation (Lao et al., 3 Nov 2025).

6. Empirical Results and Benchmarks

Systematic evaluations on the SemBench benchmark and real-world datasets demonstrate the expressiveness and efficiency of LOTUS’ semantic operators (Lao et al., 3 Nov 2025, Patel et al., 16 Jul 2024):

  • Empirical performance: In the SemBench Movies scenario (2,000 rows, g-2.5-flash model), sem_filter achieves average F1 ≈ 0.82 with 10.1 s latency; sem_classify achieves ARI = 0.93 (0.004 cost, 2.3 s); semantic joins incur higher latency (F1 = 0.59, latency 536 s), but aggressive embedding filtering provides an 800× speedup over naïve joins.
  • Expressiveness: LOTUS succinctly expresses benchmark pipelines, including multi-stage fact-checking, biomedical label assignment, and document ranking, within a few lines.
  • Comparison: Across 55 benchmark queries, average per-query cost is \$0.62, with average quality ≈ 0.68 and latency ≈ 148 s. sem_filter and sem_map nearly saturate quality (> 0.95) at the lowest cost and latency.

A plausible implication is that the combination of semantic operators and advanced cost-based optimization allows LOTUS to scale LLM-driven analytics with consistent accuracy and manageable resource consumption, outperforming imperative LLM scripting and previous batch pipelines (Patel et al., 16 Jul 2024, Lao et al., 3 Nov 2025).

7. Relationship to Classical Query Relaxation

The classical query relaxation operators (Dropping Condition, Anti-Instantiation, and Goal Replacement) correspond to semantic variants in the modern LOTUS model:

  • Dropping Condition (DC): Generalizes filter predicates by removal; sem_filter achieves analogous relaxation using a weaker, LLM-evaluated natural language predicate.
  • Anti-Instantiation (AI): Loosens equality or selection constraints; sem_join and sem_filter generalize this operation by replacing fixed value comparisons with semantic similarity or flexible LLM-based predicates.
  • Goal Replacement (GR): Substitutes subgoals according to deductive rules; sem_map enables LLM-driven value transformation and summarization, generalizing GR to arbitrary generative mappings.

Both frameworks attach graded similarity or relevance degrees to relaxed answers; the algebraic model uses domain-specific similarity metrics (Wiese, 2012), while LOTUS relies on underlying LLM/embedding-based scoring. Chaining and propagating similarity (e.g., multiplying or taking minimums across operator steps) remains central to ranking and filtering informative answers in both settings.
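
As a small worked example of score propagation (the combination rule is a modeling choice, not something fixed by either framework): if a row passes a relaxed filter with similarity 0.8 and then matches in a relaxed join with similarity 0.6, the two rules score the chained answer differently.

    # Per-step similarity scores for a two-step relaxed query (illustrative values).
    step_scores = [0.8, 0.6]

    product_score = 1.0
    for s in step_scores:
        product_score *= s        # 0.48: every relaxation step lowers the final score

    min_score = min(step_scores)  # 0.60: the weakest step alone determines the final score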
