- The paper introduces SemBench, a comprehensive benchmark for semantic query processing engines (SQPEs), systems that extend SQL with LLM-powered semantic operators to process multimodal queries.
- It evaluates SQPEs across diverse scenarios spanning text, image, and audio data, measuring execution cost, latency, and result accuracy.
- Results reveal critical trade-offs between result quality and cost efficiency, highlighting differences in operator implementations between academic and industrial systems.
SemBench: A Benchmark for Semantic Query Processing Engines
Semantic query processing engines (SQPEs) are an emerging class of data processing systems designed to handle queries that combine traditional database operations with the reasoning abilities of large language models (LLMs). The paper "SemBench: A Benchmark for Semantic Query Processing Engines" introduces a benchmark specifically designed to evaluate how well SQPEs handle multimodal data queries. The benchmark spans diverse scenarios, modalities, and operator types, providing a comprehensive evaluation framework.
Benchmark Design and Dimensions
SemBench targets systems that extend SQL with semantic operators, which are evaluated by LLMs. These operators include semantic filters, joins, mappings, rankings, and classifications. Each operator is parameterized by a natural-language instruction, enabling complex operations on data types such as text, images, and audio that traditional SQL handles poorly. A minimal sketch of such an operator appears below.
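To make the idea concrete, here is a minimal sketch of a semantic filter in Python. This is not the API of any benchmarked system (LOTUS, Palimpzest, ThalamusDB, or BigQuery); `call_llm` is a hypothetical stand-in for a real model client.

```python
from typing import Callable, Iterable

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    raise NotImplementedError("wire up a real model client here")

def sem_filter(rows: Iterable[dict], condition: str,
               llm: Callable[[str], str] = call_llm) -> list[dict]:
    """Keep the rows for which the LLM judges the natural-language condition true."""
    kept = []
    for row in rows:
        prompt = (f"Row: {row}\n"
                  f"Condition: {condition}\n"
                  "Answer strictly 'yes' or 'no'.")
        # One LLM call per row: this is where the cost and latency come from.
        if llm(prompt).strip().lower().startswith("yes"):
            kept.append(row)
    return kept

# Usage (with a real client wired in):
# positive = sem_filter(reviews, "the review praises the acting")
```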
The benchmark evaluates systems across several distinct scenarios, ranging from text-based movie reviews to complex medical data processing and wildlife sound and image analysis. By varying modalities and query requirements, it assesses performance across data types and operator functionalities, better reflecting real-world SQPE applications.
Evaluation Methodology
A critical aspect of SemBench is its evaluation of SQPEs along three axes: processing cost, execution time, and result quality. Each semantic operator incurs significant computational overhead from LLM calls, which upends traditional query optimization strategies. The benchmark covers academic systems such as LOTUS, Palimpzest, and ThalamusDB, alongside industrial solutions such as Google BigQuery. A toy illustration of the optimization shift follows.
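Because an LLM-backed predicate can cost orders of magnitude more per row than a conventional one, cheap relational filters should run first. The toy cost model below illustrates this; the numbers are purely illustrative, not benchmark figures.

```python
from itertools import permutations

# Illustrative per-row costs (in dollars) and selectivities; not benchmark data.
filters = [
    {"name": "semantic: review praises the acting", "cost_per_row": 2e-3, "selectivity": 0.30},
    {"name": "relational: year >= 2020",            "cost_per_row": 1e-7, "selectivity": 0.25},
]

def plan_cost(order, n_rows):
    """Total cost of applying the filters in the given order."""
    total = 0.0
    for f in order:
        total += n_rows * f["cost_per_row"]
        n_rows *= f["selectivity"]  # only surviving rows reach the next filter
    return total

best = min(permutations(filters), key=lambda order: plan_cost(order, 100_000))
print([f["name"] for f in best])  # the cheap relational predicate runs first
```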
Performance is measured with established metrics: execution cost, latency, and accuracy against ground truth derived from manually labeled datasets. This reveals how well each SQPE trades off LLM usage costs against result accuracy, a central tension in SQPE development.
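Sketched below is how these three metrics might be computed, assuming per-query token counts, wall-clock timings, and manually labeled ground truth are available. The price parameters are placeholders, not figures from the paper.

```python
import time

def accuracy(predicted: list, gold: list) -> float:
    """Fraction of outputs that match the manually labeled ground truth."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def dollar_cost(prompt_tokens: int, completion_tokens: int,
                in_price: float, out_price: float) -> float:
    """Query cost given provider prices in dollars per million tokens (placeholder rates)."""
    return prompt_tokens / 1e6 * in_price + completion_tokens / 1e6 * out_price

start = time.perf_counter()
# ... execute the semantic query against the engine under test ...
latency_seconds = time.perf_counter() - start
```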
Results and Observations
The benchmark results show varied performance across the evaluated systems, highlighting the strengths and weaknesses of current SQPE implementations. Notably, execution cost and result quality differed substantially depending on operator implementations and query characteristics, particularly for queries involving semantic joins or large numbers of LLM calls.
While BigQuery generally exhibited high result quality, academic systems such as LOTUS kept costs low through techniques like embedding-based approximations for semantic joins (sketched below). Palimpzest showed robust result quality but struggled with cost efficiency.
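The embedding-based approximation credited to LOTUS can be sketched as follows: cheap embedding similarity prunes the quadratic space of candidate pairs, and only the survivors are verified with an LLM. Here `embed` and `call_llm` are hypothetical stand-ins for real model clients, and the 0.7 threshold is an arbitrary placeholder.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def approx_sem_join(left, right, condition, embed, call_llm, threshold=0.7):
    """Join rows whose pairing satisfies the condition, pruning with embeddings first."""
    left_vecs = [embed(str(row)) for row in left]
    right_vecs = [embed(str(row)) for row in right]
    matches = []
    for i, l in enumerate(left):
        for j, r in enumerate(right):
            if cosine(left_vecs[i], right_vecs[j]) < threshold:
                continue  # cheap pruning: dissimilar pairs never reach the LLM
            prompt = (f"Left: {l}\nRight: {r}\n"
                      f"Does this pair satisfy: {condition}? Answer 'yes' or 'no'.")
            if call_llm(prompt).strip().lower().startswith("yes"):
                matches.append((l, r))
    return matches
```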
Implementation Insights
SQPEs must navigate several implementation choices, including prompt design, operator optimization, and caching strategies. The stochastic nature of LLM outputs compounds this complexity, making robust prompt engineering essential for both accuracy and efficiency. Techniques such as embedding-based approximations, batching, and early-termination conditions proved vital for optimizing operator implementations; two of them are sketched below.
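Two of these optimizations, caching and batching, are easy to sketch. The fragment below assumes that exact prompt repetition makes caching safe and that the model can answer several rows per prompt; it is an illustration, not any benchmarked engine's actual implementation.

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cached_llm(prompt: str, call_llm) -> str:
    """Memoize LLM responses so repeated prompts are paid for only once."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]

def batched_prompts(rows: list, condition: str, batch_size: int = 10):
    """Pack several rows into one prompt to amortize per-call overhead."""
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        numbered = "\n".join(f"{k}. {json.dumps(r)}" for k, r in enumerate(batch))
        yield (f"Condition: {condition}\n{numbered}\n"
               "Answer one 'yes' or 'no' per numbered row.")
```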
Additionally, the benchmark underscores the importance of efficient multimodal support, which remains nascent in most SQPEs, and the need to extend semantic operators beyond text to handle image and audio data seamlessly.
Conclusion
SemBench provides an evaluation framework tailored to the unique demands of SQPEs, advancing the understanding and development of these systems. As SQPE capabilities expand, SemBench can chart performance improvements and guide future research. Continuous updates, informed by evolving SQPE features and applications, will keep the benchmark central to driving semantic query processing innovation.