Multi-Query Parallelism
- Multi-query parallelism is a design paradigm that concurrently processes distinct queries over shared data to optimize resource allocation and share intermediate results.
- It leverages advanced scheduling techniques—such as level-based heuristics, dynamic programming, and reinforcement learning—to reduce latency and enhance throughput.
- Empirical results demonstrate speedups of 2×–16× in diverse systems including relational databases, graph engines, streaming platforms, and LLM-augmented retrieval pipelines.
Multi-query parallelism refers to the systematic, concurrent processing of multiple distinct queries—typically over a shared data substrate—such that computation, resource allocation, and intermediate result sharing are jointly optimized across the workload. This design paradigm is central to a variety of domains including distributed relational and graph databases, streaming analytics, document parsing, and retrieval-augmented generation with LLMs. Multi-query parallelism raises unique challenges in scheduling, resource management, synchronization, consistency, and workload decomposition, motivating diverse algorithmic and architectural strategies in modern research.
1. Foundational Models and Scheduling Frameworks
Early work on multi-query parallelism, particularly in shared-nothing architectures, established formal resource models and scheduling approaches that are foundational for subsequent systems. In “Multi-Resource Parallel Query Scheduling and Optimization” (Garofalakis et al., 2014), Garofalakis and Ioannidis develop a multi-dimensional resource model that incorporates both time-shared (TS, e.g., CPU, disk) and space-shared (SS, e.g., memory) resources per site. Queries are represented as collections of operator pipelines, each decomposed into “clones” with associated TS work vectors and SS memory demand vectors.
A level-based list scheduling heuristic (LEVELSCHED) is introduced, which partitions pipelines into SS-feasible layers and applies a packing-based scheduling policy to minimize makespan:
- For identical sites and a given set of clones, a lower bound on the optimal makespan is derived.
- The worst-case performance ratio is shown to depend on the numbers of TS and SS resources and on the clone “granularity”, demonstrating that near-optimal multi-query parallelism is possible even under complex blocking constraints and operator interdependencies.
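The layering idea can be illustrated with a minimal Python sketch. It is not the LEVELSCHED algorithm from the paper: the names `Clone`, `Site`, and `level_schedule` are hypothetical, memory is normalized to a per-site capacity of 1.0, and a site's busy time is assumed to be the largest component of its summed TS work vector (perfect overlap of time-shared resources). Clones are placed longest-first on the least-loaded site that still has memory; a new layer opens when no site in any existing layer can hold the clone.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Clone:
    name: str
    ts_work: tuple      # work per time-shared resource, e.g. (cpu, disk)
    ss_demand: float    # fraction of one site's memory (space-shared)


@dataclass
class Site:
    ts_load: list
    ss_used: float = 0.0
    clones: list = field(default_factory=list)

    def length(self):
        # Assumption: TS resources overlap perfectly, so the busy time
        # is the largest component of the accumulated work vector.
        return max(self.ts_load)

    def add(self, clone):
        self.ts_load = [a + b for a, b in zip(self.ts_load, clone.ts_work)]
        self.ss_used += clone.ss_demand
        self.clones.append(clone.name)


def level_schedule(clones, num_sites, ss_capacity=1.0):
    """Pack clones into memory-feasible layers; return (layers, makespan)."""
    if not clones:
        return [], 0.0
    num_ts = len(clones[0].ts_work)
    layers = []
    # List scheduling: consider the longest clones first.
    for clone in sorted(clones, key=lambda c: max(c.ts_work), reverse=True):
        if clone.ss_demand > ss_capacity:
            raise ValueError(f"{clone.name} does not fit on any site")
        for layer in layers:
            feasible = [s for s in layer
                        if s.ss_used + clone.ss_demand <= ss_capacity]
            if feasible:
                break
        else:
            # No existing layer has room: open a new one.
            layer = [Site([0.0] * num_ts) for _ in range(num_sites)]
            layers.append(layer)
            feasible = layer
        min(feasible, key=Site.length).add(clone)
    # Layers run one after another; each lasts as long as its busiest site.
    makespan = sum(max(site.length() for site in layer) for layer in layers)
    return layers, makespan


clones = [
    Clone("A", (4, 1), 0.6),
    Clone("B", (3, 3), 0.6),
    Clone("C", (1, 2), 0.6),
    Clone("D", (2, 2), 0.3),
]
layers, makespan = level_schedule(clones, num_sites=2)
print(len(layers), makespan)
```

In this toy input, clone C cannot share memory with A or B, so it is pushed into a second layer, which shows how space-shared constraints, rather than time-shared work, can determine the number of layers.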
Dynamic scheduling extensions handle the online arrival of query tasks and blocking operator behaviors (e.g., hash-join materialization). Empirical evaluation shows that multi-resource scheduling can yield 2–16× speedup over naïve per-query or hierarchical schemes in resource-constrained settings, with performance that is robust to the granularity parameters.
2. Optimized Multi-query Execution in Relational Systems
In the context of distributed analytical engines, “Multi Query Optimization in GLADE” (Rafay, 2016) provides a concrete framework for maximizing both data reuse and throughput in the execution of SQL workloads. The core problem is to find, for a given batch of queries, a single join plan (modeled as a DAG with multiple “exit” pointers) that satisfies all queries while minimizing total work.
The key approach is a cost-based dynamic programming algorithm that, for each subplan over a subset of the relations, tracks:
- The best plan cost and the set of queries the subplan satisfies (“satList”).
- Selectivity estimates and the resulting intermediate cardinalities.
This technique allows for maximal sharing of scans, selections, and particularly join states across concurrent queries. Considerations include push-down selection, greedy tie-breaking in join orderings, and aggressive subplan pruning.
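The bookkeeping described above can be sketched in Python as a subset dynamic program. This is a simplified illustration, not the GLADE optimizer: the names `SubPlan`, `plan_batch`, and `shared_and_naive_work` are hypothetical, cost is assumed to be the sum of intermediate result sizes, and cardinalities come from a naive independence assumption over pairwise selectivities. The sketch picks each query's individually cheapest plan and then counts join nodes common to several queries once; a full multi-query optimizer would search over plans jointly.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class SubPlan:
    cost: float              # sum of intermediate result sizes in the best plan
    rows: float              # estimated output cardinality of this subplan
    left: frozenset = None   # children of the top join (None for a base scan)
    right: frozenset = None
    sat_list: tuple = ()     # queries answered exactly by this subplan


def estimate_rows(subset, cards, sel):
    # Independence assumption: multiply base cardinalities and selectivities.
    rows = 1.0
    for rel in subset:
        rows *= cards[rel]
    for a, b in combinations(sorted(subset), 2):
        rows *= sel.get(frozenset((a, b)), 1.0)
    return rows


def plan_batch(cards, sel, queries):
    """queries maps a query name to the frozenset of relations it joins."""
    rels = sorted(set().union(*queries.values()))
    best = {}
    for size in range(1, len(rels) + 1):
        for combo in combinations(rels, size):
            subset = frozenset(combo)
            # Prune subplans that cannot contribute to any query.
            if not any(subset <= need for need in queries.values()):
                continue
            sat = tuple(q for q, need in queries.items() if need == subset)
            rows = estimate_rows(subset, cards, sel)
            if size == 1:
                best[subset] = SubPlan(0.0, rows, sat_list=sat)
                continue
            options = []
            for k in range(1, size // 2 + 1):
                for left_combo in combinations(combo, k):
                    left = frozenset(left_combo)
                    right = subset - left
                    cost = best[left].cost + best[right].cost + rows
                    options.append((cost, left, right))
            cost, left, right = min(options, key=lambda o: o[0])
            best[subset] = SubPlan(cost, rows, left, right, sat)
    return best


def join_nodes(best, subset):
    plan = best[subset]
    if plan.left is None:
        return set()
    return {subset} | join_nodes(best, plan.left) | join_nodes(best, plan.right)


def shared_and_naive_work(best, queries):
    naive = sum(best[need].cost for need in queries.values())
    nodes = set().union(*(join_nodes(best, need) for need in queries.values()))
    shared = sum(best[node].rows for node in nodes)  # each join counted once
    return shared, naive


cards = {"A": 64, "B": 64, "C": 1024}
sel = {frozenset("AB"): 1 / 64, frozenset("BC"): 1 / 64}
queries = {"q1": frozenset("AB"), "q2": frozenset("ABC")}
best = plan_batch(cards, sel, queries)
shared, naive = shared_and_naive_work(best, queries)
print(best[frozenset("AB")].sat_list, shared, naive)
```

Here the cheapest plan for `q2` reuses the `A ⋈ B` intermediate that already answers `q1`, so the shared plan does less total work than running the two queries separately.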
Experiments (e.g., TPC-H queries on a 9-node GLADE cluster) show up to a 2× reduction in response time versus naïve per-query execution. The implementation uses generalized linear aggregate (GLA) pipelines with fine-grained operator state sharing and serialization/deserialization to minimize network load.
3. Multi-query Parallelism in LLM-Augmented Retrieval and Reasoning
Recent work has extended multi-query parallelism beyond classical databases to retrieval-augmented generation (RAG) and reasoning agents. In “RAG-R1: Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism” (Tan et al., 30 Jun 2025), the RAG pipeline is generalized so that, for a given input, the model issues a set of parallel queries, each triggering external retrievals and producing a conditional answer distribution. The outputs are then aggregated (e.g., via weighted log-probabilities) to synthesize the final answer.
The pipeline exploits concurrent retrieval, batched conditional generation, and aggregation, yielding a measured 11.1% reduction in wall-clock inference time on an A100 GPU and a 13.2% increase in exact-match accuracy over strong RL baselines across benchmark QA datasets. PPO-based reinforcement learning is used to encourage effective query generation and aggregation.
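As a rough illustration of the retrieve-in-parallel-then-aggregate pattern, the following Python sketch runs several queries concurrently and combines their answer distributions by a weighted sum of log-probabilities. Everything here is an assumption made for illustration: `FAKE_INDEX` stands in for retrieval plus conditional generation, the weights are arbitrary, and the floor log-probability for answers a query did not propose is a hypothetical smoothing choice. It is not the aggregation rule used by RAG-R1.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for "retrieve documents, then score candidate answers".
FAKE_INDEX = {
    "capital of France": {"Paris": 0.7, "Lyon": 0.3},
    "largest French city": {"Paris": 0.6, "Marseille": 0.4},
}

# Assumed smoothing value for answers that a query did not propose.
FLOOR_LOGPROB = math.log(1e-9)


def answer_logprobs(query):
    """Return log-probabilities over candidate answers for one query."""
    return {answer: math.log(p) for answer, p in FAKE_INDEX[query].items()}


def run_parallel_queries(queries, max_workers=4):
    # pool.map preserves input order, so results line up with the weights.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(answer_logprobs, queries))


def aggregate(per_query, weights):
    """Pick the answer with the highest weighted sum of log-probabilities."""
    candidates = set().union(*per_query)
    scores = {
        answer: sum(w * dist.get(answer, FLOOR_LOGPROB)
                    for w, dist in zip(weights, per_query))
        for answer in candidates
    }
    return max(scores, key=scores.get), scores


queries = ["capital of France", "largest French city"]
per_query = run_parallel_queries(queries)
best_answer, scores = aggregate(per_query, weights=[0.5, 0.5])
print(best_answer)
```

Threads suit this pattern because retrieval calls are typically I/O-bound; the wall-clock cost of the retrieval step then approaches that of the slowest single query rather than the sum of all of them.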
Similarly, “ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning” (Zhao et al., 12 Aug 2025) trains LLM agents to recognize and explicitly decompose parallelizable questions (e.g., comparison tasks), issuing multiple independent search sub-queries in parallel. The RL reward function combines accuracy, decomposition, parallel-execution efficiency, and format incentives.
Empirically, this approach yields a 12.7% gain on parallelizable tasks and reduces LLM query calls to 69.6% of the sequential baseline, while maintaining or improving accuracy on both parallel and bridge-style questions.
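A reward with the four components named above could be combined as in the following Python sketch. The functional form, the weights in `RewardWeights`, and the definition of each term are all assumptions for illustration; the paper's actual reward is not reproduced here. The sketch rewards decomposing exactly when a question is parallelizable, and gives an efficiency bonus that shrinks as more sequential search turns are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights, chosen only so that accuracy dominates.
    accuracy: float = 1.0
    decomposition: float = 0.3
    efficiency: float = 0.2
    fmt: float = 0.1


def combined_reward(correct, parallelizable, decomposed, search_turns,
                    well_formed, w=RewardWeights()):
    accuracy = 1.0 if correct else 0.0
    # Decompose when the question is parallelizable, and only then.
    decomposition = 1.0 if decomposed == parallelizable else 0.0
    # Fewer sequential search turns means more work was done in parallel.
    if parallelizable and decomposed:
        efficiency = 1.0 / max(search_turns, 1)
    else:
        efficiency = 0.0
    fmt = 1.0 if well_formed else 0.0
    return (w.accuracy * accuracy + w.decomposition * decomposition
            + w.efficiency * efficiency + w.fmt * fmt)


parallel = combined_reward(True, True, True, search_turns=1, well_formed=True)
sequential = combined_reward(True, True, False, search_turns=2, well_formed=True)
print(parallel > sequential)
```

Under these assumed weights, two rollouts that both answer correctly are still ranked differently: the one that decomposes a parallelizable question and resolves it in a single search turn scores higher than the one that searches sequentially.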
4. Multi-query Parallelism in Graph and Stream Processing
Graph-processing systems commonly face scenarios with large numbers of overlapping or independent queries. In distributed and multicore graph engines, multi-query parallelism raises the challenge of balancing joint resource utilization, minimizing synchronization overhead, and adapting to heterogeneity in both query complexity and graph structure.
- In “Scheduling of Graph Queries: Controlling Intra- and Inter-query Parallelism for a High System Throughput” (Hauck et al., 2021), the proposed engine dynamically samples degree statistics, derives per-query parallelization parameters, and adaptively partitions work into packages, switching between sequential and fully parallel execution depending on frontier sizes and cost model predictions. The dynamic scheduler yields robust throughput (always within 5–10% of a manually optimized static strategy) with up to 2.5× improvement under high-concurrency loads.
- In streaming analytics, multi-query parallelism intersects with concurrency and consistency. “QPOPSS: Query and Parallelism Optimized Space-Saving for Finding Frequent Stream Elements” (Jarlow et al., 2024) presents a concurrent data structure supporting simultaneous updates and (potentially concurrent) frequent-elements queries across multiple threads. The core design integrates:
  - Domain splitting, with each thread owning a disjoint subdomain and a local min-max heap counter.
  - Lock-free delegation filters for inter-thread update routing.
  - Nonblocking per-thread try-locks enabling overlapping queries and updates.
The result is linear throughput scaling up to hundreds of threads, low query latency, and strict memory admissibility (bounded total space). Theoretical bounds establish that concurrency-induced estimation errors are negligible in the limit, and empirical performance surpasses previous multithreaded baselines.
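The domain-splitting idea can be shown with a small, single-threaded Python sketch. It is an illustration under stated assumptions rather than the QPOPSS data structure: `SpaceSaving` is the textbook Space-Saving summary backed by a plain dictionary instead of a min-max heap, `DomainSplitSketch` is a hypothetical wrapper, ownership is decided by a CRC32 hash, and the lock-free delegation filters and try-locks are not modeled at all.

```python
import zlib


class SpaceSaving:
    """Textbook Space-Saving summary with a fixed number of counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Evict the smallest counter; the newcomer inherits its count + 1.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1


class DomainSplitSketch:
    """Each owner summarizes a disjoint part of the element domain."""

    def __init__(self, num_owners, capacity_per_owner):
        self.owners = [SpaceSaving(capacity_per_owner)
                       for _ in range(num_owners)]
        self.total = 0

    def owner_of(self, item):
        # Deterministic hash, so an element always maps to the same owner.
        return zlib.crc32(item.encode("utf-8")) % len(self.owners)

    def update(self, item):
        self.owners[self.owner_of(item)].update(item)
        self.total += 1

    def frequent(self, phi):
        # Domains are disjoint, so a query is a filtered union of summaries.
        threshold = phi * self.total
        return {item: count
                for owner in self.owners
                for item, count in owner.counts.items()
                if count > threshold}


sketch = DomainSplitSketch(num_owners=4, capacity_per_owner=8)
stream = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(20)]
for element in stream:
    sketch.update(element)
heavy = sketch.frequent(phi=0.25)
print(heavy)
```

Because every element has exactly one owner, no counter is ever shared between owners, and a frequent-elements query never has to reconcile duplicate entries across partitions. That property is what makes per-owner synchronization sufficient in a concurrent version.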
5. Parallelism via Query Batching and Region Decoding in Structured Data Extraction
The concept of multi-query parallelism is also instantiated in structured document parsing, as detailed in “Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding” (Yin et al., 28 Jan 2026). Here, after visual feature extraction, a batch of detected regions (bounding boxes) is submitted in a single prompt to the LLM decoder, which returns a concatenated sequence that is subsequently split at <sep> tokens:
- The query-batched approach amortizes computational cost and leverages token-level parallelism (e.g., 64 candidate tokens per region-step).
- The ideal speedup grows with the number of regions batched per prompt; the observed acceleration is lower because of practical prompt and hardware limitations. Combined with token parallelism, this yields a multiplicative end-to-end speedup (5–11× in total).
- Crucially, parsing accuracy (edit distance, semantic fidelity) is not degraded by batching, confirming independence among region queries.
This mechanism exemplifies a recurrent theme: multi-query batching leverages stateless, independent query semantics to extract maximum hardware and computational efficiency without compromising correctness.
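The batching and splitting steps can be sketched in Python as follows. The prompt layout, the function names, and the handling of a trailing separator are assumptions for illustration; only the use of `<sep>` as the delimiter comes from the description above, and the decoder call itself is left out.

```python
SEP = "<sep>"


def build_batched_prompt(boxes):
    """Describe all regions in one prompt (hypothetical prompt format)."""
    lines = [f"[{i}] box=({x0},{y0},{x1},{y1})"
             for i, (x0, y0, x1, y1) in enumerate(boxes)]
    header = f"Transcribe each region; separate the outputs with {SEP}."
    return header + "\n" + "\n".join(lines)


def split_batched_output(decoded, num_regions):
    """Split one decoded sequence back into per-region results."""
    parts = [part.strip() for part in decoded.split(SEP)]
    # Tolerate a single trailing separator after the last region.
    if len(parts) == num_regions + 1 and parts[-1] == "":
        parts.pop()
    if len(parts) != num_regions:
        raise ValueError(
            f"expected {num_regions} regions, decoder returned {len(parts)}")
    return parts


boxes = [(0, 0, 100, 20), (0, 30, 100, 200), (0, 210, 100, 300)]
prompt = build_batched_prompt(boxes)
decoded = "Title<sep> Body text <sep>Table 1<sep>"  # stand-in decoder output
regions = split_batched_output(decoded, num_regions=len(boxes))
print(regions)
```

Checking the number of returned segments against the number of submitted regions is a cheap guard: if the decoder drops or merges a region, the mismatch is caught before results are attached to the wrong bounding boxes.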
6. Optimization Objectives, Scheduling Trade-offs, and Practical Impact
The optimization of multi-query parallelism is application- and system-sensitive. Key trade-offs—reflected throughout the literature—include:
- Resource packing vs. contention: in (Garofalakis et al., 2014), SS packing constraints dominate scheduling under tight memory; in (Hauck et al., 2021), synchronization cost is a primary optimization target.
- Accuracy and query latency: In streaming analytics (Jarlow et al., 2024), domain-splitting and lock-free queuing balance minimal query blocking against global recall and precision.
- Plan-sharing vs. individual tailoring: In distributed SQL (Rafay, 2016), shared plan generation achieves maximal data reuse but may be constrained by variations in selectivity; boundary conditions and pruning heuristics are leveraged for tractability.
Empirical results consistently show that, when dependencies permit, multi-query parallelism offers substantial throughput speedups (from roughly 2× up to 16× in the systems surveyed above), reduced latency, and state-of-the-art accuracy in document and QA tasks.
7. Limitations and Extensions
Challenges persist: workload or query dependency (as in multi-hop or bridge-style QA) can limit parallelizability, and heterogeneity in query structure or data distribution demands adaptivity in both scheduling and reward assignment. Many frameworks rely on hardware- or topology-specific parameters, so generalization to GPUs, FPGAs, or extreme-scale clusters may require substantial reengineering. Fully dynamic insertion and incremental optimization of queries remain open directions in shared execution engines (Rafay, 2016), as does robust support for aggregates, window functions, and heterogeneous computation in multi-query LLM frameworks.
The field continues to converge on the view that principled workload decomposition, explicit sharing of intermediate state, and concurrency-aware scheduling are essential for scalable, efficient, and interpretable multi-query parallelism across modern data and AI systems.