
Auxiliary Query Optimization

Updated 8 September 2025
  • Auxiliary query optimization is a set of methodologies that enhance traditional query planning by integrating additional data, alternative cost models, and side channels.
  • It refines join processing and execution strategies using advanced techniques such as expanded cost models and auxiliary data structures to minimize redundant computations.
  • It also leverages Bayesian methods, auxiliary objectives, and large language models to improve plan selection and runtime efficiency in complex query environments.

Auxiliary query optimization is a class of methodologies, algorithms, and systems that exploit additional information, structures, objectives, or side channels to improve the efficiency, robustness, and adaptivity of classic query optimization processes. It encompasses approaches that use supplementary data (such as binary auxiliary functions, unsupervised query clusters, or system state information), auxiliary objective functions, alternative matrix representations, and even LLMs to yield superior plans or runtime behavior relative to conventional cost-based optimizers. Auxiliary query optimization is highly interdisciplinary, drawing on research from database systems, machine learning, information theory, compiler optimization, and probabilistic modeling.

1. Expanded Cost Models and Many-to-Many Join Optimization

A key motivation for auxiliary query optimization is the recognition that traditional cost models (often based solely on selectivity and fixed join-order enumeration) fail for workloads involving many-to-many joins or cyclic queries, such as those encountered in complex analytical or graph workloads. The introduction of survival probabilities and fanout explicitly models redundancy in join processing. For a driver tuple traversing a series of joins with match probabilities m_i and fanouts fo_i, the simple product s_i = m_i \cdot fo_i is insufficient to capture the repeated or postponed probe cost prevalent in factorized or compressed intermediate representations.

A refined cost estimate is given by:

\text{Cost} = N \times (1 + m_2 \cdot fo_2 + m_2 \cdot fo_2 \cdot m_3 \cdot fo_3 + \ldots)

where each term models the surviving driver tuples at each join stage and avoids overcounting redundant probes (Kalumin et al., 20 Dec 2024). This cost framework supports optimization algorithms (dynamic programming and robust greedy heuristics) that are explicitly designed to minimize redundant computation, materializations, and probe traffic, especially in the context of bitvector-based early pruning or full semi-join reductions. The resultant robustness analysis demonstrates that these auxiliary-informed models “flatten” the sensitivity of plan quality to errors in selectivity estimation, allowing more stable and simpler optimization even in the absence of perfect statistics.
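As a concrete illustration, the refined cost formula can be evaluated and minimized over join orders in a few lines of Python. This is a hedged sketch: the function names are illustrative, and the brute-force enumeration stands in for the dynamic-programming and greedy algorithms described in the paper.

```python
from itertools import permutations

def join_cost(n_driver, stages):
    """Refined cost: Cost = N * (1 + m2*fo2 + m2*fo2*m3*fo3 + ...).
    stages: list of (match_prob, fanout) pairs for joins 2..k, in order."""
    cost = 1.0
    survivors = 1.0
    for m, fo in stages:
        survivors *= m * fo     # fraction of driver tuples surviving this join
        cost += survivors       # probe cost contributed by this stage
    return n_driver * cost

def best_join_order(n_driver, stages):
    """Pick the stage ordering minimizing the refined cost (brute force)."""
    return min(permutations(stages), key=lambda p: join_cost(n_driver, p))
```

Orderings that surface low-survival joins early dominate: a stage with s_i = 0.1 placed first shrinks every later term, which is exactly the sensitivity-flattening effect the cost framework exploits.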

2. Bayesian and Side-Channel-Informed Optimization

Auxiliary query optimization also refers to processes that use inexpensive surrogate information to steer the optimization of expensive objectives. A representative framework is mixed-type Bayesian optimization with binary auxiliary functions (Zhang et al., 2019), which jointly considers a costly, continuous target f_1 and one or more binary auxiliary functions f_i. The multi-output Gaussian process model,

f_i(x) = m_i + \int K_i(x - x') L(x')\, dx',

enables accurate estimation of the correlations between f_1 and the auxiliary signals. Information-based acquisition functions, e.g., predictive entropy search (MT-PES):

\alpha(y_X, \langle x, i \rangle) = H(y_i(x) \mid y_X) - \mathbb{E}_{x^* \sim p(x^* \mid y_X)}\left[ H(y_i(x) \mid y_X, x^*) \right]

guide the optimization to evaluate the next candidate, either on the expensive target or a cheaper auxiliary function, based on maximum expected information gain adjusted by per-evaluation costs. The joint GP posterior is approximated via expectation propagation and random feature expansion, enabling scalable inference and constraint propagation for alignment between binary surrogates and global targets (e.g., in hyperparameter tuning and reinforcement learning policy search).
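The cost-sensitive selection idea can be sketched minimally. This is a hypothetical simplification, not the paper's MT-PES implementation: it assumes a fitted multi-output GP has already produced per-candidate information-gain estimates, and picks the next evaluation by gain per unit cost.

```python
import math

def next_evaluation(candidates):
    """candidates: list of (point, function_id, info_gain, eval_cost) tuples.
    Returns the candidate maximizing cost-normalized information gain, i.e.
    the acquisition value divided by the per-evaluation cost."""
    return max(candidates, key=lambda c: c[2] / c[3])

def bernoulli_entropy(p):
    """Entropy (in nats) of a binary auxiliary prediction; the entropy
    terms H(.) in the acquisition function reduce to this for binary f_i."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))
```

A cheap binary query with modest gain can beat an expensive target evaluation once costs are factored in, which is exactly when the auxiliary side channel pays off.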

3. Auxiliary Structures: Pruning, Caching, and Parallelization

The construction and judicious use of auxiliary data structures—such as auxiliary graphs, join DAGs, and history structures—are central techniques. For example, GraphMini (Liu et al., 2 Mar 2024) builds auxiliary graphs by proactively pruning adjacency lists based on materialized prefix sets C_h(v_k) and C_h(v_i) at loop depth h:

P_{h,k}(u \mid v_i) = N(u) \cap C_h(v_i)

where N(u) is the original neighbor set of node u. These auxiliary graphs minimize set-intersection costs for future candidate generation in subgraph matching, with online cost models deciding whether the up-front pruning cost is amortized by future gains:

g(I_h, k, u, i) = |e(I_h, k, u)| \cdot (|N(u)| - |C_h(v_i) \cap N(u)|) - (|C_h(v_i)| + |N(u)|)

Such structures facilitate parallelism (nested-loop decomposition), reduce peak memory, and accelerate repetitive sub-query workloads by caching or compressing computation (e.g., factorized join plans (Kalumin et al., 20 Dec 2024), optimized union-of-conjunctive-query rewritings with caching and elimination (Gottlob et al., 2014), or materialized subquery results in staged re-optimization (Pavlopoulou et al., 2020; Zhao et al., 2022)).
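The prune-or-not decision can be sketched as a gated intersection over adjacency lists. This is a hedged illustration: the gate below is a simplified stand-in for GraphMini's g(...) cost model, and the `expected_reuses` parameter is an assumption introduced here to make the amortization explicit.

```python
def pruned_adjacency(adj, candidate_set, expected_reuses):
    """adj: dict vertex -> set of neighbors; candidate_set: C_h(v_i).
    Replace N(u) with N(u) ∩ C_h(v_i) only when the estimated future
    savings (reuses times removed edges) exceed the one-off cost of
    the intersection, |C_h(v_i)| + |N(u)|."""
    out = {}
    for u, neighbors in adj.items():
        pruned = neighbors & candidate_set
        build_cost = len(candidate_set) + len(neighbors)
        savings = expected_reuses * (len(neighbors) - len(pruned))
        out[u] = pruned if savings > build_cost else neighbors
    return out
```

Vertices whose adjacency lists shrink a lot and are probed often get pruned; vertices already aligned with the candidate set are left untouched, so the auxiliary graph never costs more than it saves under the model.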

4. Multitask, Auxiliary Objective, and Hybrid Preference Optimization

Auxiliary objectives—in the form of secondary losses, auxiliary labels, or reward functions—are incorporated into optimization pipelines via multitask or hybrid learning schemes. In neural search ranking, joint training on primary (ranking) and auxiliary (query cluster prediction) tasks using an objective

L(\Theta) = \frac{1}{|Q|} \sum_{q \in Q} \left[ l^{\text{rank}}(q) + \lambda \cdot l^{\text{cluster}}(q) \right]

drives better feature sharing and regularization, enabling unsupervised query clustering (via hierarchical SVD and varimax rotation) to propagate as an auxiliary signal through the network (e.g., in large-scale email ranking (Shen et al., 2018)).
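The joint objective is simple enough to state directly in code. A minimal sketch, assuming the per-query ranking and cluster-prediction losses have already been computed by the two network heads:

```python
def multitask_loss(rank_losses, cluster_losses, lam):
    """L = (1/|Q|) * sum_q [ l_rank(q) + lam * l_cluster(q) ].
    rank_losses / cluster_losses: per-query loss values; lam weights the
    auxiliary cluster-prediction signal against the primary ranking task."""
    assert len(rank_losses) == len(cluster_losses)
    n = len(rank_losses)
    return sum(lr + lam * lc for lr, lc in zip(rank_losses, cluster_losses)) / n
```

Setting lam = 0 recovers plain ranking training; increasing it trades ranking fit for the regularization and feature sharing induced by the auxiliary clusters.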

For LLM alignment and preference optimization, auxiliary designer objectives r_i(x, y) are combined with user preference signals r_p(x, y):

R(x, y) = \sum_i \alpha_i r_i(x, y)

A_t(x, y) = R(x, y) - V_t(x)

\max_\phi~ \mathbb{E}_{x, y}\left[ r_p(x, y) + \alpha A_t(x, y) \right] - \beta\, \mathrm{KL}(\pi_\phi \,\|\, \pi_{\text{ref}})

enabling LLMs to be aligned for safety, readability, or policy objectives using a unified, MLE-based loss (Badrinath et al., 28 May 2024). This approach generalizes direct preference optimization and offline RL into a hybrid auxiliary query optimization paradigm for controllable LLM training.
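The reward-combination step of this pipeline can be written out per sample. A hedged sketch under stated assumptions: the reward values, the value baseline V_t(x), and the KL penalty are treated as precomputed scalars here, standing in for model outputs.

```python
def combined_reward(aux_rewards, weights):
    """R(x, y) = sum_i alpha_i * r_i(x, y) over designer objectives."""
    return sum(a * r for a, r in zip(weights, aux_rewards))

def advantage(aux_rewards, weights, value_baseline):
    """A_t(x, y) = R(x, y) - V_t(x)."""
    return combined_reward(aux_rewards, weights) - value_baseline

def hybrid_objective(pref_reward, adv, alpha, kl_penalty, beta):
    """Per-sample objective: r_p(x, y) + alpha * A_t(x, y) - beta * KL."""
    return pref_reward + alpha * adv - beta * kl_penalty
```

The alpha weights let a designer dial safety or readability objectives up or down without retraining the preference reward, which is the controllability the hybrid scheme is after.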

5. LLMs as Optimization Primitives

Recent frameworks have explored the use of LLMs as primary or auxiliary query optimizers (Yao et al., 10 Mar 2025). In LLMOpt, LLMs are fine-tuned using supervision on serialized query hints and system statistics to (a) generate candidate plan hints directly (LLMOpt(G)), and (b) execute list-wise, global selection of query plans (LLMOpt(S)):

  • LLMOpt(G) leverages LLMs’ generative capabilities to sample diverse, high-quality candidates, avoiding massive heuristic search.
  • LLMOpt(S) treats query candidate selection as a global contextual ranking problem, outputting the index of the best plan from a list in a single forward pass.

The models are trained by maximizing cross-entropy over ground-truth hints or indices, incorporating auxiliary database statistics such as cardinalities, histograms, and value distributions as input features. Empirically, LLMOpt reduces execution latency by 40–67% relative to cost-model and neural ranking baselines (BAO, HybridQO, PostgreSQL heuristics) on the JOB, JOB-EXT, and Stack benchmarks, with selection accuracy in the 60–75% range on held-out sets. This suggests that auxiliary query optimization via LLMs can robustly internalize complex plan-evaluation signals and adapt to evolving workload or data distributions.
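The data-preparation side of the list-wise mode can be sketched as two small utilities. This is a hedged sketch: the serialization format, the statistics fields, and the function names are assumptions introduced for illustration, not the paper's exact encoding.

```python
def serialize_selection_prompt(query_sql, candidates, stats):
    """Serialize one LLMOpt(S)-style selection instance: the SQL text,
    auxiliary statistics (e.g. cardinalities, histograms) as features,
    and the indexed list of candidate plan hints to rank globally."""
    lines = [f"SQL: {query_sql}"]
    lines += [f"STATS {k}: {v}" for k, v in sorted(stats.items())]
    lines += [f"PLAN[{i}]: {hint}" for i, hint in enumerate(candidates)]
    lines.append("Answer with the index of the fastest plan.")
    return "\n".join(lines)

def selection_target(latencies):
    """Ground-truth label for cross-entropy training: the index of the
    plan with the lowest measured execution latency."""
    return min(range(len(latencies)), key=latencies.__getitem__)
```

Framing selection as a single forward pass over the whole list is what distinguishes LLMOpt(S) from pairwise or cost-model scoring: the model sees all candidates and their shared statistics at once.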

6. Integration, System Implications, and Limitations

Auxiliary query optimization frameworks require careful integration into existing DBMS, query processors, and compiler stacks. Key system implications include:

  • Lowered sensitivity to plan selection—robust performance even with noisy statistics (Kalumin et al., 20 Dec 2024).
  • Ability to amortize computation across queries (multi-query sharing, data lineage/provenance systems (Niu et al., 2017), workload-aware caching).
  • Inclusion of code-generation, query compilation, and parallelized search that leverages auxiliary data structures and signals during runtime (Liu et al., 2 Mar 2024).

However, challenges remain: integration overhead (especially when LLM components are involved), parameter tuning in auxiliary/mixed-objective frameworks, resource requirements, and potential for performance degradation when auxiliary data are misaligned or when underlying models require retraining to accommodate distribution shift. A plausible implication is that ongoing research will focus on seamless system integration, formalizing robustness guarantees, and adaptive hybridization of multiple auxiliary sources and structures.

7. Future Directions

Future research in auxiliary query optimization is oriented toward:

  • Expanding the use of auxiliary models and constraints for robust, explainable, and adaptive optimization in heterogeneous and distributed systems.
  • Building dynamic, online frameworks that judiciously allocate computational resources (early pruning, staged re-optimization, materialization) based on auxiliary side information and evolving state, especially in streaming and cloud environments.
  • Integrating multi-modal and multi-objective auxiliary data for more tractable, holistic optimization in both transactional and analytical workloads.
  • Investigating stability, sensitivity, and representation adaptation in hybrid systems where auxiliary and primary objectives may be in partial conflict.

Overall, auxiliary query optimization is set to deepen its foundational and practical significance as data and query complexities advance, bridging optimization, machine learning, and system design.
