
Query-Based Data Splits: Methods & Applications

Updated 1 September 2025
  • Query-based data splits are techniques for partitioning datasets using query patterns and group semantics to achieve representative and balanced splits.
  • Algorithms such as GBS, GISA, and GQSA employ greedy, top-down approaches to minimize expected query counts by reducing entropy and preserving group integrity.
  • Practical applications include improving emergency response, toxic chemical identification, system troubleshooting, and interactive search through efficient query structuring.

Query-based data splits refer to strategies for partitioning datasets that are explicitly motivated by the structure, semantics, or performance requirements of downstream queries, often with the objective of optimizing model training, evaluation, inference, or rapid decision-making. The defining characteristic is that the splitting scheme is not arbitrary or solely random, but instead leverages specific knowledge about query patterns, data features, groupings, task requirements, or evaluation constraints. In applied contexts such as emergency response, database systems, recommendation evaluation, multi-label learning, or experimental benchmarking, query-based splits are used to either (a) reduce the number of queries required to identify the desired target or group, (b) ensure representative or balanced splits for evaluation, or (c) optimize physical data layout or resource usage for query processing.

1. Group Partitioning and Query Grouping in Query-Based Splits

A central principle in query-based data splitting—exemplified by "Group-based Query Learning for rapid diagnosis in time-critical situations" (0911.4511)—is the explicit use of group partitioning for both objects and queries:

  • Object Group Partitioning: The object set $\Theta = \{\theta_1, \theta_2, \ldots, \theta_M\}$ is partitioned into $m$ disjoint groups labeled by a vector $y = (y_1, \ldots, y_M)$, with $y_i \in \{1, \ldots, m\}$. For instance, chemicals may be grouped into "pesticides," "corrosive acids," etc., forming $\{\Theta^1, \ldots, \Theta^m\}$, where $\Theta^i$ is the set of objects in group $i$.
  • Query Grouping: The query set $Q$ is partitioned into $n$ groups, with each query $q_k$ assigned a group label $z_k \in \{1, \ldots, n\}$.

Partitioning both objects and queries allows splitting algorithms to focus on group-level identification (e.g., determining an object's group rather than its singleton identity) and, in human-in-the-loop contexts, to present coherent groups of queries (such as symptom groups) for selection.
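As a concrete illustration, the dual partition of objects and queries can be represented with plain label maps. This is a minimal sketch; the group names and example objects are illustrative, not taken from the paper:

```python
# Objects Theta = {theta_1, ..., theta_M} with group labels y_i in {1, ..., m}
object_group = {"chlorine": 1, "ammonia": 1,        # group 1: corrosive gases
                "sulfuric_acid": 2,                 # group 2: corrosive acids
                "parathion": 3}                     # group 3: pesticides

# Queries Q with group labels z_k in {1, ..., n} (e.g. symptom groups)
query_group = {"skin_burns": 1, "eye_irritation": 1,  # group 1: dermal symptoms
               "tremors": 2}                          # group 2: neurological symptoms

def group_partition(labels):
    """Invert a label map into the disjoint groups {Theta^1, ..., Theta^m}."""
    groups = {}
    for item, g in labels.items():
        groups.setdefault(g, set()).add(item)
    return groups

print(group_partition(object_group))
# {1: {'chlorine', 'ammonia'}, 2: {'sulfuric_acid'}, 3: {'parathion'}}
```

Group-level identification then means determining which key of this partition the unknown object belongs to, rather than pinpointing the object itself.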

In scenarios with persistent noise, the binary response matrix is "dilated" so that each object is replaced by all response strings within $\varepsilon$ errors, casting the noise-robust identification problem as a group identification task over these noisy versions.
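The dilation step can be sketched as enumerating the Hamming ball of radius $\varepsilon$ around each object's response string; the function below is an illustrative reconstruction, not code from the paper:

```python
from itertools import combinations

def dilate(bits, eps):
    """All binary response strings within Hamming distance <= eps of `bits`."""
    n = len(bits)
    out = set()
    for k in range(eps + 1):                      # flip 0, 1, ..., eps positions
        for flips in combinations(range(n), k):
            s = list(bits)
            for i in flips:
                s[i] = 1 - s[i]
            out.add(tuple(s))
    return out

# A length-4 response string dilated with eps = 1:
print(len(dilate((0, 1, 0, 1), 1)))   # 5: the original plus 4 one-bit variants
```

All strings in `dilate(bits, eps)` are then treated as one group, so identifying the group recovers the original object despite up to $\varepsilon$ noisy answers.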

2. Algorithms for Query-Based Data Splits

Classic algorithms for query-based splitting generalize decision tree construction and binary search. For example:

  • Generalized Binary Search (GBS): At each decision node, select the query $a$ that most evenly splits the remaining probability mass $\pi$, minimizing the reduction factor $\rho_a = \max(\pi(\Theta_l(a)), \pi(\Theta_r(a)))/\pi(\Theta_a)$. The expected query count is:

$$\mathbb{E}[K] = H(\pi) + \sum_{a \in I} \pi(\Theta_a)\,[1 - H(\rho_a)]$$

where $H(\cdot)$ is the binary entropy and $I$ is the set of internal nodes.
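A single GBS selection step can be sketched as follows, assuming a binary-response model in which each query $a$ splits the active set into a "left" subset (response 0) and a "right" subset (response 1); the prior and response table are illustrative:

```python
def gbs_select(active, responses, prior):
    """Return the query minimizing rho_a = max(pi_left, pi_right) / pi(Theta_a)."""
    mass = sum(prior[o] for o in active)
    best_q, best_rho = None, float("inf")
    for q in responses:
        left = sum(prior[o] for o in active if responses[q][o] == 0)
        rho = max(left, mass - left) / mass   # reduction factor of this split
        if rho < best_rho:
            best_q, best_rho = q, rho
    return best_q, best_rho

prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
responses = {                                  # responses[q][object] in {0, 1}
    "q1": {"A": 0, "B": 0, "C": 1, "D": 1},    # balanced split: rho = 0.5
    "q2": {"A": 0, "B": 1, "C": 1, "D": 1},    # lopsided split: rho = 0.75
}
q, rho = gbs_select(set(prior), responses, prior)
print(q, rho)   # q1 0.5
```

The greedy choice of the smallest $\rho_a$ is exactly what drives the $1 - H(\rho_a)$ penalty in the expected-cost formula toward zero.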

  • Group Identification Splitting Algorithm (GISA): To address group identification where intra-group responses may differ, GISA introduces a group-specific reduction factor:

$$\rho_a^i = \max(\pi(\Theta^i_l(a)), \pi(\Theta^i_r(a)))/\pi(\Theta^i_a)$$

The expected query cost is:

$$\mathbb{E}[K] = H(\pi_y) + \sum_{a \in I} \pi(\Theta_a)\left[1 - H(\rho_a) + \sum_{i=1}^m \frac{\pi(\Theta^i_a)}{\pi(\Theta_a)}\, H(\rho^i_a)\right]$$

Here, $\pi_y$ is the distribution of group probabilities.
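The bracketed per-node cost term can be evaluated directly from a candidate query's responses. The sketch below is an illustrative reconstruction of that computation (the prior, groups, and responses are made up): a balanced split that keeps every group on one side incurs zero extra cost.

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gisa_cost_term(active, group, responses_a, prior):
    """1 - H(rho_a) + sum_i (pi(Theta^i_a)/pi(Theta_a)) * H(rho_a^i)."""
    mass = sum(prior[o] for o in active)
    left = sum(prior[o] for o in active if responses_a[o] == 0)
    rho_a = max(left, mass - left) / mass
    term = 1 - binary_entropy(rho_a)
    for g in set(group[o] for o in active):
        members = [o for o in active if group[o] == g]
        g_mass = sum(prior[o] for o in members)
        g_left = sum(prior[o] for o in members if responses_a[o] == 0)
        rho_ai = max(g_left, g_mass - g_left) / g_mass   # group-specific factor
        term += (g_mass / mass) * binary_entropy(rho_ai)
    return term

prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
group = {"A": 1, "B": 1, "C": 2, "D": 2}
# This query splits exactly along group lines: rho_a = 0.5, and within each
# group the responses agree, so every rho_a^i = 1 and H(rho_a^i) = 0.
cost = gisa_cost_term(set(prior), group, {"A": 0, "B": 0, "C": 1, "D": 1}, prior)
print(cost)   # 0.0
```

Queries that cut through a group make some $\rho_a^i < 1$, so the second sum becomes positive and the query is penalized, which is how GISA preserves group integrity.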

  • Group Queries Splitting Algorithm (GQSA/GIGQSA): For scenarios where query groups are presented and the user selects a question, splits are chosen to minimize the expected remaining entropy (equivalently, to maximize the expected entropy reduction) across candidates with respect to group assignments.

All these approaches proceed via greedy, top-down recursion, selecting splits that minimize expected cost—either in query count or entropy.
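The shared greedy, top-down recursion can be sketched end to end: repeatedly pick the query with the most balanced split and recurse on each response branch until singletons remain. This is a minimal GBS-style illustration (uniform prior and response table are made up); real implementations also handle ties, noise, and group objectives:

```python
def expected_queries(active, queries, responses, prior):
    """Greedy top-down tree construction; returns E[K] under `prior`."""
    if len(active) <= 1:
        return 0.0
    mass = sum(prior[o] for o in active)
    def rho(q):
        left_mass = sum(prior[o] for o in active if responses[q][o] == 0)
        return max(left_mass, mass - left_mass) / mass
    # only queries that actually split the active set are candidates
    usable = [q for q in queries
              if 0 < sum(responses[q][o] for o in active) < len(active)]
    q = min(usable, key=rho)                     # greedy: most balanced split
    left = {o for o in active if responses[q][o] == 0}
    # every object still active pays one query at this internal node
    return (mass
            + expected_queries(left, queries, responses, prior)
            + expected_queries(active - left, queries, responses, prior))

prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
responses = {
    "q1": {"A": 0, "B": 0, "C": 1, "D": 1},
    "q2": {"A": 0, "B": 1, "C": 1, "D": 1},
    "q3": {"A": 0, "B": 0, "C": 0, "D": 1},
}
ek = expected_queries(set(prior), list(responses), responses, prior)
print(ek)   # 2.0, matching H(pi) = 2 bits for a uniform prior over 4 objects
```

Here the query set happens to allow a perfectly balanced tree, so the greedy recursion attains the entropy lower bound exactly.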

3. Performance Evaluation and Theoretical Guarantees

Performance of query-based split algorithms is quantified via expected query counts, entropy, and distributional parameters. Empirical validation includes:

  • Random and Real-World Data: For group identification, GISA significantly reduces $\mathbb{E}[K]$, with results approaching the group entropy $H(\pi_y)$ as inter-group and intra-group parameters (e.g., $\gamma_w$, $\gamma_b$) improve.
  • Practical Systems Evaluation: In the toxic chemical identification task (WISER), GISA achieved $\mathbb{E}[K] \approx 7.79$, compared to $7.95$ for GBS and $\approx 16.33$ for random search.
  • Noisy Settings: In cases of persistent noise, group identification with cost-based splitting outperformed standard approaches when error levels were moderate rather than extreme.

Results are typically presented in comparative tables (e.g., reporting $\mathbb{E}[K]$ values and confidence intervals) and scatter or curve plots showing query efficiency as a function of data characteristics.

4. Connections to Shannon–Fano Coding and Information Theory

Theoretical grounding for query-based splitting derives from classical coding theory:

  • Shannon–Fano Coding Analogy: Standard decision-tree splitting is formally equivalent to constructing optimal prefix codes, with expected code length bounded below by the entropy $H(\pi)$. A perfectly balanced split ($\rho = 0.5$) yields query counts matching the entropy.
  • Generalization to Groups: In group-based settings, cost functions incorporate additional terms penalizing splits that fail to keep groups together, mathematically extending the coding analogy to the case where symbol groups and their distributions must be preserved:

$$\mathbb{E}[K] = H(\pi) + \sum_{a} \pi(\Theta_a)\,[1 - H(\rho_a)]$$

and, for groups,

$$\sum_{i=1}^m \frac{\pi(\Theta^i_a)}{\pi(\Theta_a)}\, H(\rho^i_a)$$

This theory ensures that splitting algorithms balance both overall probability mass and group integrity, which is critical for rapid group-level decisions.
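The coding bound is easy to verify numerically: the per-node penalty $1 - H(\rho_a)$ vanishes exactly at a balanced split and grows as the split becomes lopsided, so $\mathbb{E}[K] \geq H(\pi)$ with equality for perfectly balanced trees. The values below are purely illustrative:

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for rho in (0.5, 0.6, 0.75, 0.9):
    print(rho, 1 - binary_entropy(rho))
# rho = 0.5  -> penalty 0.0       (balanced split, entropy bound attained)
# rho = 0.75 -> penalty ~0.189    (lopsided split, extra expected queries)
```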

5. Practical and Real-World Applications

Query-based data splits have broad practical utility in environments where structured, rapid identification is needed:

  • Emergency Response and Toxic Chemical Identification: Enables first responders to quickly pinpoint chemical groups via symptom-based queries, optimizing decision paths and reducing query burden.
  • Network Fault Diagnosis/System Troubleshooting: Group-based splitting narrows faults to subnets or modules, directing systematic troubleshooting.
  • Interactive Web Search/Rapid Information Extraction: Algorithms can guide users through grouped queries to efficiently reduce search scope.
  • Human-in-the-Loop Questionnaires/Evaluation: Presenting query groups leverages human strengths in recognizing coherent sets, improving both performance and usability.

Further use cases extend to systems where minimizing user burden and maximizing query efficiency are critical, such as online diagnostics, session-based recommendation, or batch search optimization.

6. Broader Implications and Integration with Other Query-Based Split Frameworks

The group-based query splitting framework informs several related areas:

  • Workload-driven vertical partitioning for query processing (Zhao et al., 2015): Data is split based on query workloads, optimizing for raw data access and loaded partition storage.
  • Stratified and similarity-based splits for classifier training (Farias et al., 2020): Data splits are optimized for representativeness not just in labels, but also in feature space, mitigating distribution mismatch.
  • Evolutionary and multi-objective split optimization (Florez-Revuelta, 2021): Data splits are generated to preserve both label and label-pair distributions, directly impacting query-targeted classifier training and evaluation.
  • Benchmarking splits for fair and reproducible evaluation (Nwoye et al., 2022): Dataset splits and metrics are standardized to enable objective comparisons and progress tracking.
  • Advanced split algorithms for statistical similarity and inference (Vakayil et al., 2021, Leiner et al., 2021): Techniques such as Twinning or data fission generate splits that ensure statistical similarity or explicit separation of information for selection versus inference tasks.

The unifying theme is that splitting strategies rooted in query semantics and group structure yield superior outcomes compared to random or naïve splits, whether the goal is optimization of query count, distributional balance, inference accuracy, or real-world deployment constraints.
