Quality-Based Data Selection Strategies
- Quality-based data selection strategies are systematic methods that evaluate dataset quality using Bayesian, submodular, and sequential decision models to optimize training outcomes and resource usage.
- They operate across various granularities—from individual samples to entire datasets—employing techniques like hierarchical bandits and feedback-driven feature selection to balance exploration and exploitation.
- Empirical studies on benchmarks such as Digit-Five and UCI Adult demonstrate significant improvements in model accuracy and efficiency under resource constraints.
Quality-based data selection strategies encompass algorithmic methods for identifying, scoring, and prioritizing datasets or samples for use in model training, with the explicit goal of maximizing downstream performance subject to resource or operational constraints. These strategies operate at various granularity levels—from individual samples and features to full datasets and data sources—and are particularly important in heterogeneous, multi-source, and resource-constrained environments. The field is shaped by principles from Bayesian bandits, submodular optimization, sequential decision-making, and robust estimation, and spans both theoretical guarantees and scalable, practical implementations.
1. Formalization of Quality-Based Data Selection
Quality-based data selection is typically framed as an optimization problem over a heterogeneous pool of datasets or instances. The objective is to maximize a utility—often measured as downstream model accuracy gain—under constraints such as bandwidth, labeling budget, or time. Consider the canonical dataset selection problem as formalized in (Zhou et al., 11 Dec 2025):
Given a collection of dataset groups $\{G_1, \dots, G_K\}$, where group $G_k$ contains datasets $\{D_{k,1}, \dots, D_{k,n_k}\}$, and a total external pool $\mathcal{P} = \bigcup_{k=1}^{K} G_k$, the aim is to select a subset $S \subseteq \mathcal{P}$ for a local model (with its own data $D_{\mathrm{loc}}$), maximizing

$$U(S) = \mathrm{Acc}(D_{\mathrm{loc}} \cup S) - \mathrm{Acc}(D_{\mathrm{loc}}),$$

subject to a resource constraint $c(S) \le B$, where $c(\cdot)$ is the acquisition cost (e.g., bandwidth or labeling effort) and $B$ the available budget.
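As a concrete reading of this objective, the sketch below scores candidate subsets of the external pool by their marginal accuracy gain under a cardinality budget; `train_and_eval`, `local_data`, and `pool` are hypothetical placeholders for the local training routine and the external datasets, and the brute-force search merely stands in for the bandit and greedy strategies discussed below.

```python
from itertools import combinations

def accuracy_gain(train_and_eval, local_data, subset):
    """Utility U(S): accuracy with the external subset added, minus the baseline."""
    baseline = train_and_eval(list(local_data))
    augmented = train_and_eval(list(local_data) + list(subset))
    return augmented - baseline

def select_exhaustive(train_and_eval, local_data, pool, budget):
    """Brute-force argmax over subsets with |S| <= budget (viable only for tiny pools)."""
    best_subset, best_gain = (), 0.0
    for k in range(1, budget + 1):
        for subset in combinations(pool, k):
            gain = accuracy_gain(train_and_eval, local_data, subset)
            if gain > best_gain:
                best_subset, best_gain = subset, gain
    return best_subset, best_gain
```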
In feature selection for streaming regimes, the analogous control variable is the sequence of features chosen under acquisition and feedback constraints, seeking long-term optimality of model utility (Sahin et al., 2020).
Quality evaluation can be at the sample, feature, dataset, or data-source level, and definitions of quality vary accordingly (see Section 3).
2. Architectures and Algorithmic Frameworks
A range of algorithmic frameworks underpin quality-based data selection, each tailored to the structural and operational constraints of the domain.
2.1 Hierarchical Bayesian Bandits
The DaSH methodology models dataset selection as a two-level Bayesian bandit problem. At the group level, each source group $G_k$ is modeled with a latent utility $\mu_k \sim \mathcal{N}(\mu_0, \tau_0^2)$. Each constituent dataset $D_{k,i}$ inherits a utility $\theta_{k,i} \sim \mathcal{N}(\mu_k, \tau_k^2)$. Observed rewards are generated as $r \sim \mathcal{N}(\theta_{k,i}, \sigma^2)$. Thompson sampling proceeds in two stages: select a group via posterior sampling, then a dataset within the group, updating closed-form Gaussian posteriors (Zhou et al., 11 Dec 2025).
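A minimal sketch of the two-stage sampling loop follows, assuming Gaussian posteriors with known noise. The group-level update here simply reuses the dataset's observed reward, which is an illustrative simplification rather than the exact DaSH posterior recursion, and all names are hypothetical.

```python
import numpy as np

class GaussianPosterior:
    def __init__(self, mu0=0.0, tau0=1.0, noise=0.1):
        self.mu = mu0                       # posterior mean
        self.prec = 1.0 / tau0 ** 2         # posterior precision
        self.noise_prec = 1.0 / noise ** 2  # known reward-noise precision

    def sample(self, rng):
        return rng.normal(self.mu, 1.0 / np.sqrt(self.prec))

    def update(self, reward):
        # Closed-form conjugate update for a Gaussian mean with known noise.
        new_prec = self.prec + self.noise_prec
        self.mu = (self.prec * self.mu + self.noise_prec * reward) / new_prec
        self.prec = new_prec

def select_and_update(groups, rng, observe_reward):
    """groups: dict group_id -> (group GaussianPosterior, dict dataset_id -> GaussianPosterior)."""
    # Stage 1: Thompson-sample a group.
    g = max(groups, key=lambda k: groups[k][0].sample(rng))
    group_post, datasets = groups[g]
    # Stage 2: Thompson-sample a dataset within the chosen group.
    d = max(datasets, key=lambda k: datasets[k].sample(rng))
    # Observe the reward (e.g., accuracy gain) and update both levels.
    r = observe_reward(g, d)
    datasets[d].update(r)
    group_post.update(r)   # simplification: the exact hierarchical update differs
    return g, d, r
```

Sampling at the group level first lets sparse feedback on one dataset inform beliefs about its siblings, which is what amortizes exploration across a large pool.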
2.2 Feedback-Driven Feature Selection
In dynamic systems, feature selection is formulated as an MDP with delayed rewards. At each step, features are chosen incrementally; upon acquisition, the model is updated, and rewards integrate immediate model improvement with an exploration bonus analogous to UCB. Action selection follows $\epsilon$-greedy or its decayed variants, balancing exploitation and exploration under constrained acquisition (Sahin et al., 2020).
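The sketch below illustrates the selection rule described above; it is not the exact FBFS algorithm, and the function names, blending rate, and bonus constant are assumptions.

```python
import math
import random

def ucb_score(value_estimate, times_selected, total_steps, c=1.0):
    """Observed value plus a UCB-style exploration bonus for rarely tried features."""
    bonus = c * math.sqrt(math.log(total_steps + 1) / (times_selected + 1e-9))
    return value_estimate + bonus

def pick_feature(candidates, value_est, counts, step, epsilon=0.1):
    """Epsilon-greedy choice among not-yet-acquired features."""
    candidates = list(candidates)
    if random.random() < epsilon:            # explore uniformly
        return random.choice(candidates)
    return max(candidates,                   # exploit the UCB-adjusted value estimate
               key=lambda f: ucb_score(value_est.get(f, 0.0), counts.get(f, 0), step))

def record_feedback(value_est, counts, feature, delta_f1, lr=0.5):
    """Delayed feedback: blend the observed F1 improvement into the running estimate."""
    counts[feature] = counts.get(feature, 0) + 1
    value_est[feature] = (1 - lr) * value_est.get(feature, 0.0) + lr * delta_f1
```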
2.3 Submodular and Sequential Decision Models
Theoretical guarantees for selection algorithms rely on submodular utility functions: if the utility satisfies monotonicity and diminishing returns, greedy or myopic strategies (as in the Data Shapley framework (Chi et al., 6 Feb 2025)) achieve a $\frac{1}{\kappa}\left(1 - e^{-\kappa}\right)$ approximation to the optimum, where $\kappa$ is the curvature. Advanced surrogates employ bipartite graphs for structured coverage, mapping training samples to validation coverage under learned weights, enabling fast priority-based greedy selection (Chi et al., 6 Feb 2025).
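To make the coverage surrogate concrete, here is a hedged sketch of lazy (priority-queue) greedy selection over a bipartite sample-to-validation coverage graph with precomputed weights; the data layout and weighting scheme are illustrative, not the exact construction of (Chi et al., 6 Feb 2025).

```python
import heapq

def lazy_greedy_coverage(cover, budget):
    """cover: dict sample_id -> dict validation_id -> weight (weighted max-coverage objective)."""
    selected, covered = [], {}          # covered: validation_id -> best weight achieved so far
    # Max-heap of (stale) marginal gains; heapq is a min-heap, so gains are negated.
    heap = [(-sum(w.values()), s) for s, w in cover.items()]
    heapq.heapify(heap)
    while heap and len(selected) < budget:
        neg_gain, s = heapq.heappop(heap)
        # Recompute the true marginal gain of s under the current coverage.
        gain = sum(max(0.0, w - covered.get(v, 0.0)) for v, w in cover[s].items())
        if heap and gain < -heap[0][0]:
            heapq.heappush(heap, (-gain, s))   # stale entry: refresh and retry
            continue
        selected.append(s)
        for v, w in cover[s].items():
            covered[v] = max(covered.get(v, 0.0), w)
    return selected
```

Because the coverage objective is monotone submodular, stale heap entries only overestimate gains, so the lazy re-check preserves the standard greedy guarantee while skipping most gain recomputations.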
3. Quality Metrics and Estimation
Measurement and operationalization of "quality" is a central axis of these strategies, with domain-specific instantiations:
| Approach | Metric/Estimator | Granularity |
|---|---|---|
| DaSH (Zhou et al., 11 Dec 2025) | Posterior mean of reward | Dataset, group |
| FBFS (Sahin et al., 2020) | F1-score (feedback), UCB bonus | Feature |
| Submodular (Chi et al., 6 Feb 2025) | Marginal utility, Shapley value | Sample |
| ERI/EAS/COI (Swazinna et al., 2021) | Relative return, action entropy | Dataset |
For example, ERI is the estimated relative return improvement for an RL dataset: the gap between the best trajectory return $R_{\max}$ and the dataset's average return $\bar{R}$, expressed relative to $\bar{R}$.
Statistical rewards, cross-validation F1, or coverage-based metrics are employed depending on the underlying learning problem, and side information (metadata, embedding-based prior scores) can be folded into Bayesian or contextual variants.
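As an illustration of a dataset-level metric, the sketch below computes a relative return improvement from per-trajectory returns, assuming normalization by the magnitude of the average return; the exact normalization used for ERI in (Swazinna et al., 2021) may differ.

```python
def estimated_relative_improvement(trajectory_returns):
    """Relative gap between the best trajectory return and the dataset average."""
    best = max(trajectory_returns)
    avg = sum(trajectory_returns) / len(trajectory_returns)
    return (best - avg) / (abs(avg) + 1e-9)   # epsilon guards against a zero average
```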
4. Practical Algorithms and Complexity
Algorithmic instantiations aim for computational scalability with provable efficiency:
- DaSH: Per-step computation amounts to sampling and updating closed-form Gaussian posteriors over the groups and the datasets in the selected group, so the amortized cost stays essentially constant as the pool size grows.
- FBFS: Sequential selection, model retraining, and feedback updates are batched efficiently, maintaining near-unconstrained performance while acquiring only a small, fixed budget of features per epoch.
- Submodular Greedy and Bipartite Graphs: Greedy selection with edge-prioritization exploits submodularity, while bipartite surrogates, after precomputing edge weights, yield fast selection passes over the graph, dramatically reducing the need for repeated model retraining.
Resource constraints are encoded via acquisition budgets, per-step cost caps, or proxy limits (e.g., thresholding posterior means at a chosen percentile).
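For instance, a proxy limit can be implemented as a percentile cut on posterior means, as in the short sketch below; the percentile value is a tunable assumption rather than a figure taken from the cited works.

```python
import numpy as np

def threshold_by_percentile(posterior_means, q=75):
    """Keep candidates whose posterior mean utility is at or above the q-th percentile."""
    cutoff = np.percentile(list(posterior_means.values()), q)
    return {d for d, m in posterior_means.items() if m >= cutoff}
```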
5. Empirical Results and Robustness Analyses
Experimental evaluation consistently demonstrates the superiority of quality-based strategies over uniform or flat selection, especially under tight resource budgets or heterogeneous dataset quality.
- On Digit-Five, DaSH achieves 78.3% average accuracy, outperforming flat and core-set baselines by up to +26.2%. Performance is within 0.5% of an upper-bound "Global" selector (which presumes full access) (Zhou et al., 11 Dec 2025).
- On UCI Adult, the FBFS method achieves F1-scores of 0.512—substantially above random or fixed-feature baselines—while remaining competitive with unconstrained acquisition (Sahin et al., 2020).
- Submodular, curvature-aware selections consistently outperform random or LOO-based selections, with robust test accuracy and low variance, especially in high-curvature datasets prone to redundancy (Chi et al., 6 Feb 2025).
Robustness is further exhibited in settings where relevant datasets are absent: the posterior means remain low, and DaSH refrains from over-committing, providing conservative "no gain" signals.
6. Extensions, Adaptations, and Limitations
Hierarchical and submodular structures admit numerous extensions:
- Multi-Objective Selection: Reward vectors or augmented signals can balance fairness, domain coverage, or custom metrics (Zhou et al., 11 Dec 2025).
- Non-Stationary and Streaming Data: Online updates, non-stationary bandits, or sliding-window posteriors handle dynamic dataset availability or evolving group membership (a minimal discounted-update sketch follows this list).
- Side-Information Integration: Prior means in the Bayesian framework can incorporate metadata, similarity scores, or contextual relevance (Zhou et al., 11 Dec 2025).
- RL and Data Source Selection: Submodular approaches extend to streaming or evolving data quality, sliding windows, or multi-constraint objectives (latency, privacy), with greedy approximations retaining performance guarantees (Lin et al., 2016).
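As a minimal sketch of the non-stationary variant referenced in the list above, the update below exponentially discounts accumulated evidence before each Gaussian posterior update so that estimates can track drifting dataset quality; the discount factor and noise precision are illustrative assumptions, not values from the cited papers.

```python
def discounted_gaussian_update(mu, prec, reward, noise_prec=100.0, gamma=0.95):
    """Return the updated (posterior mean, precision) after decaying past evidence."""
    prec = gamma * prec                       # forget a fraction of accumulated precision
    new_prec = prec + noise_prec              # incorporate one new observation
    mu = (prec * mu + noise_prec * reward) / new_prec
    return mu, new_prec
```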
Limitations include computational overhead in influence-function-based metrics, brittleness to noise in quality estimation, and occasional insensitivity to rare state coverage or adversarial data. Curvature analysis reveals that highly substitutable data (high curvature $\kappa$) erodes the gains from marginal utility-based methods (Chi et al., 6 Feb 2025).
7. Implications and Best Practices
Best practices derived from the empirical and theoretical literature include:
- Leverage hierarchical or group structure to efficiently amortize exploration and generalize from sparse feedback.
- Use closed-form Bayesian updates or greedy submodular approximations to maximize utility per acquisition, especially under resource constraints.
- Augment simple quality metrics with diversity or coverage considerations for robust performance in the presence of overlapping or redundant data sources.
- Explicitly integrate side information at the prior or utility function level to steer selection toward domain- or model-relevant datasets.
- Monitor posterior or utility signals for early stopping or "no gain" indications, avoiding over-exploration of poor-quality sources.
Quality-based data selection strategies, when rigorously formulated and carefully implemented, are capable of achieving near-optimal model accuracy with minimal data acquisition, scalable to large and heterogeneous real-world settings (Zhou et al., 11 Dec 2025, Sahin et al., 2020, Chi et al., 6 Feb 2025, Swazinna et al., 2021, Lin et al., 2016).