Minimal Sufficient Subset Selection
- Minimal Sufficient Subset Selection is the task of choosing the smallest set from a larger collection that retains essential predictive, explanatory, or representational properties.
- Practical methods exploit submodularity and supermodularity with greedy and heuristic algorithms, achieving near-optimal solutions despite the NP-hardness of exact selection.
- Practical approaches balance computational tractability and accuracy, with applications in feature selection, data subset selection, and model interpretability across high-dimensional settings.
Minimal Sufficient Subset Selection is the task of identifying the smallest subset from a larger collection (variables, input regions, data points, or solutions) that retains essential predictive, explanatory, or representational power with respect to a target objective or model. This combinatorial problem appears across statistics, machine learning, optimization, and interpretability, with formulations that are often NP-hard. Modern approaches balance computational tractability against theoretical optimality, leveraging supermodular/submodular properties, approximation algorithms, and structural assumptions to obtain near-optimal solutions in high-dimensional regimes or under model constraints.
1. Formal Problem Definitions and Applications
Minimal sufficient subset selection is instantiated through diverse formalizations according to application context:
- Linear Models and Feature Selection: The classical regression subset selection problem seeks the smallest variable subset $S$ yielding minimal loss (e.g., mean squared error, mean absolute error) when fitting $y \approx X_S \beta$ using only the variables in $S$ (Park et al., 2017, Pia et al., 2018).
- Column Subset Selection (CSS): Given a data matrix $X$, find $k$ columns whose span approximates $X$ best in Frobenius norm, i.e., minimize $\|X - P_S X\|_F$ where $P_S$ projects onto the span of the selected columns (Shitov, 2017, Sood et al., 2023). The statistical analogue is principal variable selection via minimizing the trace of the residual covariance; a numerical sketch of this objective appears after this list.
- Inference in GMRFs: In Gaussian Markov random fields, the objective is to choose a set $S$ of observed nodes so that the total conditional variance of the unobserved nodes, $\sum_{i \notin S} \operatorname{Var}(X_i \mid X_S)$, is minimized (Mahalanabis et al., 2012).
- Data Subset Selection: In computer vision, the goal is to select a small representative or diverse subset of unlabeled/labeled data to maximize accuracy or efficiency, using facility-location (coverage) or disparity-min (diversity) objectives (Kaushal et al., 2018, Kaushal et al., 2019).
- Attribution and Counterfactual Explanations: In black-box models, identify the smallest set $S$ of input regions such that ablating $S$ significantly changes the model's prediction (“minimal cause” or “sufficient explanation”) (Chen et al., 1 Apr 2025, Chen et al., 15 Nov 2025).
- Parameter-efficient Fine-tuning: Select a minimal subset of trainable parameter groups in large models to maximize performance for a fixed parameter budget (Xu et al., 18 May 2025).
- Multi-objective Optimization: From a Pareto front, select a k-sized subset that minimizes expected loss for a decision maker, reducing cognitive load while maintaining solution quality (Ishibuchi et al., 2020).
- Information-theoretic Feature Selection: Select a minimal feature set $S$ minimizing the conditional entropy $H(Y \mid X_S)$ of a target variable $Y$, subject to confidence bounds (Romero et al., 31 Oct 2025).
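As a concrete illustration of the column subset selection objective above, the following minimal sketch (using NumPy; the function name and toy data are illustrative, not taken from the cited works) evaluates the Frobenius-norm residual of projecting $X$ onto the span of a candidate column subset.

```python
import numpy as np

def css_residual(X: np.ndarray, cols: list[int]) -> float:
    """Frobenius-norm error ||X - P_S X||_F of approximating X by its
    projection onto the span of the selected columns X[:, cols]."""
    C = X[:, cols]                                   # selected columns
    # Least-squares coefficients expressing X in terms of the selected columns.
    coeffs, *_ = np.linalg.lstsq(C, X, rcond=None)
    return float(np.linalg.norm(X - C @ coeffs, "fro"))

# Toy comparison of two equally sized candidate subsets.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
print(css_residual(X, [0, 1, 2]), css_residual(X, [3, 7, 9]))
```

Exhaustively minimizing this residual over all size-$k$ subsets is exactly the combinatorially hard problem discussed in the next section; the sketch only scores a given candidate subset.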
2. Computational Hardness and Theoretical Guarantees
Most minimal sufficient subset selection problems are NP-hard:
- Exact Subset Selection: Minimizing subset cardinality for a fixed error, or minimizing error under a fixed size, is NP-hard for linear regression, CSS, and Gaussian Markov fields (Shitov, 2017, Pia et al., 2018, Mahalanabis et al., 2012). For CSS, NP-completeness is proved by reduction from graph 3-coloring (Shitov, 2017).
- Feature Selection via Conditional Entropy: The combinatorial minimization of $H(Y \mid X_S)$ over feature subsets $S$ is NP-complete; efficient heuristics provide empirically correct selections with high probability (Romero et al., 31 Oct 2025). A plug-in estimator and greedy heuristic are sketched after the table below.
- Approximability: Submodular and supermodular objectives enable greedy algorithms with classical $(1 - 1/e)$ or $1/2$ approximation guarantees (Kaushal et al., 2018, Mahalanabis et al., 2012, Chen et al., 1 Apr 2025).
Table: Complexity and Guarantees for Key Problems
| Problem Type | NP-hard? | Greedy Approximation |
|---|---|---|
| Linear regression feature selection | Yes | No (in general) |
| Column subset selection (CSS) | Yes | No (in general) |
| GMRF subset observation | Yes | $1 - 1/e$ (GFFs) |
| Facility-location/data subset selection | Yes | $1 - 1/e$ |
| Disparity-min/data diversity | Yes | $1/2$ |
| Submodular-based attribution/causal sets | Yes | $1 - 1/e$ |
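As referenced in the conditional-entropy item above, a plug-in estimate of $H(Y \mid X_S)$ for discrete data, combined with a forward-greedy heuristic, can be sketched as follows (an illustrative sketch, not the confidence-bound procedure of Romero et al.; the function names are hypothetical).

```python
import numpy as np
from collections import Counter

def conditional_entropy(y: np.ndarray, X_S: np.ndarray) -> float:
    """Plug-in estimate of H(Y | X_S) in bits; X_S holds the selected
    (discrete) feature columns, one row per sample."""
    n = len(y)
    h = 0.0
    for x, n_x in Counter(map(tuple, X_S)).items():       # empirical P(X_S = x)
        mask = np.all(X_S == np.array(x), axis=1)
        counts = np.array(list(Counter(y[mask]).values()), dtype=float)
        p = counts / n_x                                   # empirical P(Y | X_S = x)
        h += (n_x / n) * (-np.sum(p * np.log2(p)))
    return h

def greedy_select(y: np.ndarray, X: np.ndarray, k: int) -> list[int]:
    """Forward-greedy heuristic: repeatedly add the feature whose inclusion
    most reduces the estimated conditional entropy."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining,
                   key=lambda j: conditional_entropy(y, X[:, selected + [j]]))
        selected.append(best)
        remaining.remove(best)
    return selected
```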
3. Submodular and Supermodular Optimization Strategies
Many modern frameworks leverage submodularity or supermodularity for efficient approximate selection:
- Facility-Location: Submodular and monotone, suited for selecting representative (coverage-oriented) subsets. Greedy maximization guarantees $(1 - 1/e)$-optimality (Kaushal et al., 2018, Kaushal et al., 2019).
- Disparity-Min (Dispersion): Not fully submodular, but greedy construction achieves $1/2$-approximation. Promotes diversity by maximizing the minimum pairwise distance (Kaushal et al., 2018, Kaushal et al., 2019).
- Supermodular Minimization in GMRFs: The prediction error function in GFFs is supermodular and nonincreasing, enabling greedy $(1 - 1/e)$-approximations (Mahalanabis et al., 2012).
- Submodular Attribution/LiMA: LiMA constructs a composite monotone submodular function over input regions, allowing bidirectional greedy ranking and efficient interpretability with performance bounds (Chen et al., 1 Apr 2025). Counterfactual LIMA further extends this to counterfactual, minimal sufficient cause identification (Chen et al., 15 Nov 2025).
Practical optimizations deploy lazy greedy, heap-based memoization, or hybrid forward-removal algorithms to scale to large collections of examples or input regions.
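For concreteness, a lazy-greedy maximization of a facility-location objective might look like the following (a minimal sketch over a precomputed similarity matrix; the names and structure are illustrative rather than the cited implementations).

```python
import heapq
import numpy as np

def fl_gain(sim: np.ndarray, selected: list[int], j: int) -> float:
    """Marginal gain of adding item j to the facility-location objective
    f(S) = sum_i max_{s in S} sim[i, s], with f(empty set) = 0."""
    if not selected:
        return float(sim[:, j].sum())
    covered = sim[:, selected].max(axis=1)
    return float(np.maximum(covered, sim[:, j]).sum() - covered.sum())

def lazy_greedy(sim: np.ndarray, k: int) -> list[int]:
    """Lazy greedy: keep a max-heap of (possibly stale) marginal gains;
    by submodularity, stale gains only overestimate, so the first popped
    entry whose gain is fresh for the current round is the true argmax."""
    n = sim.shape[1]
    selected: list[int] = []
    heap = [(-fl_gain(sim, [], j), j, 0) for j in range(n)]   # (neg gain, item, round)
    heapq.heapify(heap)
    for t in range(k):
        while True:
            neg_gain, j, stamp = heapq.heappop(heap)
            if stamp == t:                      # gain computed w.r.t. the current S
                selected.append(j)
                break
            heapq.heappush(heap, (-fl_gain(sim, selected, j), j, t))
    return selected

# Toy usage: select 5 representatives from 100 items with random similarities.
rng = np.random.default_rng(0)
sim = rng.random((100, 100))
print(lazy_greedy(sim, 5))
```

The heap avoids re-evaluating every candidate in every round: most stale upper bounds never reach the top again, which is the main source of the speedups reported for lazy-greedy variants.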
4. Specialized Algorithms and Structural Exploitation
Several works address subclasses or structural cases to achieve tractability:
- Block-diagonal or Sparse Matrices: For regression problems with block-diagonal structure and fixed-size global effects, hyperplane arrangements and dynamic programming enable polynomial-time minimal subset selection over large data (Pia et al., 2018).
- Bounded Tree-width in GMRFs: When the dependency graph has bounded tree-width, message passing and ε-net discretization yield a fully polynomial-time approximation scheme (FPTAS) for budget or cover versions of the GMRF selection problem (Mahalanabis et al., 2012).
- Greedy, Swapping, and Genetic Algorithms: Greedy algorithms (e.g., Farahat’s, incremental selection) and population-based heuristics (genetic algorithms with repair and dedicated fitness) are commonly employed for practical optimization of the combinatorial objective in data and solution subset selection (Sood et al., 2023, Ishibuchi et al., 2020).
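The swap-based refinement mentioned in the last item can be sketched generically (an illustrative local search over fixed-size subsets, not the exact procedures of the cited papers):

```python
import itertools

def swap_refine(objective, selected, universe, max_passes=20):
    """Local search: exchange one selected item for one unselected item
    whenever the swap strictly lowers the objective to be minimized."""
    selected = set(selected)
    for _ in range(max_passes):
        current = objective(selected)
        improving = None
        for out_item, in_item in itertools.product(selected, set(universe) - selected):
            candidate = (selected - {out_item}) | {in_item}
            if objective(candidate) < current:
                improving = candidate
                break
        if improving is None:            # local optimum under single swaps
            return selected
        selected = improving
    return selected

# Toy usage: choose 3 integers from 0..9 whose sum is as close to 15 as possible.
print(swap_refine(lambda S: abs(sum(S) - 15), {0, 1, 2}, range(10)))
```

In practice such a loop is seeded with a greedy solution and paired with problem-specific objective evaluations (e.g., residual error for CSS or fitness functions in genetic variants).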
Problem-specific constraints, such as independence or block separability, rigorously circumscribe the class of tractable instances.
5. Empirical Performance and Selection Criteria
Empirical studies across applications characterize minimal sufficiency, often as the "elbow" or saturation point of accuracy-versus-budget (or error-versus-budget) trade-off curves:
- Data Subset Selection: As little as roughly $30\%$ of the original data, selected by facility-location, suffices for $k$-NN or neural models to come within a few percentage points (as little as $1\%$) of full-data accuracy; in active learning, minimal sufficient sets yield accuracy improvements of $2\%$ or more with reduced labeling effort (Kaushal et al., 2018, Kaushal et al., 2019).
- Attribution: Minimal sufficient input-region sets (computed via LiMA or Counterfactual LIMA) are, on average, roughly $30\%$ or more smaller than those of baseline methods at the same level of decision fidelity, improving insertion/deletion AUCs by roughly $36\%$ or more and substantially speeding up attribution (Chen et al., 1 Apr 2025, Chen et al., 15 Nov 2025).
- Decision Support in Multi-objective Optimization: Presenting decision makers with a minimal $k$-solution subset selected by minimizing expected loss (equivalently, IGD) provides dense “knee” coverage of the Pareto front, focusing attention on best-compromise solutions (Ishibuchi et al., 2020).
- Parameter-efficient Fine-tuning: Hessian-informed selection identifies a small fraction of the parameters as sufficient to attain, or even surpass, full fine-tuning performance, outperforming gradient-based and random PEFT strategies (Xu et al., 18 May 2025).
Selection of $k$ (the subset cardinality) is typically guided by plotting performance versus subset size and identifying a diminishing-returns elbow; statistical tests, cross-validation, or user-imposed budgets may further constrain this choice (Sood et al., 2023, Mahalanabis et al., 2012, Romero et al., 31 Oct 2025).
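One simple way to operationalize this diminishing-returns criterion (an illustrative heuristic, not a procedure from the cited papers) is to pick the subset size whose point lies farthest from the chord joining the endpoints of the performance-versus-size curve:

```python
import numpy as np

def elbow_k(sizes, scores):
    """Return the subset size at the 'elbow': the point with the largest
    perpendicular distance from the chord joining the first and last points.
    (Assumes the two axes are on comparable scales; rescale first if not.)"""
    x = np.asarray(sizes, dtype=float)
    y = np.asarray(scores, dtype=float)
    d = np.array([x[-1] - x[0], y[-1] - y[0]])
    d = d / np.linalg.norm(d)                        # unit chord direction
    rel = np.stack([x - x[0], y - y[0]], axis=1)     # points relative to the start
    dist = np.abs(rel[:, 0] * d[1] - rel[:, 1] * d[0])   # 2-D cross product
    return int(x[int(np.argmax(dist))])

# Toy curve where accuracy saturates around k = 30.
print(elbow_k([5, 10, 20, 30, 40, 60, 80],
              [0.60, 0.72, 0.83, 0.90, 0.91, 0.915, 0.92]))   # -> 30
```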
6. Theoretical and Algorithmic Limiting Cases
Fundamental limits persist across frameworks:
- Information-theoretic and Statistical Limitation: Even optimal algorithms under classic settings (e.g., PCSS model, subset-factor models) are limited by signal strength, sample-size-to-dimension ratio, and eigenvalue separation. Under reasonable spiked covariance regimes, high-probability consistent recovery of the true minimal sufficient subset is possible (Sood et al., 2023).
- Intractability: For feature selection, CSS, or GMRF observation, unless P = NP, no polynomial-time algorithm can guarantee discovery of the size-$k$ minimizer in the general setting. Structural assumptions (bounded degree, block-diagonal structure, or bounded tree-width) and sub/supermodularity are vital for feasible computation (Shitov, 2017, Pia et al., 2018, Mahalanabis et al., 2012).
- Approximation Thresholds: Nemhauser/Wolsey-type lower bounds (e.g., $1 - 1/e$) for submodular maximization, and $1/2$ for non-submodular diverse-set selection, delineate the best-known practical guarantees (Mahalanabis et al., 2012, Kaushal et al., 2018, Chen et al., 1 Apr 2025).
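For reference, the $1 - 1/e$ threshold for monotone submodular maximization follows from the standard greedy recursion (a textbook argument, stated here with $f(\emptyset) = 0$ and $S^*$ an optimal size-$k$ set):

$$f(S_{i+1}) - f(S_i) \;\ge\; \frac{1}{k}\bigl(f(S^*) - f(S_i)\bigr) \;\;\Longrightarrow\;\; f(S^*) - f(S_i) \;\le\; \Bigl(1 - \tfrac{1}{k}\Bigr)^{i} f(S^*),$$

so after $k$ greedy steps $f(S_k) \ge \bigl(1 - (1 - 1/k)^k\bigr)\, f(S^*) \ge (1 - 1/e)\, f(S^*)$.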
7. Extensions, Open Problems, and Current Directions
Recent work generalizes the concept to parameter selection, attribution, or solution subset selection:
- Integrating Counterfactual Explanations in Training: SS-CA uses counterfactual masks identified by minimal sufficient subsets to augment data, correcting overfitting to “shortcut” features, thereby improving robustness and generalization (Chen et al., 15 Nov 2025).
- Knapsack-based Multi-objective Selection: Hessian-informed methods map the subset selection trade-off to classical 0–1 knapsack, achieving Pareto-optimal frontiers between resource use and performance (Xu et al., 18 May 2025); a generic knapsack sketch appears after this list.
- Scalable Statistical Feature Selection: Confidence-guided entropy minimization with Cantelli bounds enables tractable, high-probability differential set recovery for independent, discrete variables (Romero et al., 31 Oct 2025).
- CSS in the Presence of Missing/Censored Data: CSS can be exactly performed using only (possibly partially observed) summary statistics, with provable consistency and extensions to hypothesis-testing-based selection of the subset size $k$ (Sood et al., 2023).
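The knapsack reduction referenced above can be made concrete with the textbook 0-1 dynamic program (a generic sketch with illustrative scores and integer costs, not the Hessian-based scoring of the cited work):

```python
def knapsack_select(scores, costs, budget):
    """0-1 knapsack DP: pick parameter groups maximizing total score subject
    to a total (integer) cost budget. Returns (selected indices, best score)."""
    n = len(scores)
    dp = [0.0] * (budget + 1)                      # dp[c] = best score with cost <= c
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for c in range(budget, costs[i] - 1, -1):  # reverse scan: each item used once
            if dp[c - costs[i]] + scores[i] > dp[c]:
                dp[c] = dp[c - costs[i]] + scores[i]
                keep[i][c] = True
    selected, c = [], budget                       # backtrack to recover the choice
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            selected.append(i)
            c -= costs[i]
    return list(reversed(selected)), dp[budget]

# Toy usage: four parameter groups, budget of 10 cost units.
print(knapsack_select(scores=[4.0, 5.5, 3.0, 7.0], costs=[3, 4, 2, 6], budget=10))
```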
Key open problems include extending theoretical guarantees to broader model classes, sharper bounds in high-dimensional or adversarial regimes, integration of interpretability with decision-theoretic optimality, and generalized algorithms handling mixed-type or weakly dependent data under tight computational budgets.