Optimal Subset Repair Problem

Updated 27 December 2025

The Optimal Subset Repair Problem is a fundamental task in data management and coding theory: find a maximum consistent subset by deleting as few elements as possible so that the integrity constraints are satisfied.
Simplification rules such as consensus FD, common-LHS, and LHS marriage yield efficient optimal repair in the tractable cases, while the remaining cases are NP-hard or APX-complete.
The problem spans applications from relational databases and graph structures to distributed storage, with extensions addressing weights, duplicates, and fairness constraints for practical implementations.

The Optimal Subset Repair Problem is a foundational task in data management, coding theory, and knowledge representation that seeks to restore consistency with respect to integrity constraints by minimally deleting elements—tuples in a relational table, nodes/edges in a data-graph, or symbols in a storage array. The formal goal is to find, among all possible subsets of the original structure, a consistent subset of maximum cardinality (or, equivalently, one that minimizes the deletion cost), subject to the relevant set of integrity constraints. The combinatorial and algorithmic structure of this problem is strongly determined by the nature and interaction of the constraints (e.g., functional dependencies, denial constraints, path or node constraints in graphs), the optional presence of weights, and additional fairness, preference, or representation conditions. Complexity exhibits sharp dichotomies—certain constraint families admit efficient optimal repair algorithms, while other cases are intrinsically APX-hard or worse.

1. Formal Definitions and Canonical Variants

The optimal subset repair (optimal S-repair) problem, in the relational model, is defined as follows. Let $R(A_1,\ldots,A_k)$ be a schema and $T$ a table associating each tuple-id $i$ with a tuple $T[i]\in Val^k$ and a weight $w_T(i)>0$. Given a set of integrity constraints (typically functional dependencies, or FDs) $\Delta=\{X\to Y\}$, a subset $S\subseteq T$ is said to be consistent if it satisfies all FDs in $\Delta$. The deletion-distance is $\mathrm{dist}_{sub}(S,T)=\sum_{i\in Ids(T)\setminus Ids(S)} w_T(i)$. An optimal S-repair $S^*$ is a consistent subset minimizing $\mathrm{dist}_{sub}(S,T)$ among all consistent subsets, i.e., one of largest total weight.
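The definitions above can be made concrete with a small brute-force sketch. It is only an illustration of consistency, $\mathrm{dist}_{sub}$, and optimality on a toy table, not an algorithm from the cited work: it enumerates all subsets and is therefore exponential. The function names and the example table are hypothetical.

```python
from itertools import combinations

def is_consistent(rows, fds):
    """True if the rows satisfy every FD; each FD is a (lhs_attrs, rhs_attrs) pair."""
    for lhs, rhs in fds:
        seen = {}
        for row in rows:
            key = tuple(row[a] for a in lhs)
            val = tuple(row[a] for a in rhs)
            # Two rows agreeing on lhs must agree on rhs.
            if seen.setdefault(key, val) != val:
                return False
    return True

def dist_sub(kept_ids, weights):
    """Total weight of the tuple-ids that were deleted."""
    return sum(w for i, w in weights.items() if i not in kept_ids)

def brute_force_s_repair(table, weights, fds):
    """Try every subset of tuple-ids (exponential; toy instances only)."""
    ids = sorted(table)
    best = None
    for size in range(len(ids) + 1):
        for kept in combinations(ids, size):
            if is_consistent([table[i] for i in kept], fds):
                cost = dist_sub(kept, weights)
                if best is None or cost < best[1]:
                    best = (set(kept), cost)
    return best

# Hypothetical instance over schema R(A, B) with the single FD A -> B.
table = {
    1: {"A": "a1", "B": "b1"},
    2: {"A": "a1", "B": "b2"},
    3: {"A": "a2", "B": "b1"},
    4: {"A": "a1", "B": "b1"},
}
weights = {1: 1, 2: 3, 3: 2, 4: 1}
fds = [(("A",), ("B",))]
repair, cost = brute_force_s_repair(table, weights, fds)
```

On this instance tuple 2 conflicts with tuples 1 and 4; keeping tuple 2 (weight 3) and deleting tuples 1 and 4 (total weight 2) is cheaper than the reverse, so the weights, not the tuple counts, decide the optimum.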

More general forms arise under denial constraints (DCs), for which conflicts are defined pairwise: a removal set must contain at least one tuple from each violating pair and should be minimal, and the optimization criterion may broaden from cardinality to maximizing a defined conformance score (Li et al., 20 Dec 2025).

In networked or graph-structured data (e.g., knowledge graphs), a subset repair is a maximal (w.r.t. inclusion) subgraph that satisfies all path/node constraints, such as those expressed in (positive fragments of) Reg-GXPath or GXPath logics (Abriola et al., 2022, Pardal et al., 14 Feb 2024). Here, optimality may refer to subgraph cardinality, total edge/vertex weights, or more complex preference structures.

In coding theory, particularly distributed storage, the "subset repair" abstraction refers to the recovery of lost symbols/nodes from any subset of the surviving nodes while minimizing communication (repair bandwidth). The "optimal" property is characterized via the cut-set bound: for an $(n,k)$ MDS code, repairing $h$ erasures from $d$ helpers requires a total download of at least $dh/(d+h-k)$ times the size of one node, that is, an $h/(d+h-k)$ fraction of each helper's contents on average (Ye et al., 2016, Tamo et al., 2018, Con et al., 2021).
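As a worked illustration of the cut-set bound, the sketch below evaluates it with exact rational arithmetic. It assumes the usual reading for centralized repair, in which $h/(d+h-k)$ is the fraction contributed per helper and the total download is $d$ times that; the function name and the parameter check are illustrative choices, not taken from the cited papers.

```python
from fractions import Fraction

def cutset_bound(k, d, h=1):
    """Return (per-helper fraction, total download in node sizes) at the cut-set bound.

    Assumes an MDS code with k data nodes, d >= k helpers and h >= 1 failed nodes.
    """
    if h < 1 or d < k:
        raise ValueError("expected h >= 1 and d >= k")
    per_helper = Fraction(h, d + h - k)
    return per_helper, d * per_helper

# Example parameters: a (14, 10) code with one failure and all 13 survivors helping.
single = cutset_bound(k=10, d=13, h=1)
# Same code with two failures and the 12 survivors helping.
double = cutset_bound(k=10, d=12, h=2)
```

For the single-failure case each helper sends a quarter of its contents, for a total of 13/4 node sizes, compared with the 10 node sizes read by naive repair that downloads $k$ full nodes.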

2. Algorithmic Techniques and Tractable Cases

For FD-based subset repair in relational databases, the OptSRepair algorithm (Livshits et al., 2017) identifies three main simplification rules:

Consensus FD: Remove all FDs of the form $\emptyset \to A$, iteratively fixing the attribute to its most frequent (or, with weights, heaviest) value.
Common-LHS: If some attribute appears in the LHS of every FD, partition the table by the values of that attribute and recurse on each block.
LHS Marriage: If two distinct LHSs have the same closure and the LHS of every FD contains at least one of them, decompose into bipartite matching plus recursive calls.

If none of these rules applies to a nontrivial FD set, the problem is APX-complete. When repeated application reduces the FD set to a trivial one, the rules yield a polynomial-time optimal repair computation, even with weights and duplicates.
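For the simplest tractable case, a single FD $X \to Y$, the common-LHS and consensus rules collapse into a two-level grouping. The sketch below illustrates only that special case under this assumption; it is not the full OptSRepair procedure and omits LHS marriage entirely. The function name and data are hypothetical.

```python
from collections import defaultdict

def repair_single_fd(table, weights, lhs, rhs):
    """Optimal S-repair for one FD lhs -> rhs.

    Common-LHS step: partition tuples by their lhs value.
    Consensus step: inside each block keep the rhs value of largest total weight.
    """
    blocks = defaultdict(lambda: defaultdict(list))
    for tid, row in table.items():
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        blocks[x][y].append(tid)
    kept = set()
    for groups in blocks.values():
        heaviest = max(groups.values(),
                       key=lambda ids: sum(weights[i] for i in ids))
        kept.update(heaviest)
    return kept

# Hypothetical instance over R(A, B) with the FD A -> B.
table = {
    1: {"A": "a1", "B": "b1"},
    2: {"A": "a1", "B": "b2"},
    3: {"A": "a2", "B": "b1"},
    4: {"A": "a1", "B": "b1"},
}
weights = {1: 1, 2: 3, 3: 2, 4: 1}
kept = repair_single_fd(table, weights, lhs=("A",), rhs=("B",))
```

Blocks are independent because tuples with different $X$ values can never conflict on $X \to Y$, which is why each block can be solved separately and the results unioned.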

Graph and knowledge-base subset repair under positive node constraints (e.g., positive GXPath node expressions) also admits polynomial-time "peeling" algorithms based on monotonicity. One iteratively removes all nodes/edges violating any positive constraint until none remains. Because positive expressions are monotone, an element that violates a constraint in the current graph also violates it in every subgraph, so every deletion is forced and deleting further elements can never repair an existing violation (Abriola et al., 2022, Pardal et al., 14 Feb 2024).
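The peeling idea can be sketched on a deliberately simplified constraint. The example below assumes a single positive node constraint, "every node has an outgoing edge with a given label", as a stand-in for a positive GXPath node expression; the graph encoding and the function name are hypothetical.

```python
def peel(nodes, edges, required_label):
    """Delete violating nodes (and their incident edges) until a fixpoint.

    nodes: iterable of node ids; edges: iterable of (src, label, dst) triples.
    Constraint (assumed for illustration): each node needs an outgoing
    edge labelled `required_label`.
    """
    nodes, edges = set(nodes), set(edges)
    while True:
        satisfied = {s for s, lab, _ in edges if lab == required_label}
        violating = nodes - satisfied
        if not violating:
            return nodes, edges
        nodes -= violating
        # Removing a node removes its incident edges, which may expose new violations.
        edges = {(s, lab, t) for s, lab, t in edges if s in nodes and t in nodes}

# Hypothetical graph: a <-> b cycle, c -> a, and a dead-end chain d -> e.
nodes = {"a", "b", "c", "d", "e"}
edges = {("a", "r", "b"), ("b", "r", "a"), ("c", "r", "a"), ("d", "r", "e")}
kept_nodes, kept_edges = peel(nodes, edges, "r")
```

Here `e` is removed first because it has no outgoing edge, which removes the edge from `d` and forces `d` out in the next round; the cycle and `c` survive.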

In coding theory, explicit constructions of MDS array codes and Reed-Solomon codes achieving the optimal cut-set bound are available for wide parameter regimes. For array codes, high-rate explicit families are constructed so that any $h\leq r$ failed nodes can be exactly repaired from any $d$ helpers, with repair bandwidth meeting the cut-set bound for all parameters. This is achieved by sophisticated use of block structure, Vandermonde matrices, and algebraic symmetries (Ye et al., 2016). For RS codes, both linear (using super-exponential sub-packetization) and nonlinear (additive combinatorics-based) repair schemes have been proposed to meet or asymptotically approach the cut-set lower bound (Tamo et al., 2018, Con et al., 2021).

3. Complexity Landscape and Dichotomy Theorems

A comprehensive dichotomy is established for S-repair under FDs: OSRSucceeds($\Delta$) returns true if and only if repeated application of the above simplifications reduces $\Delta$ to a trivial FD set, and in this case, an optimal S-repair is polynomial-time computable for any instance. Otherwise, the problem is APX-complete and cannot be approximated within a constant factor better than 17/16 (with sharper bounds known for certain FD patterns), barring P=NP (Livshits et al., 2017, Miao et al., 2020). The classification is robust to weighted tuples and duplicates.
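The tractability test can be sketched as a loop over the three simplification rules from Section 2. This is a reading of those rules as described there, not a verified transcription of the published pseudocode; attribute sets are modelled as Python sets and all function names are hypothetical.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of `attrs` under the FDs."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return frozenset(closed)

def find_lhs_marriage(fds):
    """Return X1 | X2 for two distinct LHSs with equal closure covering every FD."""
    lhss = sorted({lhs for lhs, _ in fds}, key=sorted)
    for x1, x2 in combinations(lhss, 2):
        same_closure = closure(x1, fds) == closure(x2, fds)
        covers = all(x1 <= lhs or x2 <= lhs for lhs, _ in fds)
        if same_closure and covers:
            return x1 | x2
    return None

def osr_succeeds(fds):
    """True if the simplification rules reduce the FD set to a trivial one."""
    fds = [(frozenset(l), frozenset(r)) for l, r in fds]
    while True:
        fds = [(l, r) for l, r in fds if not r <= l]   # drop trivial FDs
        if not fds:
            return True
        common = frozenset.intersection(*(l for l, _ in fds))
        consensus = [r for l, r in fds if not l]
        if common:                                      # common-LHS rule
            drop = frozenset([min(common)])
        elif consensus:                                 # consensus rule
            drop = consensus[0]
        else:                                           # LHS-marriage rule
            drop = find_lhs_marriage(fds)
            if drop is None:
                return False
        fds = [(l - drop, r - drop) for l, r in fds]

tractable = osr_succeeds([({"A"}, {"B"}), ({"B"}, {"A"})])   # A <-> B
hard = osr_succeeds([({"A"}, {"B"}), ({"B"}, {"C"})])        # A -> B -> C
```

Each round removes at least one attribute from the FD set, so the loop terminates after at most as many rounds as there are attributes.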

For subset repairs under Reg-GXPath constraints in graph data, polynomial-time algorithms exist for the positive fragment (monotonicity allows greedy peeling), but NP-completeness emerges as soon as one allows positive path constraints or arbitrary node constraints (Abriola et al., 2022, Pardal et al., 14 Feb 2024). Undecidability manifests if unrestricted superset repairs or complements are allowed.

Table: Complexity for FD-based S-repair (Livshits et al., 2017)

FD Class	Complexity	Comment
Trivial FDs	polytime	No deletions needed
Consensus FD	polytime	Max-weight block
Common-LHS	polytime	Partitioning
LHS-marriage	polytime	Bipartite matching
Reducible to above	polytime	Complete reduction possible
All others	APX-complete	NP-hard to approximate

4. Extensions: Weighted, Duplicate, and Fair Repairs

Weighted versions allow each tuple or edge to have a deletion cost; all the decomposition and recursion steps above are weight-respecting (Livshits et al., 2017). The presence of duplicate tuples or edges does not affect algorithmic soundness or complexity class.

Recent work has added representation constraints (RC)—for example, enforcing that the repaired output must maintain specified lower bounds (or exact fractions) for sensitive subpopulations. The complexity increases sharply: even for a single nontrivial exact fraction RC and a trivial FD set, optimal S-repair is NP-hard (Liu et al., 21 Oct 2024). However, for LHS-chain FDs and bounded domain, dynamic programming and candidate set reduction yield polynomial-time RS-repair algorithms. ILP-based formulations and practical heuristics (e.g., FDcleanser, LP+GreedyRounding) are applied for general settings to trade optimality for scalability (Liu et al., 21 Oct 2024).
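The effect of a representation constraint can be seen on a toy instance. The sketch below is a brute-force illustration assuming an unweighted table and a single lower-bound constraint of the form "at least a given share of the kept tuples has a given attribute value"; it is exponential and is not one of the cited algorithms. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def satisfies_fds(rows, fds):
    """True if rows agreeing on an FD's lhs also agree on its rhs."""
    for lhs, rhs in fds:
        seen = {}
        for row in rows:
            key = tuple(row[a] for a in lhs)
            val = tuple(row[a] for a in rhs)
            if seen.setdefault(key, val) != val:
                return False
    return True

def meets_lower_bound(rows, attr, value, min_share):
    """Lower-bound representation constraint; the empty set satisfies it vacuously."""
    if not rows:
        return True
    hits = sum(1 for r in rows if r[attr] == value)
    return Fraction(hits, len(rows)) >= min_share

def brute_force_rc_repair(table, fds, attr, value, min_share):
    """Largest consistent subset meeting the constraint (exponential search)."""
    ids = sorted(table)
    for size in range(len(ids), -1, -1):
        for kept in combinations(ids, size):
            rows = [table[i] for i in kept]
            if satisfies_fds(rows, fds) and meets_lower_bound(rows, attr, value, min_share):
                return set(kept)
    return set()

# Hypothetical table over R(A, B, G) with FD A -> B and sensitive attribute G.
table = {
    1: {"A": "a1", "B": "b1", "G": "f"},
    2: {"A": "a1", "B": "b2", "G": "m"},
    3: {"A": "a1", "B": "b2", "G": "m"},
    4: {"A": "a2", "B": "b1", "G": "m"},
}
fds = [(("A",), ("B",))]
plain = brute_force_rc_repair(table, fds, "G", "f", Fraction(0))
fair = brute_force_rc_repair(table, fds, "G", "f", Fraction(1, 3))
```

Without the constraint the largest consistent subset drops the only tuple of group `f`; requiring a one-third share forces that tuple to be kept at the price of one extra deletion.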

5. Optimal Subset Repair in Graph and Code Domains

For graph databases, subset repair generalizes to selecting maximal subgraphs (vertices and/or labeled edges) satisfying integrity constraints expressible in GXPath or Reg-GXPath logics. Under positive node constraints, the unique maximal repair is computable in PTIME; for more expressive constraint classes or preferences (e.g., cardinality, weighted, lexicographic), the existence and search problems are NP-complete or FNP-complete, while consistent query answering rises to $\Theta^p_2$-complete or $\Pi^p_2$-complete (Pardal et al., 14 Feb 2024). For prioritized, weighted, multiset, or cardinality-based optimizations, repair-finding algorithms rely either on preference-aware greedy procedures or on binary search with an NP oracle.

In coding theory, the "subset repair" property is realized by achieving the information-theoretic optimal repair bandwidth for any repair set (subset of failed nodes and helpers). Explicit high-rate MDS array code constructions achieve this universally over all $h$ and $d$, and transformations exist to yield (near-)optimal repair while preserving field size and sub-packetization to within established lower bounds (Ye et al., 2016, Li et al., 2016, Tamo et al., 2018). Nonlinear repair schemes further exploit arithmetic progressions and sumset combinatorics to match or even slightly outperform linear schemes, especially over prime fields (Con et al., 2021).

6. Approximation Algorithms, Estimation, and Experimental Insights

When the subset repair task is APX-hard or intractable, LP relaxations, clique-based rounding, and probabilistic algorithms provide practical or provably good approximate solutions:

For FD S-repair, relaxed vertex cover LPs and triad elimination yield $(2-0.5^{\sigma-1})$-approximations, improving with bounded determinant-class structure or k-quasi-Turán properties (Miao et al., 2020).
For subset repair with dependency models (denial constraints), an exact ILP is complemented by a clique-constraint LP-based algorithm with approximation ratio $\eta(n-m)/n$ ($\eta$ being the score ratio), and a randomized removal strategy with expected guarantee $(\eta/2)^{2V+1}$, where $V$ is the maximum degree of the conflict graph (Li et al., 20 Dec 2025).
Sublinear, sampling-based stochastic estimators for inconsistency degree enable scalable "repair cost" estimation for arbitrary subset queries, with analytically bounded error and practical performance on large datasets (Miao et al., 2020).
Empirical studies demonstrate that advanced approximation and probabilistic algorithms consistently outperform baseline value-frequency-based repairs in F1 error identification, scalability, and downstream predictive performance (Li et al., 20 Dec 2025, Liu et al., 21 Oct 2024).
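The connection to vertex cover can be illustrated with the textbook maximal-matching heuristic, which is swapped in here for the LP-based methods cited above because it fits in a few lines. It builds the conflict graph of an unweighted table under FDs and deletes both endpoints of each matched conflict, so the number of deletions is at most twice the optimum. Function names and data are hypothetical.

```python
from itertools import combinations

def conflict_edges(table, fds):
    """Pairs of tuple-ids that jointly violate some FD (lhs equal, rhs different)."""
    edges = []
    for i, j in combinations(sorted(table), 2):
        for lhs, rhs in fds:
            same_lhs = all(table[i][a] == table[j][a] for a in lhs)
            diff_rhs = any(table[i][a] != table[j][a] for a in rhs)
            if same_lhs and diff_rhs:
                edges.append((i, j))
                break
    return edges

def matching_based_repair(table, fds):
    """Greedy maximal matching on the conflict graph; delete matched endpoints."""
    deleted = set()
    for i, j in conflict_edges(table, fds):
        if i not in deleted and j not in deleted:
            deleted.update((i, j))
    return set(table) - deleted, deleted

# Hypothetical instance over R(A, B) with FD A -> B; tuple 2 conflicts with 1 and 3.
table = {
    1: {"A": "a1", "B": "b1"},
    2: {"A": "a1", "B": "b2"},
    3: {"A": "a1", "B": "b1"},
}
fds = [(("A",), ("B",))]
kept, deleted = matching_based_repair(table, fds)
```

On this instance the heuristic deletes tuples 1 and 2 although deleting tuple 2 alone would suffice, which shows the factor of two being attained; the guarantee is on deletion cost and holds only in the unweighted setting.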

7. Connections, Practical Considerations, and Open Problems

Optimal Subset Repair theory connects to vertex cover, hypergraph matching, code design, and complexity dichotomies. Practical implementations require adaptation to settings with constraints on fairness (representation), evolving integrity conditions, and application-specific distance functions. Many open questions remain on tightening approximation ratios (closing the 17/16 gap), extending dichotomies to richer constraint languages (e.g., inclusion dependencies), building dynamic or streaming repair estimators, and developing optimal heuristics for high-dimensional, constraint-dense or representation-critical domains (Miao et al., 2020, Liu et al., 21 Oct 2024).

In distributed storage, ongoing work seeks to reduce field size and sub-packetization complexity in optimal repair code constructions, and to extend optimal subset repair properties to locally repairable codes with multiple alternate repair paths (Ye et al., 2016, Tamo et al., 2018). In fairness-aware data repair, balancing representativity and information loss remains a central computational and societal challenge (Liu et al., 21 Oct 2024).