Unweighted Influence Data Subsampling
- UIDS is a data reduction technique that leverages influence functions to select subsamples with maximal statistical fidelity while enabling unweighted training.
- The methodology employs strategies such as greedy batch selection, probabilistic subsampling, and pseudo-observation construction to enhance efficiency and reduce model risk.
- UIDS offers rigorous theoretical guarantees—risk improvement, distributional robustness, and asymptotic efficiency—demonstrating superior performance over random and leverage-based methods.
Unweighted Influence Data Subsampling (UIDS) is a principled data reduction methodology that leverages influence functions to select subsamples providing maximal statistical fidelity to the full dataset, under the constraint that final training is unweighted. UIDS is designed to substantially reduce the computational burden of large-scale model fitting and model selection, to outperform random and leverage-based sampling, and in certain cases—under well-characterized conditions—even to surpass full-data ERM performance on external validation (Wang et al., 2019; Raj et al., 2020; Ting et al., 2017).
1. Influence Functions: Formal Definitions and Roles
UIDS is fundamentally predicated on influence functions, which quantify the infinitesimal effect of individual data points on model estimators. Let $Z = \{z_1, \dots, z_n\}$ be a dataset, and let $\hat\theta = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n \ell(z_i, \theta)$ denote the empirical risk minimizer. The influence of a training point $z$ is captured via the Bouligand Influence Function (BIF):

$$\psi(z) = \lim_{\epsilon \to 0} \frac{\hat\theta_{\epsilon, z} - \hat\theta}{\epsilon},$$

where $\hat\theta_{\epsilon, z}$ minimizes the empirical risk with the weight of $z$ perturbed by $\epsilon$.
Under suitable smoothness (twice-differentiable loss, non-degenerate Hessian), one obtains for M-estimators:

$$\psi(z) = -H_{\hat\theta}^{-1} \nabla_\theta \ell(z, \hat\theta),$$

where $H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^n \nabla_\theta^2 \ell(z_i, \hat\theta)$ is the empirical-risk Hessian at $\hat\theta$ (Raj et al., 2020).
Further refinement yields test-risk influence functions, describing the effect of infinitesimal up-weighting of $z$ on an arbitrary future risk $R_{Q'}$—including direct test-set ($Q'$) loss (Wang et al., 2019). The total influence of $z$ on $Q'$ is

$$\phi(z) = \sum_{v \in Q'} \nabla_\theta \ell(v, \hat\theta)^\top \psi(z) = -\sum_{v \in Q'} \nabla_\theta \ell(v, \hat\theta)^\top H_{\hat\theta}^{-1} \nabla_\theta \ell(z, \hat\theta).$$
This influence summary enables subsample selection based on predictions of model performance shifts.
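Under these definitions, the total influence $\phi(z_i)$ has a closed form for simple models. Below is a minimal numpy sketch for L2-regularized logistic regression; the function name, signature, and defaults are illustrative, not taken from the cited papers:

```python
import numpy as np

def total_influences(X, y, X_val, y_val, theta, lam=1e-2):
    """phi_i = g_val^T psi(z_i), with psi(z_i) = -H^{-1} grad_i, for
    L2-regularized logistic regression with labels y in {0, 1}."""
    n, p = X.shape
    s = 1.0 / (1.0 + np.exp(-X @ theta))            # sigmoid(X @ theta)
    grads = (s - y)[:, None] * X + lam * theta / n  # per-point loss gradients
    # Empirical-risk Hessian: (1/n) X^T diag(s(1-s)) X + lam I
    H = (X * (s * (1 - s))[:, None]).T @ X / n + lam * np.eye(p)
    s_val = 1.0 / (1.0 + np.exp(-X_val @ theta))
    g_val = (s_val - y_val) @ X_val                 # summed validation gradient
    return -grads @ np.linalg.solve(H, g_val)       # one shared linear solve
```

A positive $\phi_i$ predicts that up-weighting $z_i$ raises validation loss, marking it as a candidate for down-sampling.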
2. UIDS Algorithmic Frameworks and Sampling Criteria
UIDS is instantiated via several algorithmic paradigms arising from the influence analysis. These include:
- Greedy Batch Selection (Validation-Loss-Guided): At each epoch, compute influence scores $\Delta(z)$, predicting the decrease in validation loss from adding point $z$. Batch the highest-$\Delta(z)$ points into the subsample. Optionally, $\epsilon$-greedy heuristics mix in random selections to mitigate overfitting (Raj et al., 2020).
```
# Simplified pseudocode: see [2010.10218] for detailed steps
for t in range(T):
    fit θ_S via ERM on S
    compute H_S and its inverse
    precompute grad = ∑_{v∈V} ∇θ ℓ(v, θ_S)
    for z in Z ∖ S:
        score Δ(z) = −gradᵀ (H_S⁻¹ ∇θ ℓ(z, θ_S))
    add top-m Δ(z) points to S
```
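For concreteness, the loop above can be instantiated end-to-end for ridge regression, where the ERM step has a closed form. This is a hedged sketch, not the paper's implementation; function name, defaults, and the ridge setting are illustrative:

```python
import numpy as np

def greedy_uids(X, y, X_val, y_val, m=5, rounds=3, lam=1e-1, seed_size=10):
    """Greedy batch UIDS sketch for ridge regression (closed-form ERM)."""
    n, p = X.shape
    S = list(range(seed_size))                      # initial seed subsample
    for _ in range(rounds):
        Xs, ys = X[S], y[S]
        H = Xs.T @ Xs / len(S) + lam * np.eye(p)    # empirical-risk Hessian
        theta = np.linalg.solve(H, Xs.T @ ys / len(S))
        g_val = X_val.T @ (X_val @ theta - y_val)   # summed validation gradient
        rest = [i for i in range(n) if i not in S]
        grads = (X[rest] @ theta - y[rest])[:, None] * X[rest]
        # Score mirrors the pseudocode: Δ(z) = −gradᵀ (H⁻¹ ∇ℓ(z, θ_S))
        delta = -grads @ np.linalg.solve(H, g_val)
        S += [rest[j] for j in np.argsort(delta)[::-1][:m]]  # top-m points
    return S, theta
```

Each round refits on the current subsample, scores all remaining points with one shared Hessian solve, and adds a batch of the top-scoring points.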
- Probabilistic Subsampling (Smooth Influence Map): Compute per-point influences $\phi_i$ and derive smooth inclusion probabilities $\pi_i$ using decreasing maps (e.g., linear or sigmoid). Subsample independently, then retrain on $S$ unweighted (Wang et al., 2019).
```
# Key steps for smooth probabilistic UIDS
compute full-set ERM θ̂
for each i:
    φ_i ← total influence of z_i
    π_i ← π(φ_i)   # e.g. π(φ_i) = clip(1 − α·φ_i, 0, 1), or a sigmoid map
draw O_i ∼ Bernoulli(π_i) independently; S = {i : O_i = 1}
retrain θ̃ on S (unweighted)
```
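A sigmoid sampling map of this kind is a one-liner. The sketch below (function name and the parameters `a`, `b` are illustrative) keeps low-influence points with high probability and harmful high-influence points with low probability:

```python
import numpy as np

def sigmoid_sampler(phi, a=1.0, b=0.0, rng=None):
    """Sig-UIDS-style inclusion probabilities: decreasing in influence phi,
    so points predicted to raise test risk are kept with lower probability."""
    rng = rng or np.random.default_rng()
    pi = 1.0 / (1.0 + np.exp(a * phi + b))   # decreasing sigmoid map
    keep = rng.random(phi.shape[0]) < pi     # independent Bernoulli draws
    return np.flatnonzero(keep), pi
```

The hyperparameters `a` and `b` control how aggressively high-influence points are dropped; the smoothness of the map is what the robustness theory in Section 3 relies on.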
- Pseudo-Observation Construction (Asymptotically Linear Estimators): For M-estimators and their generalizations, compute influence vectors $\psi(z_i)$ at a pilot estimate, subsample with probabilities $\pi_i$ proportional to the influence norms, and correct each subsampled instance by rescaling its influence as $\psi(z_i)/\pi_i$. The final estimator averages these pseudo-influences unweighted (Ting et al., 2017).
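The Horvitz-Thompson equivalence underlying the pseudo-observation construction can be checked numerically: an unweighted average of rescaled influences $\psi_i/\pi_i$ over a Bernoulli (Poisson) subsample is unbiased for the full-data influence average. A self-contained simulation with synthetic $\psi_i$ and illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3
psi = rng.normal(size=(n, p))                        # synthetic influence vectors
norms = np.linalg.norm(psi, axis=1)
pi = np.clip(0.3 * norms / norms.mean(), 0.01, 1.0)  # norm-proportional probs

full_avg = psi.mean(axis=0)                          # full-data influence average
reps = []
for _ in range(2000):
    keep = rng.random(n) < pi                        # independent Bernoulli draws
    # Unweighted average of pseudo-influences psi_i / pi_i on the subsample
    reps.append((psi[keep] / pi[keep, None]).sum(axis=0) / n)
assert np.allclose(np.mean(reps, axis=0), full_avg, atol=0.02)
```

Averaging over repeated draws recovers the full-data quantity, which is the unbiasedness property the asymptotic-efficiency results build on.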
UIDS requires only unweighted training on the selected subset or pseudo-sample, facilitating downstream software compatibility and enabling direct empirical comparisons.
3. Theory: Risk, Robustness, and Performance Bounds
UIDS enjoys rigorous theoretical guarantees substantiated by the following results:
- Risk Improvement Over Full Data: Choosing subsamples via influences can yield models that outperform the full-data ERM on external risk $R_{Q'}$, provided the sampling perturbations are negatively correlated with the influence scores. Lemma (informal): if the weight perturbations $\delta_i$ induced by subsampling satisfy $\sum_i \delta_i \phi_i < 0$, then $R_{Q'}(\tilde\theta) < R_{Q'}(\hat\theta)$ to first order (Wang et al., 2019).
- Distributional Robustness: By formulating the worst-case risk over a divergence ball around the empirical distribution, probabilistic UIDS shows that the worst-case out-of-sample risk is Lipschitz in the influence weights (Theorems 3 and 4 in Wang et al., 2019). Smoothed sampling maps guarantee limited parameter drift and guard against overfitting on validation splits.
- Asymptotic Efficiency: For asymptotically linear estimators, the UIDS estimator achieves the minimal possible variance among all Poisson subsampling schemes of fixed expected size, matching the Horvitz-Thompson correction while avoiding weighted fitting in practice (Ting et al., 2017).
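The risk-improvement lemma above follows from a first-order expansion of the external risk in the sample weights. A sketch in the notation of Section 1, with $\delta_i$ denoting the weight perturbation of point $z_i$:

$$\tilde\theta - \hat\theta \;\approx\; \sum_{i=1}^n \delta_i\, \psi(z_i) \;=\; -\sum_{i=1}^n \delta_i\, H_{\hat\theta}^{-1} \nabla_\theta \ell(z_i, \hat\theta),$$

$$R_{Q'}(\tilde\theta) \;\approx\; R_{Q'}(\hat\theta) + \nabla_\theta R_{Q'}(\hat\theta)^\top (\tilde\theta - \hat\theta) \;=\; R_{Q'}(\hat\theta) + \sum_{i=1}^n \delta_i\, \phi_i,$$

so any perturbation with $\sum_i \delta_i \phi_i < 0$ lowers the external risk to first order.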
In summary, provided the influences vary across data points, UIDS strictly outperforms or matches random sampling and leverage-score procedures in estimation accuracy and risk reduction.
4. Computational Complexity, Implementation, and Constraints
UIDS has characteristic computational and memory profiles:
- Greedy batch UIDS: The dominant cost is Hessian inversion ($O(p^3)$ worst case in the parameter dimension $p$). Practical implementations leverage Hessian-vector products and conjugate gradients, where each Hessian-vector product costs $O(np)$ and explicit inversion is avoided. Batch selection and gradient computation cost $O(np)$ per round. For large $p$, matrix sketching or Hessian-free approaches are recommended (Raj et al., 2020).
- Smooth probabilistic sampling: The influence-vector computation uses preconditioned CG to solve $H_{\hat\theta}\, x = \nabla_\theta R_{Q'}(\hat\theta)$, scaling as $O(rnp)$ for $r$ CG steps, plus the cost of retraining on the subset $S$ (Wang et al., 2019).
- Pseudo-observation construction: Pilot estimation, influence computation, and unweighted averaging scale linearly in the sample size $n$ in typical regression settings, with modest additional memory for storing influence values and model parameters (Ting et al., 2017).
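The Hessian-free strategy mentioned above can be sketched with a scipy `LinearOperator`: the system $H x = g$ is solved by conjugate gradients using only $O(np)$ Hessian-vector products, never materializing the $p \times p$ Hessian. Function and parameter names here are illustrative:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_hessian_system(X, w, g_val, lam=1e-2):
    """Solve H x = g_val by CG using only Hessian-vector products."""
    n, p = X.shape

    def hvp(v):
        # H v = (1/n) X^T diag(w) X v + lam v, e.g. w = s(1-s) for logistic loss
        return X.T @ (w * (X @ v)) / n + lam * v

    H_op = LinearOperator((p, p), matvec=hvp)  # implicit Hessian
    x, info = cg(H_op, g_val)
    assert info == 0, "CG did not converge"
    return x
```

Since the solution vector is shared across all per-point scores, one CG solve per round suffices, after which each score is a single $O(p)$ dot product.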
UIDS is generally applicable to differentiable models (logistic/linear regression, generalized linear models, etc.). For non-differentiable models (e.g., tree ensembles), influence-based subsampling using differentiable surrogates allows for effective model transfer (Raj et al., 2020).
UIDS performance is sensitive to:
- Initial seed size (large enough to keep the Hessian well-conditioned),
- Batch size (smaller batches give finer control),
- Validation set composition,
- Influence variability across the data.
For highly non-convex losses (deep nets), Hessian invertibility and local linearity assumptions may be violated, limiting applicability.
5. Empirical Results and Benchmark Comparisons
Experimental evaluations demonstrate that UIDS consistently achieves higher accuracy or lower error using smaller data subsets compared to uniform random sampling, leverage-based subsampling, and weighted influence methods.
Key empirical findings:
| Dataset | Metric | Result (UIDS vs. random) |
|---|---|---|
| Amazon (LR) | Accuracy | 0.90 with ≈500 points (UIDS) vs. >1000 points needed (random) (Raj et al., 2020) |
| MNIST (LR) | Time to target accuracy | >10% speedup for UIDS (Raj et al., 2020) |
| California Housing | RMSE | 1.5 (UIDS) vs. 2.0 (random) (Raj et al., 2020) |
| UCI Breast | Out-of-sample log-loss | 0.0803 (Sig-UIDS) vs. 0.0914 (full data) (Wang et al., 2019) |
Results on large-scale industrial datasets confirm practical feasibility, with influence computations completed in minutes on standard hardware (Wang et al., 2019).
Dropout-style hard selection (deterministically keeping or dropping each point by thresholding its influence) can lead to overfitting and large parameter shifts, degrading out-of-sample performance. Smooth probabilistic selection (Sig-UIDS, Lin-UIDS) maintains stability and generalization (Wang et al., 2019).
6. Connections to Related Subsampling Methods
UIDS generalizes and unifies several non-uniform subsampling strategies:
- Leverage-score sampling: Targets data covariance geometry but ignores residual impact; UIDS incorporates both, improving upon this approach (Ting et al., 2017).
- Weighted influence sampling: Horvitz-Thompson corrections require weighted fitting; UIDS achieves equivalent asymptotic properties while retaining an unweighted fit, simplifying downstream training (Ting et al., 2017).
- Data dropout: Deterministic dropping based on influence can improve risk but is brittle and less robust to external distribution shifts than smooth UIDS variants (Wang et al., 2019).
UIDS stands out for strict theoretical guarantees of risk reduction and empirical efficiency gains.
7. Practical Considerations and Limitations
Implementing UIDS requires attention to numerical issues and problem-specific characteristics:
- Scalability: Influence computation dominates for high-dimensional settings; matrix sketching, Hessian-free methods, and batching are essential.
- Model Classes: UIDS is effective for convex, twice-differentiable ERM problems. For tree ensembles or non-differentiable architectures, proxy modelling is recommended.
- Robustness: Smoothing the sampling map (using linear or sigmoid functions) ensures generalization and prevents over-confident subsample overfitting to the validation set.
A plausible implication is that in domains where individual point influence varies widely, UIDS can substantially accelerate model selection and hyperparameter optimization (Raj et al., 2020). However, effectiveness degrades if most training points are similarly influential or when local linearity assumptions are violated.
UIDS represents a rigorously justified, empirically validated approach for data reduction in large-scale model selection, offering unique advantages over competing subsampling techniques.