
Unweighted Influence Data Subsampling

Updated 16 January 2026
  • UIDS is a data reduction technique that leverages influence functions to select subsamples with maximal statistical fidelity while enabling unweighted training.
  • The methodology employs strategies such as greedy batch selection, probabilistic subsampling, and pseudo-observation construction to enhance efficiency and model risk reduction.
  • UIDS offers rigorous theoretical guarantees—risk improvement, distributional robustness, and asymptotic efficiency—demonstrating superior performance over random and leverage-based methods.

Unweighted Influence Data Subsampling (UIDS) is a principled data reduction methodology that leverages influence functions to select subsamples providing maximal statistical fidelity to the full dataset, under the constraint that final training is unweighted. UIDS is designed to substantially reduce the computational burden of large-scale model fitting and model selection, to outperform random and leverage-based sampling, and, under well-characterized conditions, even to surpass full-data ERM performance on external validation (Wang et al., 2019; Raj et al., 2020; Ting et al., 2017).

1. Influence Functions: Formal Definitions and Roles

UIDS is fundamentally predicated on influence functions, which quantify the infinitesimal effect of individual data points on model estimators. Let $Z = \{z_i\}_{i=1}^{N}$ be a dataset, and let $\hat\theta(Z) = \arg\min_\theta \sum_{i=1}^N \ell(z_i, \theta)$ denote the empirical risk minimizer. The influence of a training point $z$ is captured via the Bouligand Influence Function (BIF):

$$\mathrm{BIF}(z; \hat\theta, Z) = \lim_{\varepsilon \to 0} \frac{\hat\theta((1-\varepsilon)Z + \varepsilon \{z\}) - \hat\theta(Z)}{\varepsilon}$$

Under suitable smoothness (twice differentiable loss, non-degenerate Hessian), one obtains for M-estimators:

$$\mathrm{BIF}(z; \hat\theta, Z) = -H_Z^{-1} \nabla_\theta \ell(z, \hat\theta(Z))$$

where $H_Z = \sum_{i=1}^N \nabla^2_\theta \ell(z_i, \hat\theta(Z))$ is the empirical-risk Hessian at $\hat\theta(Z)$ (Raj et al., 2020).

Further refinement yields test-risk influence functions, describing the effect of infinitesimally up-weighting $z_i$ on an arbitrary future risk $R(\theta)$, including the loss on a test set $Q'$ directly (Wang et al., 2019). The total influence of $z_i$ on $Q'$ is

$$\phi_i = -\frac{1}{m} \sum_{j=1}^m \nabla_\theta \ell(\theta; z_j)^\top H^{-1} \nabla_\theta \ell(\theta; z_i)$$

This influence summary enables subsample selection based on predictions of model performance shifts.
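As a concrete illustration, the test-risk influence $\phi_i$ has a closed form for logistic regression. The sketch below is illustrative only (synthetic data and names, not code from the cited papers): it fits $\hat\theta$ by Newton's method, then scores every training point against a held-out split standing in for $Q'$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 200, 5, 50
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)

def sigmoid(t):
    return 1 / (1 + np.exp(-np.clip(t, -30, 30)))

y = (rng.random(N) < sigmoid(X @ w_true)).astype(float)

# Fit theta-hat on the full data by Newton's method on the log-loss.
theta = np.zeros(d)
for _ in range(25):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y)                      # sum of per-point gradients
    H = (X * (p * (1 - p))[:, None]).T @ X    # empirical-risk Hessian H_Z
    theta -= np.linalg.solve(H, grad)

# Influence phi_i of each training point on the average loss over a
# held-out split (here the first m points stand in for the Q' set).
Xq, yq = X[:m], y[:m]
g_test = Xq.T @ (sigmoid(Xq @ theta) - yq) / m     # average test gradient
G_train = X * (sigmoid(X @ theta) - y)[:, None]    # per-point gradients
phi = -G_train @ np.linalg.solve(H, g_test)        # -(1/m) sum_j g_j^T H^{-1} g_i
```

Points with the most negative $\phi_i$ are those predicted to reduce the held-out risk most when retained.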

2. UIDS Algorithmic Frameworks and Sampling Criteria

UIDS is instantiated via several algorithmic paradigms arising from the influence analysis. These include:

  • Greedy Batch Selection (Validation-Loss-Guided): At each epoch, compute influence scores $\Delta(z)$, predicting the decrease in validation loss from adding point $z$. Batch the $m$ highest-$\Delta(z)$ points into the subsample. Optionally, $\epsilon$-greedy heuristics mix in random selections to mitigate overfitting (Raj et al., 2020).
    # Simplified pseudocode: see [2010.10218] for detailed steps
    for t in range(T):
        Fit θ_S via ERM on S
        Compute H_S and its inverse
        Precompute grad = Σ_{v∈V} ∇θℓ(v, θ_S)
        For z in Z∖S: score Δ(z) = −gradᵀ (H_S⁻¹ ∇θℓ(z, θ_S))
        Add top-m Δ(z) points to S
  • Probabilistic Subsampling (Smooth Influence Map): Compute per-point influences $\phi_i$ and derive smooth inclusion probabilities $\pi_i$ using decreasing maps (e.g., linear or sigmoid). Subsample independently, then retrain on S unweighted (Wang et al., 2019).
    # Key steps for smooth probabilistic UIDS
    Compute full-set ERM, θ̂
    For each i: φ_i ← total influence of z_i
    π_i ← π(φ_i)   # e.g., π(φ_i) = clip(1 − α·φ_i, 0, 1) or a sigmoid
    S = {i : O_i = 1}, with O_i ∼ Bernoulli(π_i)
    Retrain θ̃ on S
  • Pseudo-Observation Construction (Asymptotically Linear Estimators): For M-estimators and their generalizations, compute influence vectors $\psi_i$ at a pilot estimate $\theta_0$, subsample with probabilities proportional to $\|\psi_i\|$, and correct each subsampled instance by $\psi_i / p_i$. The final estimator averages the pseudo-influences unweighted (Ting et al., 2017).
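The pseudo-observation strategy can be made concrete for the simplest asymptotically linear estimator, the sample mean, whose influence function is $\psi_i = x_i - \theta$. The sketch below is illustrative (pilot size, budget, and data are assumptions, not values from Ting et al., 2017):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 100_000, 2_000                  # full size, expected subsample size
x = rng.normal(loc=3.0, scale=2.0, size=N)

theta0 = x[:500].mean()                # cheap pilot estimate
psi = x - theta0                       # influence values at the pilot
pi = np.minimum(1.0, n * np.abs(psi) / np.abs(psi).sum())  # Poisson inclusion probs

keep = rng.random(N) < pi              # independent Bernoulli draws
k = keep.sum()
# Pseudo-observations fold the Horvitz-Thompson correction into the data
# itself, so the final step is a plain unweighted average.
pseudo = theta0 + (k / N) * psi[keep] / pi[keep]
theta_uids = pseudo.mean()
```

Because $\mathbb{E}\big[\sum_{i \in S} \psi_i/\pi_i\big] = \sum_i \psi_i$, the unweighted average of pseudo-observations is an unbiased estimate of the full-data mean, without any weighted fitting downstream.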

UIDS requires only unweighted training on the selected subset or pseudo-sample, facilitating downstream software compatibility and enabling direct empirical comparisons.
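The smooth probabilistic variant can also be sketched end to end. The example below uses ordinary least squares so that every quantity has a closed form; the sigmoid temperature, split sizes, and data are illustrative assumptions rather than settings from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 5_000, 10
w = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w + 0.5 * rng.normal(size=N)
Xv = rng.normal(size=(500, d))          # validation split driving the influences
yv = Xv @ w + 0.5 * rng.normal(size=500)

theta = np.linalg.lstsq(X, y, rcond=None)[0]   # full-data ERM
H = X.T @ X                                    # Hessian of the squared loss
g_val = Xv.T @ (Xv @ theta - yv) / len(yv)     # average validation gradient
G = X * (X @ theta - y)[:, None]               # per-point training gradients
phi = -G @ np.linalg.solve(H, g_val)           # influence on validation risk

tau = phi.std() + 1e-12
pi = 1 / (1 + np.exp(phi / tau))               # sigmoid map, decreasing in phi
keep = rng.random(N) < pi                      # independent inclusion draws
theta_tilde = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # unweighted retrain
```

The retraining step is an ordinary unweighted fit on the kept rows, which is what makes the procedure compatible with standard training pipelines.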

3. Theory: Risk, Robustness, and Performance Bounds

UIDS enjoys rigorous theoretical guarantees substantiated by the following results:

  • Risk Improvement Over Full Data: Choosing subsamples via influences can yield models that outperform the full-data ERM on an external risk $R(\theta)$, provided the sampling perturbations are negatively correlated with the influence scores. Lemma: if $\mathrm{Cov}(\phi, \epsilon) \leq 0$, then $R(\theta_\epsilon; Q') \leq R(\theta; Q')$ (Wang et al., 2019).
  • Distributional Robustness: By formulating the worst-case risk over a $\chi^2$-ball around the empirical distribution, probabilistic UIDS shows that the worst-case out-of-sample risk is Lipschitz in the influence weights (Theorems 3 and 4 in Wang et al., 2019). Smoothed sampling maps guarantee limited parameter drift and guard against overfitting to validation splits.
  • Asymptotic Efficiency: For asymptotically linear estimators, the UIDS estimator achieves the minimal possible variance among all Poisson subsampling schemes of fixed expected size $n$, matching the Horvitz-Thompson correction while avoiding weighted fitting in practice (Ting et al., 2017).

In summary, provided the influences vary across data points, UIDS matches or strictly outperforms random sampling and leverage-score procedures in estimation accuracy and risk reduction.

4. Computational Complexity, Implementation, and Constraints

UIDS has characteristic computational and memory profiles:

  • Greedy batch UIDS: The dominant cost is Hessian inversion ($O(d^3)$ worst case). Practical implementations leverage Hessian-vector products and conjugate gradients ($O(d^2)$ per score). Batch selection and gradient computation are $O(Nd)$ per round. For large $d$, matrix sketching or Hessian-free approaches are recommended (Raj et al., 2020).
  • Smooth probabilistic sampling: The influence-vector computation uses preconditioned CG to solve $Hs = g$, scaling as $O(knd)$ for $k$ CG steps, with subset retraining cost $O(rndT')$ (Wang et al., 2019).
  • Pseudo-observation construction: Pilot estimation, influence computation, and unweighted averaging scale as $O(Nd)$ in typical regression settings, with memory for storing $n$ influence values and $d$ model parameters (Ting et al., 2017).
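As an illustration of the Hessian-free route mentioned above, the sketch below solves $Hs = g$ for logistic regression with plain (unpreconditioned) conjugate gradient built on Hessian-vector products, so $H$ is never materialized; dimensions and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 10_000, 50
X = rng.normal(size=(N, d))
theta = 0.1 * rng.normal(size=d)
p = 1 / (1 + np.exp(-(X @ theta)))
w = p * (1 - p)                       # per-point curvature of the log-loss

def hvp(v):
    # H v = X^T diag(w) X v, two matvecs: O(N d) per product, no d x d matrix.
    return X.T @ (w * (X @ v))

def conjugate_gradient(g, k=100, tol=1e-10):
    s = np.zeros_like(g)
    r = g - hvp(s)                    # initial residual
    p_dir = r.copy()
    rs = r @ r
    for _ in range(k):
        Hp = hvp(p_dir)
        a = rs / (p_dir @ Hp)
        s += a * p_dir
        r -= a * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p_dir = r + (rs_new / rs) * p_dir
        rs = rs_new
    return s

g = rng.normal(size=d)
s = conjugate_gradient(g)             # approximate solution of H s = g
```

Since the log-loss Hessian is positive definite here ($w_i > 0$), CG converges in at most $d$ iterations in exact arithmetic, and each iteration costs only two matrix-vector products.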

UIDS is generally applicable to differentiable models (logistic/linear regression, generalized linear models, etc.). For non-differentiable models (e.g., tree ensembles), influence-based subsampling using differentiable surrogates allows for effective model transfer (Raj et al., 2020).

UIDS performance is sensitive to:

  • Initial seed size ($M \approx d$ for well-conditioning),
  • Batch size ($m \ll M$ for fine control),
  • Validation set composition,
  • Influence variability across the data.

For highly non-convex losses (deep nets), Hessian invertibility and local linearity assumptions may be violated, limiting applicability.

5. Empirical Results and Benchmark Comparisons

Experimental evaluations demonstrate that UIDS consistently achieves higher accuracy or lower error using smaller data subsets compared to uniform random sampling, leverage-based subsampling, and weighted influence methods.

Key empirical findings

| Dataset | Metric | UIDS vs. Random |
|---|---|---|
| Amazon (LR) | Accuracy | 0.90 with ≈500 points (UIDS) vs. >1000 points needed (random) (Raj et al., 2020) |
| MNIST (LR) | Speedup | >10% speedup for UIDS (Raj et al., 2020) |
| California Housing | RMSE | 1.5 (UIDS) vs. 2.0 (random) (Raj et al., 2020) |
| UCI Breast | Out-of-sample log-loss | 0.0803 (Sig-UIDS) vs. 0.0914 (full-data) (Wang et al., 2019) |

Results on large-scale industrial datasets ($\sim 10^8$ samples) confirm practical feasibility, with influence computations completed in minutes on standard hardware (Wang et al., 2019).

Dropout-style hard selection (keep only points with $\phi_i \leq 0$) can lead to overfitting and large parameter shifts, degrading out-of-sample performance. Smooth probabilistic selection (Sig-UIDS, Lin-UIDS) maintains stability and generalization (Wang et al., 2019).

6. Relation to Other Subsampling Strategies

UIDS generalizes and unifies several non-uniform subsampling strategies:

  • Leverage-score sampling: Targets data covariance geometry but ignores residual impact; UIDS incorporates both, improving upon this approach (Ting et al., 2017).
  • Weighted influence sampling: Horvitz-Thompson corrections require weighted model fitting; UIDS achieves equivalent asymptotic properties while retaining an unweighted fit, simplifying downstream training (Ting et al., 2017).
  • Data dropout: Deterministic dropping based on influence can improve risk but is brittle and less robust to external distribution shifts than smooth UIDS variants (Wang et al., 2019).
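The contrast with leverage scores can be seen directly in ordinary least squares, where both quantities have closed forms. The comparison below is illustrative (synthetic data, not from the cited works): leverage $h_{ii}$ depends only on the position of $x_i$ in feature space, while an influence-style score also scales with the residual.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 1_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + rng.normal(size=N)

theta = np.linalg.lstsq(X, y, rcond=None)[0]
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # h_ii = x_i^T (X^T X)^{-1} x_i
resid = X @ theta - y

# Leverage ignores the target entirely; an influence-style score couples
# the same geometry with the residual, so noisy or atypical labels stand out.
influence_score = leverage * resid**2
```

A point with a typical $x_i$ but a badly mispredicted $y_i$ has low leverage yet high influence score, which is exactly the case leverage-only sampling misses.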

UIDS stands out for strict theoretical guarantees of risk reduction and empirical efficiency gains.

7. Practical Considerations and Limitations

Implementing UIDS requires attention to numerical issues and problem-specific characteristics:

  • Scalability: Influence computation dominates for high-dimensional settings; matrix sketching, Hessian-free methods, and batching are essential.
  • Model Classes: UIDS is effective for convex, twice-differentiable ERM problems. For tree ensembles or non-differentiable architectures, proxy modelling is recommended.
  • Robustness: Smoothing the sampling map (using linear or sigmoid functions) ensures generalization and prevents over-confident subsample overfitting to the validation set.

A plausible implication is that in domains where individual point influence varies widely, UIDS can substantially accelerate model selection and hyperparameter optimization (Raj et al., 2020). However, effectiveness degrades if most training points are similarly influential or when local linearity assumptions are violated.

UIDS represents a rigorously justified, empirically validated approach for data reduction in large-scale model selection, offering unique advantages over competing subsampling techniques.
