Poisson Subsampling Methods

Updated 15 April 2026
  • Poisson subsampling is a randomized technique that independently includes each data point, enabling unbiased estimation and scalable inference.
  • It optimizes inclusion probabilities to minimize estimator variance, employing both with-replacement and without-replacement schemes.
  • It underpins applications in differential privacy, streaming analytics, and coreset constructions, yielding practical efficiency and privacy gains.

Poisson subsampling is a class of randomized subsampling methods that includes each data item independently with some specified probability, resulting in a variable-size, inclusion-weighted subsample. These procedures are central to scalable inference, optimal experimental design, and privacy amplification in modern large-scale statistical and machine learning pipelines. They are implemented in both with-replacement (PO–WR) and without-replacement (PO–WOR) modes, with the key property that inclusion events are independent across data points. The framework provides unbiased estimation, supports efficient streaming and distributed settings, yields improved estimator variance under optimal weighting, and underpins tight privacy amplification in the context of differential privacy.

1. Formal Definition and Statistical Principles

Let $\mathcal{D} = \{1, \ldots, N\}$ index a dataset, with subsampling probabilities $\mu_i \in [0,1]$ satisfying $\sum_{i=1}^N \mu_i = n$ (the target sample size). Poisson subsampling independently selects each index $i$ according to:

  • With replacement (PO–WR): $S_i \sim \text{Poisson}(\mu_i)$. Each datum may appear multiple times; the expected total sample size is $n$.
  • Without replacement (PO–WOR): $S_i \sim \text{Bernoulli}(\mu_i)$. Each datum is included at most once; the expected total sample size is $n$.

Observed set: $\mathcal{S} = \{ i : S_i > 0\}$.

For empirical risk minimization or estimating equations, the Poisson-inclusion-weighted empirical risk is

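$$\hat\ell_\mu(\theta) = \sum_{i \in \mathcal{D}} \frac{S_i}{\mu_i} \ell_i(\theta),$$

which is unbiased for the full-data risk: $\mathbb{E}[\hat\ell_\mu(\theta)] = \sum_{i \in \mathcal{D}} \ell_i(\theta)$, since $\mathbb{E}[S_i] = \mu_i$ under both schemes (Imberg et al., 2023).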

This independence structure is conducive to low-memory, streaming, and parallel implementations—no central sample-size coordination or state needs to be maintained.
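
As a minimal sketch of the two schemes defined above, assuming NumPy, the following draws the independent counts $S_i$ and forms the inverse-probability-weighted sum. The function names (`poisson_subsample`, `weighted_total`), the toy losses, and the uniform probabilities are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def poisson_subsample(mu, rng, replace=False):
    """Draw independent inclusion counts S_i for probabilities mu_i.

    replace=False: S_i ~ Bernoulli(mu_i)  (PO-WOR)
    replace=True:  S_i ~ Poisson(mu_i)    (PO-WR)
    """
    mu = np.asarray(mu, dtype=float)
    if replace:
        return rng.poisson(mu)
    return (rng.random(mu.shape) < mu).astype(int)

def weighted_total(losses, mu, counts):
    """Inverse-probability-weighted estimate of sum_i losses_i."""
    keep = counts > 0                      # only sampled items are touched
    return np.sum(counts[keep] / mu[keep] * losses[keep])

rng = np.random.default_rng(0)
N, n = 1000, 100
losses = rng.gamma(2.0, 1.0, size=N)       # toy per-item losses (illustrative)
mu = np.full(N, n / N)                     # uniform probabilities summing to n

# Average many independent subsample estimates to check unbiasedness empirically.
estimates = {
    name: np.mean([weighted_total(losses, mu, poisson_subsample(mu, rng, rep))
                   for _ in range(2000)])
    for name, rep in [("PO-WOR", False), ("PO-WR", True)]
}
full_total = losses.sum()
```

Averaged over repeated draws, both entries of `estimates` should sit close to `full_total`, while any single draw touches only about $n$ of the $N$ items.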

2. Optimal Subsampling Probabilities and Estimator Properties

Estimator efficiency under Poisson subsampling is governed by the choice of inclusion probabilities $\mu_i$, typically tuned to minimize the variance of the resulting estimator in linear or generalized linear models, or more generally under estimating equations.

Given the full-data optimum $\hat\theta_0$, the subsampled estimator $\hat\theta_\mu$ is asymptotically normal around it:

$$\hat\theta_\mu - \hat\theta_0 \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\big(0,\; H^{-1} V H^{-1}\big),$$

with sandwich covariance $H^{-1} V H^{-1}$, where $H$ is the Hessian over the full data and $V$ is given explicitly for each subsampling variant (Imberg et al., 2023).

A- and L-optimality: Minimizing the trace of the asymptotic covariance (A-optimality) or the trace of a linear transformation of it (L-optimality) yields inclusion probabilities proportional to per-observation scores that depend on the model and the criterion (Yu et al., 2020, Li et al., 28 Aug 2025).

The optimality conditions admit a closed-form solution (for both PO–WR and PO–WOR) via capping and normalization, which permits scalable implementation even in massive-data settings. Empirically, invariant linear optimality criteria achieve more than 90% D-efficiency at a fraction of the cost of computing a D-optimal design (Imberg et al., 2023).
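
The capping-and-normalization step can be sketched generically as follows. This assumes nonnegative per-observation scores are already available (their form is problem-dependent) and seeks $\mu_i = \min(1, t \cdot \text{score}_i)$ with $\sum_i \mu_i = n$. The function name `capped_probabilities` is hypothetical, and the loop is a simple illustration rather than the exact procedure of the cited papers.

```python
import numpy as np

def capped_probabilities(scores, n):
    """Return mu_i = min(1, t * scores_i) scaled so that sum(mu) == n."""
    scores = np.asarray(scores, dtype=float)
    if np.any(scores < 0):
        raise ValueError("scores must be nonnegative")
    if not 0 < n <= np.count_nonzero(scores):
        raise ValueError("n must be positive and at most the number of positive scores")
    mu = np.zeros_like(scores)
    capped = np.zeros(scores.shape, dtype=bool)
    while True:
        remaining = n - capped.sum()               # mass left for uncapped items
        t = remaining / scores[~capped].sum()      # normalizing constant
        mu[~capped] = t * scores[~capped]
        mu[capped] = 1.0
        newly = (~capped) & (mu > 1.0)             # items that now exceed 1
        if not newly.any():
            return mu
        capped |= newly                            # cap them and renormalize the rest

scores = np.array([10.0] + [1.0] * 9)
mu = capped_probabilities(scores, n=4)
```

In this toy case the dominant score is capped at 1 and the remaining mass of 3 is spread evenly over the other nine items. Each pass caps at least one more item, so the loop ends after at most $N$ passes.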

Variance comparison: Poisson subsampling (PO–WOR) yields estimator variance no larger than that of subsampling with replacement, with the gap widening as the subsample becomes a non-negligible fraction of the full data, that is, as $n/N$ moves away from zero (Wang et al., 2022).
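
The mechanism behind this gap can be illustrated with a short calculation for the weighted risk $\hat\ell_\mu(\theta)$ of Section 1 at a fixed $\theta$, comparing the two Poisson schemes defined there. It uses only the independence of the $S_i$ and the variances of the Poisson and Bernoulli distributions; it is an illustration, not the result of Wang et al. (2022), which concerns the estimator itself.

```latex
\begin{aligned}
\operatorname{Var}\big[\hat\ell_\mu(\theta)\big]
  &= \sum_{i=1}^N \frac{\operatorname{Var}(S_i)}{\mu_i^2}\,\ell_i(\theta)^2, \\
\text{PO--WR: } \operatorname{Var}(S_i) = \mu_i
  &\;\Rightarrow\; \sum_{i=1}^N \frac{\ell_i(\theta)^2}{\mu_i}, \\
\text{PO--WOR: } \operatorname{Var}(S_i) = \mu_i(1-\mu_i)
  &\;\Rightarrow\; \sum_{i=1}^N \frac{(1-\mu_i)\,\ell_i(\theta)^2}{\mu_i}.
\end{aligned}
```

The two variances differ by $\sum_i \ell_i(\theta)^2$, which is negligible relative to either term when every $\mu_i$ is small and becomes material as the sampling fraction grows.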

3. Methodological Scope and Application Domains

Linear and Generalized Linear Models

  • Linear regression: Poisson schemes yield consistent and asymptotically normal OLS estimates under mild conditions, with sharp error bounds in finite-sample settings (Zhu, 2015).
  • Quasi-likelihood and GEE: Adapted optimality criteria and two-step pilot procedures scale Poisson subsampling to very high dimensions (a growing number of parameters) and longitudinal/GEE analyses (Yu et al., 2020, Li et al., 28 Aug 2025).
  • Poisson regression, log-link: Locally optimal Poisson subsampling designs are characterized by support regions with explicit moment and sensitivity conditions, yielding higher efficiency than naïve uniform or heuristic approaches in high-skew, count-data regimes (Reuter et al., 2024).
  • Loss function approximation (coresets): Poisson subsampling grounds sensitivity-based coreset constructions for Poisson generalized linear models under identity and square-root links, with provably sublinear size in the number of observations and logarithmic dependence on problem parameters (Lie et al., 2024).

Distributed and Streaming Analytics

For massive datasets distributed across blocks or disks, Poisson inclusion probabilities can be computed and realized in a streaming fashion using only local statistics, enabling scalable aggregation and maintaining optimality (Yu et al., 2020). Asymptotic normality of the aggregated estimator is preserved under mild additional constraints.
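
Because each inclusion decision is independent, a one-pass implementation needs no global state. The sketch below is a minimal illustration, assuming a hypothetical `prob_fn` that returns each item's inclusion probability from information available when the item arrives; in practice that function would encode the locally computed optimal probabilities.

```python
import numpy as np

def stream_poisson_subsample(stream, prob_fn, rng):
    """Yield (item, weight) pairs in a single pass over `stream`.

    prob_fn(item) returns the inclusion probability in [0, 1]
    (hypothetical interface); the weight is its inverse.
    """
    for item in stream:
        p = prob_fn(item)
        if rng.random() < p:               # independent Bernoulli(p) decision
            yield item, 1.0 / p

rng = np.random.default_rng(1)
data = (float(x) for x in range(1, 10001))  # generator: items are seen once, in order
sample = list(stream_poisson_subsample(data, lambda x: 0.05, rng))
estimated_sum = sum(x * w for x, w in sample)
true_sum = 10000 * 10001 / 2
```

Only the retained items and their weights are stored, so memory scales with the subsample rather than the stream. Separate blocks can each run the same loop with their own generator and have their weighted samples concatenated afterwards.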

4. Privacy Amplification and Differential Privacy

Poisson subsampling is central to privacy amplification, a key paradigm in differentially private learning and statistical release:

  • Amplification theorem: If a mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-DP, then the mechanism $\mathcal{M}'$ obtained by applying $\mathcal{M}$ to a Poisson subsample drawn with rate $q$ enjoys strictly better $(\varepsilon', \delta')$ parameters: $\varepsilon' = \log\big(1 + q(e^{\varepsilon} - 1)\big)$, $\delta' = q\delta$ (Balle et al., 2018, Chua et al., 2024, Feldman et al., 19 Feb 2026).
  • PLD realization: Privacy Loss Distribution (PLD) accounting for Poisson subsampling supports exact, numerically stable evaluation of the $(\varepsilon, \delta)$ guarantee via an explicit, efficiently computable transform on the PLD of the underlying mechanism (Feldman et al., 19 Feb 2026).
  • Comparison with random allocation/shuffling: Recent work establishes that random allocation (balls-in-bins) sampling can match or exceed the privacy amplification of Poisson under empirical conditions (Feldman et al., 19 Feb 2026). Naively applying Poisson bounds in shuffling scenarios overstates true privacy (Chua et al., 2024).
  • DP-SGD and parallel implementations: Poisson subsampling is the gold standard for privacy accounting in stochastic optimization, with scalable implementation via capped and padded batched selection compatible with MapReduce, Spark, and TPU paradigms (Chua et al., 2024).
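
To make the amplification effect concrete, the sketch below evaluates the standard closed-form bound $\varepsilon' = \log(1 + q(e^{\varepsilon} - 1))$, $\delta' = q\delta$ for Poisson subsampling under the add/remove neighboring relation (Balle et al., 2018). The function name is hypothetical. This is a single-step bound only; numerical accountants such as PLD-based ones generally give tighter values when many steps are composed.

```python
import math

def amplify_by_poisson_subsampling(eps, delta, q):
    """Map an (eps, delta)-DP guarantee to the guarantee after Poisson subsampling.

    Uses eps' = log(1 + q * (exp(eps) - 1)) and delta' = q * delta.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    eps_sub = math.log1p(q * math.expm1(eps))   # stable for small q and eps
    return eps_sub, q * delta

eps_sub, delta_sub = amplify_by_poisson_subsampling(eps=1.0, delta=1e-5, q=0.01)
```

With a sampling rate of 1%, a base guarantee of $\varepsilon = 1$ shrinks to roughly $\varepsilon' \approx 0.017$, which is close to $q(e^{\varepsilon} - 1)$ for small $q$.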

5. Poisson Subsampling in Bayesian and Monte Carlo Inference

The block-Poisson estimator provides an exact, unbiased estimator for data likelihood in pseudo-marginal and Hamiltonian MCMC. The estimator is constructed via a product of independent Poisson expansions, each applied to a batch-corrected, variance-reduced control variate of the log-likelihood. This design allows for:

  • Tuning the variance via mini-batch size, Poisson rate, and block count.
  • Positive correlation induction across MCMC draws for improved chain mixing, with the strength of the correlation controlled by the number of blocks.
  • Importance weighting and signed-correction schemes in the presence of possibly negative estimates.
  • Empirical acceleration over Firefly MCMC and Zig-Zag methods, with substantial reported efficiency gains in high dimensions and for large datasets (Quiroz et al., 2016).
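
The product-of-Poisson-expansions construction can be illustrated with a simplified sketch. It assumes an unbiased estimator $\hat d$ of a scalar $d$ (standing in for a mini-batch estimate of a control-variate-corrected log-likelihood) and returns an unbiased estimate of $e^{d}$ as a product of `lam` independent blocks, each using a Poisson(1) number of terms. The names, the centering constant `a`, and the toy Gaussian noise are assumptions for illustration; the estimator in Quiroz et al. (2016) involves additional structure not shown here.

```python
import numpy as np

def block_poisson_estimate(draw_d_hat, a, lam, rng):
    """One draw of a product of `lam` Poisson estimators, unbiased for exp(d).

    draw_d_hat(rng) returns an unbiased estimate of d.
    Each block is unbiased for exp(d / lam), so the product targets exp(d).
    """
    estimate = 1.0
    for _ in range(lam):
        x = rng.poisson(1.0)                      # number of terms in this block
        block = np.exp((a + lam) / lam)
        for _ in range(x):
            block *= (draw_d_hat(rng) - a) / lam  # fresh, independent estimate per term
        estimate *= block
    return estimate

rng = np.random.default_rng(2)
d_true = 0.5
noisy_d = lambda r: d_true + 0.2 * r.standard_normal()  # toy unbiased estimator of d
lam = 5
a = d_true - lam              # centering that keeps each factor near 1 (low variance)
draws = [block_poisson_estimate(noisy_d, a, lam, rng) for _ in range(5000)]
mc_mean = float(np.mean(draws))
target = float(np.exp(d_true))
```

Unbiasedness follows from $\mathbb{E}[z^{X}] = e^{z - 1}$ for $X \sim \text{Poisson}(1)$, applied per block with $z = (d - a)/\lambda$. A factor $(\hat d - a)/\lambda$ can be negative when $a$ is not a true lower bound, which is the situation the signed-correction schemes above address.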

6. Geometric and Computer Graphics Applications: Poisson-disk Subsampling

In computational geometry, Poisson-disk (blue-noise) subsampling generalizes the Poisson principle to select spatial subsets of 3D point clouds or meshes with maximal minimum interpoint distance:

  • Objective: maximize $\min_{i \neq j} \lVert x_i - x_j \rVert$ for $i, j \in \mathcal{S}$, where $x_i$ denotes the position of point $i$, targeting spatial uniformity.
  • Efficient algorithms combine $k$-nearest neighbors cost definitions, voxel hashed point storage, and deferred local recomputation for out-of-core scalability.
  • Feature-aware extensions modify the inclusion cost based on surface normals or color similarity, preserving sharp geometric and textural structures.
  • The resulting method yields high visual uniformity and scalability to very large point clouds with linear time/space complexity (Comino-Trinidad et al., 2023).
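
For orientation, the minimum-distance constraint can be enforced with a textbook greedy ("dart throwing") pass over a voxel hash, sketched below with NumPy. This is a deliberately simple stand-in and not the cost-based $k$-nearest-neighbors method of Comino-Trinidad et al. (2023); the function name and the random test cloud are illustrative assumptions.

```python
import numpy as np
from itertools import product

def poisson_disk_subsample(points, radius, rng):
    """Greedy pass: keep a point only if no kept point lies within `radius`."""
    points = np.asarray(points, dtype=float)
    dim = points.shape[1]
    grid = {}                                    # voxel index -> kept point indices
    offsets = list(product((-1, 0, 1), repeat=dim))
    kept = []
    for i in rng.permutation(len(points)):       # visit points in random order
        cell = tuple(np.floor(points[i] / radius).astype(int).tolist())
        ok = True
        for off in offsets:                      # voxels of side `radius`: check neighbors only
            neighbor = tuple(c + o for c, o in zip(cell, off))
            for j in grid.get(neighbor, ()):
                if np.linalg.norm(points[i] - points[j]) < radius:
                    ok = False
                    break
            if not ok:
                break
        if ok:
            kept.append(int(i))
            grid.setdefault(cell, []).append(int(i))
    return np.array(kept)

rng = np.random.default_rng(3)
cloud = rng.random((2000, 3))                    # toy 3D point cloud in the unit cube
idx = poisson_disk_subsample(cloud, radius=0.1, rng=rng)
```

Because the voxel side equals the radius, any conflicting point must lie in one of the adjacent voxels, so each candidate is checked against a bounded neighborhood rather than the whole kept set.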

7. Limitations and Open Directions

  • Sensitivity to pilot estimation: Optimal Poisson probabilities depend on reliable pilot estimates of the unknown parameter; two-stage pilot-main routines mitigate this dependence.
  • Random sample size: The Poisson approach yields a random realized subsample size that is concentrated around its expectation, which may be undesirable in applications with fixed quotas (Wang et al., 2022).
  • Sublinear coresets in complex or nonstandard models: For higher-order root-link or non-GLM loss functions, the domain-shifting and sensitivity framework may not yield sublinear coresets, leaving open the need for new summarization paradigms (Lie et al., 2024).
  • Downstream privacy loss in complex subsampling regimes: Proper accounting in correlated or structured sampling (e.g., banded noise, $b$-min-sep) requires advanced Monte Carlo privacy accounting or dynamic programming, and is an active area of research (Dong et al., 10 Feb 2026).
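
On the random-sample-size point, the spread of the realized size under PO–WOR follows directly from the independence of the Bernoulli indicators. Writing $|\mathcal{S}| = \sum_i S_i$, a short worked calculation gives:

```latex
\mathbb{E}\,|\mathcal{S}| = \sum_{i=1}^N \mu_i = n,
\qquad
\operatorname{Var}\,|\mathcal{S}| = \sum_{i=1}^N \mu_i (1 - \mu_i) \le n,
\qquad
\frac{\sqrt{\operatorname{Var}\,|\mathcal{S}|}}{\mathbb{E}\,|\mathcal{S}|} \le \frac{1}{\sqrt{n}} .
```

For a target of $n = 10{,}000$, for example, the standard deviation of the realized size is at most 100, or 1% of the target.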

Summary Table: Key Poisson Subsampling Variants

| Variant/Domain | Principle | Optimality/Statistical Feature |
|---|---|---|
| PO–WOR (classical) | Bernoulli($\mu_i$) inclusion | Asymptotically normal, lower variance, streaming-friendly (Imberg et al., 2023, Wang et al., 2022) |
| Block-Poisson MCMC | Poisson series expansion | Exact likelihood estimation, gradient blocking, strong efficiency (Quiroz et al., 2016) |
| Differential Privacy (DP) | Amplification via Poisson | Tight $(\varepsilon, \delta)$ accounting, PLD transforms, implementation at scale (Chua et al., 2024, Feldman et al., 19 Feb 2026) |
| Coreset/GLM | Sensitivity sampling | Sublinear-size coresets in low-complexity regimes (Lie et al., 2024) |
| Geometric/PointCloud | Spatial $k$-NN Poisson-disk | Maximal min distance, out-of-core, feature-extensions (Comino-Trinidad et al., 2023) |

The Poisson subsampling paradigm and its optimal variants constitute a unifying thread across scalable statistical computation, efficient sampling design, advanced Monte Carlo methods, high-dimensional data privacy, and pragmatic geometric modeling. Research continues into sharper optimality criteria, domain-specific coreset construction, refined privacy accounting in structured subsampling procedures, and algorithms for ever-larger data and more intricate stochastic processes.
