Poisson Subsampling Methods
- Poisson subsampling is a randomized technique that independently includes each data point, enabling unbiased estimation and scalable inference.
- It optimizes inclusion probabilities to minimize estimator variance, employing both with-replacement and without-replacement schemes.
- It underpins applications in differential privacy, streaming analytics, and coreset constructions, yielding practical efficiency and privacy gains.
Poisson subsampling is a class of randomized subsampling methods that includes each data item independently with some specified probability, resulting in a variable-size, inclusion-weighted subsample. These procedures are central to scalable inference, optimal experimental design, and privacy amplification in modern large-scale statistical and machine learning pipelines. They are implemented in both with-replacement (PO–WR) and without-replacement (PO–WOR) modes, with the key property that inclusion events are independent across data points. The framework provides unbiased estimation, supports efficient streaming and distributed settings, yields improved estimator variance under optimal weighting, and underpins tight privacy amplification in the context of differential privacy.
1. Formal Definition and Statistical Principles
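Let $i = 1, \dots, N$ index a dataset, with subsampling probabilities $\pi_i \in (0, 1]$, $\sum_{i=1}^{N} \pi_i = n$ (target sample size). Poisson subsampling independently draws an inclusion count $S_i$ for each index according to:
- With replacement (PO–WR): $S_i \sim \mathrm{Poisson}(\pi_i)$. Each datum may appear multiple times; total expected sample size is $\sum_i \pi_i = n$.
- Without replacement (PO–WOR): $S_i \sim \mathrm{Bernoulli}(\pi_i)$; at most one inclusion per entry, expected total $n$.
Observed set: $\mathcal{S} = \{i : S_i > 0\}$.
For empirical risk minimization or estimating equations, with $\ell_i(\theta)$ the loss contribution of datum $i$, the Poisson-inclusion-weighted empirical risk is
$$\hat{\ell}(\theta) = \sum_{i \in \mathcal{S}} \frac{S_i}{\pi_i}\, \ell_i(\theta),$$
which is unbiased because $E[S_i] = \pi_i$ under both schemes: $E[\hat{\ell}(\theta)] = \sum_{i=1}^{N} \ell_i(\theta)$ (Imberg et al., 2023).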
This independence structure is conducive to low-memory, streaming, and parallel implementations: no central sample-size coordination or shared state needs to be maintained.
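The two inclusion schemes and the weighted estimator can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names (`draw_poisson`, `poisson_counts`, `weighted_total`), the toy values, and the toy probabilities are assumptions made for the example and do not come from any cited implementation. Each item receives an independent count (Poisson for PO–WR, Bernoulli for PO–WOR), and sampled items are weighted by count divided by inclusion probability.

```python
import math
import random


def draw_poisson(mean, rng):
    """Draw one Poisson(mean) variate with Knuth's multiplication method (suitable for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_counts(probs, rng, with_replacement=False):
    """Independent inclusion counts: Poisson(p) for PO-WR, Bernoulli(p) for PO-WOR."""
    if with_replacement:
        return [draw_poisson(p, rng) for p in probs]
    return [1 if rng.random() < p else 0 for p in probs]


def weighted_total(values, probs, counts):
    """Inverse-probability-weighted estimate of sum(values); only sampled items contribute."""
    return sum(s / p * v for v, p, s in zip(values, probs, counts) if s > 0)


# Toy check of unbiasedness: average the estimate over many independent replications.
rng = random.Random(0)
values = [float(i % 7 + 1) for i in range(200)]   # hypothetical per-item quantities
probs = [0.1 + 0.4 * (v / 7.0) for v in values]   # hypothetical inclusion probabilities
true_total = sum(values)
reps = 2000
mean_wor = sum(
    weighted_total(values, probs, poisson_counts(probs, rng)) for _ in range(reps)
) / reps
mean_wr = sum(
    weighted_total(values, probs, poisson_counts(probs, rng, True)) for _ in range(reps)
) / reps
```

Both replication averages should land close to `true_total`, since the expected count of each item equals its inclusion probability under either scheme. No item needs to know what happened to any other item, which is what makes the scheme easy to run in parallel.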
2. Optimal Subsampling Probabilities and Estimator Properties
Estimator efficiency under Poisson subsampling is governed by the choice of inclusion probabilities $\pi_i$, typically tuned to minimize the variance of the resultant estimator in linear or generalized linear models, or more generally under estimating equations.
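Given full-data optimum $\hat\theta$, the asymptotic distribution of the subsampled estimator $\tilde\theta$ is
$$\tilde\theta - \hat\theta \;\overset{a}{\sim}\; \mathcal{N}(0, \Gamma),$$
with sandwich covariance $\Gamma = H^{-1} V H^{-1}$, where $H$ is the Hessian over full data and $V$ is given explicitly for each subsampling variant (Imberg et al., 2023).
A- and L-optimality: Minimizing $\operatorname{tr}(\Gamma)$ (A-optimality) or $\operatorname{tr}(L^\top \Gamma L)$ for a chosen matrix $L$ (L-optimality) yields Poisson scores proportional to norms of transformed per-datum gradients $\nabla \ell_i(\hat\theta)$ or other problem-dependent quantities (Yu et al., 2020, Li et al., 28 Aug 2025):
$$\pi_i \propto \bigl\lVert L^\top H^{-1} \nabla \ell_i(\hat\theta) \bigr\rVert, \qquad L = I \text{ for A-optimality}.$$
The optimality conditions admit a closed-form solution (for PO–WR/PO–WOR) via capping and normalization, which permits scalable implementation even in massive-data settings. Empirically, invariant linear criteria (e.g., a suitably chosen L-optimality criterion) achieve >90% D-efficiency at a fraction of the computational cost of D-optimality (Imberg et al., 2023).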
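The capping-and-normalization step can be sketched as below. This is an illustrative routine under stated assumptions: `scores` stands for any nonnegative per-item optimality score (for example a gradient norm), `target_size` is the desired expected subsample size, and the function name `capped_probabilities` is hypothetical. It returns probabilities of the form min(1, c · score) whose sum equals the target, by capping the largest scores at 1 and renormalizing the remaining budget over the uncapped items.

```python
def capped_probabilities(scores, target_size):
    """Return p_i = min(1, c * score_i) with sum(p_i) == target_size.

    Items whose scaled score would reach 1 are capped at 1, and the
    remaining expected-size budget is spread over the uncapped items.
    """
    positive = sum(1 for s in scores if s > 0)
    if not 0 < target_size <= positive:
        raise ValueError("target_size must lie in (0, number of positive scores]")

    # Visit items from largest to smallest score so caps are applied first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    remaining_mass = sum(scores)
    probs = [0.0] * len(scores)
    capped = 0
    scale = 0.0
    for i in order:
        budget = target_size - capped
        if budget <= 0 or remaining_mass <= 0:
            scale = 0.0          # nothing left to distribute
            break
        scale = budget / remaining_mass
        if scale * scores[i] < 1.0:
            break                # all remaining items stay below the cap
        probs[i] = 1.0
        capped += 1
        remaining_mass -= scores[i]

    for i in order[capped:]:
        probs[i] = scale * scores[i]
    return probs


example = capped_probabilities([10, 1, 1, 1, 1, 0], 3)
```

In the example, the dominant item is included with certainty and the remaining expected budget of two draws is split evenly over the four unit-score items; the zero-score item is never sampled.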
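Variance comparison: Poisson subsampling (PO–WOR) has estimator variance no larger than that of subsampling with replacement, with the advantage most pronounced when the subsample is a non-negligible fraction of the full data, i.e., when the sampling fraction $n/N$ (subsample size $n$, full-data size $N$) does not vanish (Wang et al., 2022).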
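The direction of this ordering can be seen in the simplest setting. The following worked sketch compares the two Poisson variants for an inverse-probability-weighted estimate of a total, using the same inclusion probabilities $\pi_i$ in both; the symbols $y_i$ and $T$ are introduced here only for illustration, and the cited comparison covers with-replacement sampling more generally than this toy case.

```latex
\begin{align*}
\hat T &= \sum_{i=1}^{N} \frac{S_i}{\pi_i}\, y_i,
 \qquad
 \operatorname{Var}(\hat T) = \sum_{i=1}^{N} \frac{y_i^2}{\pi_i^2}\,\operatorname{Var}(S_i)
 \quad \text{(independent } S_i\text{)} \\
\text{PO--WR: } S_i \sim \mathrm{Poisson}(\pi_i)
 &\;\Rightarrow\; \operatorname{Var}(\hat T_{\mathrm{WR}}) = \sum_{i=1}^{N} \frac{y_i^2}{\pi_i} \\
\text{PO--WOR: } S_i \sim \mathrm{Bernoulli}(\pi_i)
 &\;\Rightarrow\; \operatorname{Var}(\hat T_{\mathrm{WOR}}) = \sum_{i=1}^{N} \frac{(1-\pi_i)\, y_i^2}{\pi_i} \\
\operatorname{Var}(\hat T_{\mathrm{WR}}) - \operatorname{Var}(\hat T_{\mathrm{WOR}})
 &= \sum_{i=1}^{N} y_i^2 \;\ge\; 0
\end{align*}
```

The gap is small relative to either variance when every $\pi_i$ is close to zero and grows in relative terms as the inclusion probabilities increase, which matches the qualitative statement about non-negligible sampling fractions.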
3. Methodological Scope and Application Domains
Linear and Generalized Linear Models
- Linear regression: Poisson schemes yield consistent and asymptotically normal OLS estimates under mild conditions, with sharp error bounds in finite-sample settings (Zhu, 2015).
- Quasi-likelihood and GEE: Adapted optimality criteria and two-step pilot procedures scale Poisson subsampling to very high dimensions (growing dimension $p$) and longitudinal/GEE analyses (Yu et al., 2020, Li et al., 28 Aug 2025).
- Poisson regression, log-link: Locally $D$-optimal Poisson subsampling is characterized by support regions with explicit moment and sensitivity conditions, yielding higher efficiency than naïve uniform or heuristic approaches in high-skew, count-data regimes (Reuter et al., 2024).
- Loss function approximation (coresets): Poisson subsampling grounds sensitivity-based coreset constructions for Poisson generalized linear models under identity and square-root links, with provable sublinear size in the number of data points and logarithmic dependence on problem parameters (Lie et al., 2024).
Distributed and Streaming Analytics
For massive datasets distributed across blocks or disks, Poisson inclusion probabilities can be computed and realized in a streaming fashion using only local statistics, enabling scalable aggregation and maintaining optimality (Yu et al., 2020). Asymptotic normality of the aggregated estimator is preserved under mild additional constraints.
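A one-pass, block-wise realization can be sketched as follows. The sketch assumes that a normalizing constant `scale` (for example, target size divided by a pilot estimate of the total score) is already available, so each record's inclusion probability depends only on that record; the names `stream_poisson_subsample`, `score`, and `scale`, and the toy blocks, are hypothetical and not taken from the cited work.

```python
import random


def stream_poisson_subsample(records, score, scale, rng):
    """Yield (record, weight) pairs in a single pass over `records`.

    Each record is kept independently with probability min(1, scale * score(record));
    the weight is the inverse of that probability, for use in weighted estimation.
    """
    for record in records:
        prob = min(1.0, scale * score(record))
        if prob > 0.0 and rng.random() < prob:
            yield record, 1.0 / prob


def score(record):
    """Hypothetical score: here simply the record's own value."""
    return float(record)


# Two blocks processed independently (e.g., on different workers), then concatenated.
blocks = [range(1, 51), range(51, 101)]
scale = 20 / 5050.0   # assumed: target size 20 divided by the total score of 1..100
sample = []
for block_id, block in enumerate(blocks):
    block_rng = random.Random(block_id)   # per-block generator, no shared state
    sample.extend(stream_poisson_subsample(block, score, scale, block_rng))
```

Because the blocks never exchange information beyond the shared `scale`, the aggregation step is a plain concatenation of the weighted records.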
4. Privacy Amplification and Differential Privacy
Poisson subsampling is central to privacy amplification, a key paradigm in differentially private learning and statistical release:
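- Amplification theorem: If a mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-DP, then $\mathcal{M} \circ \mathcal{S}_q$, the mechanism applied to a Poisson subsample with rate $q$, satisfies $(\varepsilon', \delta')$-DP with strictly better parameters when $q < 1$: $\varepsilon' = \log\bigl(1 + q(e^{\varepsilon} - 1)\bigr)$, $\delta' = q\delta$ (Balle et al., 2018, Chua et al., 2024, Feldman et al., 19 Feb 2026).
- PLD realization: Privacy Loss Distribution (PLD) accounting for Poisson subsampling supports exact, numerically stable evaluation of $(\varepsilon, \delta)$ guarantees via an explicit transform on the PLD of the underlying mechanism that can be computed efficiently (Feldman et al., 19 Feb 2026).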
- Comparison with random allocation/shuffling: Recent work establishes that random allocation (balls-in-bins) sampling can match or exceed the privacy amplification of Poisson under empirical conditions (Feldman et al., 19 Feb 2026). Naively applying Poisson bounds in shuffling scenarios overstates true privacy (Chua et al., 2024).
- DP-SGD and parallel implementations: Poisson subsampling is the gold standard for privacy accounting in stochastic optimization, with scalable implementation via capped and padded batched selection compatible with MapReduce, Spark, and TPU paradigms (Chua et al., 2024).
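The basic amplification bound for a single application of a mechanism, namely an amplified epsilon of log(1 + q(e^ε − 1)) and an amplified delta of qδ for Poisson sampling rate q, is easy to evaluate numerically. The helper below is a small sketch; the function name `amplify_by_poisson_subsampling` is an assumption for illustration, and it covers only this closed-form bound, not PLD-based or composition-aware accounting.

```python
import math


def amplify_by_poisson_subsampling(epsilon, delta, rate):
    """Return (epsilon', delta') after Poisson subsampling with inclusion rate `rate`.

    Uses epsilon' = log(1 + rate * (exp(epsilon) - 1)) and delta' = rate * delta.
    log1p/expm1 keep the computation accurate for small epsilon and small rate.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    if epsilon < 0.0 or not 0.0 <= delta <= 1.0:
        raise ValueError("need epsilon >= 0 and delta in [0, 1]")
    amplified_epsilon = math.log1p(rate * math.expm1(epsilon))
    return amplified_epsilon, rate * delta


eps_amp, delta_amp = amplify_by_poisson_subsampling(epsilon=1.0, delta=1e-5, rate=0.01)
```

For small rates the amplified epsilon is roughly the rate times (e^ε − 1), so a 1% sampling rate shrinks the per-step privacy cost by about two orders of magnitude in this example.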
5. Poisson Subsampling in Bayesian and Monte Carlo Inference
The block-Poisson estimator provides an exact, unbiased estimator for data likelihood in pseudo-marginal and Hamiltonian MCMC. The estimator is constructed via a product of independent Poisson expansions, each applied to a batch-corrected, variance-reduced control variate of the log-likelihood. This design allows for:
- Tuning the variance via mini-batch size, Poisson rate, and block count.
- Positive correlation induction across MCMC draws for improved chain mixing, with the induced correlation governed by the number of blocks.
- Importance weighting and signed-correction schemes in the presence of possibly negative estimates.
- Empirical acceleration over Firefly MCMC and Zig-Zag methods, with substantial reported efficiency gains in high dimensions and for large datasets (Quiroz et al., 2016).
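The unbiasedness of a product of Poisson expansions follows from the Poisson probability generating function. The derivation below is a sketch in assumed notation, and the exact parameterization in the cited work may differ: $d$ is the quantity to be exponentiated (for example a control-variate-corrected log-likelihood term), $\hat d^{(h,l)}$ are independent unbiased mini-batch estimates of $d$, $a$ is a tuning constant, and $\lambda$ is the number of blocks, each using a Poisson(1) number of terms.

```latex
\begin{align*}
\xi_l &= \exp\!\Bigl(\frac{a+\lambda}{\lambda}\Bigr)
         \prod_{h=1}^{X_l} \frac{\hat d^{(h,l)} - a}{\lambda},
         \qquad X_l \sim \mathrm{Poisson}(1), \quad l = 1, \dots, \lambda \\
E[\xi_l] &= \exp\!\Bigl(\frac{a+\lambda}{\lambda}\Bigr)\,
            E\!\left[\Bigl(\frac{d-a}{\lambda}\Bigr)^{X_l}\right]
          = \exp\!\Bigl(\frac{a+\lambda}{\lambda}\Bigr)
            \exp\!\Bigl(\frac{d-a}{\lambda} - 1\Bigr)
          = \exp\!\Bigl(\frac{d}{\lambda}\Bigr) \\
E\!\left[\prod_{l=1}^{\lambda} \xi_l\right]
         &= \prod_{l=1}^{\lambda} E[\xi_l] = \exp(d)
\end{align*}
```

The second line uses $E[z^{X}] = e^{z-1}$ for $X \sim \mathrm{Poisson}(1)$ together with the independence of the estimates within a block. Individual factors can be negative when $\hat d^{(h,l)} < a$, which is why signed-correction schemes appear in the list above.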
6. Geometric and Computer Graphics Applications: Poisson-disk Subsampling
In computational geometry, Poisson-disk (blue-noise) subsampling generalizes the Poisson principle to select spatial subsets of 3D point clouds or meshes with maximal minimum interpoint distance:
- Objective: maximize $\min_{i \neq j} \lVert p_i - p_j \rVert$ for $p_i, p_j$ in the selected subset, targeting spatial uniformity.
- Efficient algorithms combine $k$-nearest neighbors cost definitions, voxel hashed point storage, and deferred local recomputation for out-of-core scalability.
- Feature-aware extensions modify the inclusion cost based on surface normals or color similarity, preserving sharp geometric and textural structures.
- The resulting method yields high visual uniformity and scales to very large point clouds with linear time/space complexity (Comino-Trinidad et al., 2023).
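The role of hashed voxel storage can be illustrated with a much simpler relative of these methods. The sketch below is plain greedy dart throwing with a fixed radius, swapped in for brevity; it is not the nearest-neighbour cost formulation, the feature-aware variant, or the out-of-core machinery of the cited work, and the function name `poisson_disk_subsample` and the random test cloud are assumptions for the example.

```python
import math
import random
from itertools import product


def poisson_disk_subsample(points, radius, rng):
    """Keep a 3D point only if no previously kept point lies closer than `radius`.

    Kept points are stored in a hash grid with cell size `radius`, so each
    candidate is compared only against points in the 27 surrounding cells.
    """
    grid = {}
    kept = []
    order = list(range(len(points)))
    rng.shuffle(order)   # random visiting order avoids bias from the input ordering
    for idx in order:
        p = points[idx]
        cell = tuple(int(math.floor(c / radius)) for c in p)
        neighbours = (
            q
            for offset in product((-1, 0, 1), repeat=3)
            for q in grid.get(tuple(c + o for c, o in zip(cell, offset)), ())
        )
        if all(math.dist(p, q) >= radius for q in neighbours):
            kept.append(p)
            grid.setdefault(cell, []).append(p)
    return kept


rng = random.Random(1)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(500)]
subset = poisson_disk_subsample(cloud, 0.15, rng)
```

With cell size equal to the radius, any point closer than the radius must fall in an adjacent cell, so the neighbourhood check is exact while the cost per candidate stays bounded.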
7. Limitations and Open Directions
- Sensitivity to pilot estimation: Poisson optimality depends on credible pilot estimates of the unknown quantities entering the optimal probabilities (e.g., the model parameters), a dependence that two-stage pilot-main routines mitigate.
- Random sample size: The Poisson approach yields a random realized subsample size, concentrated around expectation, but may be less desirable in applications with fixed quotas (Wang et al., 2022).
- Sublinear coresets in complex or nonstandard models: For higher-order root-link or non-glm loss functions, the domain-shifting and sensitivity framework may not yield sublinear coresets, leaving open the need for new summarization paradigms (Lie et al., 2024).
- Downstream privacy loss in complex subsampling regimes: Proper accounting in correlated or structured sampling (e.g., banded noise, $b$-min-sep) requires advanced Monte Carlo privacy accounting or dynamic programming, and is an active area of research (Dong et al., 10 Feb 2026).
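The concentration of the realized sample size mentioned in the list above can be quantified directly in the without-replacement case. The short derivation below assumes independent Bernoulli indicators $S_i$ with success probabilities $\pi_i$ summing to the target size $n$; the Chebyshev bound is included only as a simple illustration and is not taken from the cited work.

```latex
\begin{align*}
|\mathcal{S}| &= \sum_{i=1}^{N} S_i, \qquad S_i \sim \mathrm{Bernoulli}(\pi_i) \text{ independently} \\
E\,|\mathcal{S}| &= \sum_{i=1}^{N} \pi_i = n, \qquad
\operatorname{Var}\,|\mathcal{S}| = \sum_{i=1}^{N} \pi_i (1 - \pi_i) \;\le\; n \\
P\bigl(\,\bigl|\,|\mathcal{S}| - n\,\bigr| \ge t\sqrt{n}\,\bigr) &\le \frac{1}{t^{2}} \qquad \text{for every } t > 0
\end{align*}
```

The relative fluctuation of the sample size is therefore of order $1/\sqrt{n}$, which is negligible for large targets but can matter when a hard quota or fixed memory budget must be met exactly.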
Summary Table: Key Poisson Subsampling Variants
| Variant/Domain | Principle | Optimality/Statistical Feature |
|---|---|---|
| PO–WOR (classical) | Bernoulli($\pi_i$) inclusion | Asymptotically normal, lower variance, streaming-friendly (Imberg et al., 2023, Wang et al., 2022) |
| Block-Poisson MCMC | Poisson series expansion | Exact likelihood estimation, gradient blocking, strong efficiency (Quiroz et al., 2016) |
| Differential Privacy (DP) | Amplification via Poisson | Tight $(\varepsilon, \delta)$ accounting, PLD transforms, implementation at scale (Chua et al., 2024, Feldman et al., 19 Feb 2026) |
| Coreset/GLM | Sensitivity sampling | Sublinear-size coresets in low-complexity regimes (Lie et al., 2024) |
| Geometric/PointCloud | Spatial $k$-NN Poisson-disk | Maximal min distance, out-of-core, feature-extensions (Comino-Trinidad et al., 2023) |
The Poisson subsampling paradigm and its optimal variants constitute a unifying thread across scalable statistical computation, efficient sampling design, advanced Monte Carlo methods, high-dimensional data privacy, and pragmatic geometric modeling. Research continues into sharper optimality criteria, domain-specific coreset construction, refined privacy accounting in structured subsampling procedures, and algorithms for ever-larger data and more intricate stochastic processes.