Resample-Aggregate Framework
- The Resample-Aggregate Framework is a principled algebraic method that unifies diverse randomized sampling techniques using Generalized Uniform Sampling (GUS) for unbiased aggregate estimation.
- It leverages SOA-equivalence to allow reordering of sampling operators with selection and join operations, optimizing query processing in relational databases.
- The framework's design supports automated computation of variance and confidence intervals, though extending its methods to non-linear aggregates remains an open challenge.
The Resample-Aggregate Framework provides a principled algebraic foundation for approximate aggregate estimation in relational databases, centering on the abstraction of Generalized Uniform Sampling (GUS). By unifying a broad array of randomized relational sampling operators under a single theoretical umbrella, the framework enables systematic derivation of unbiased estimators and their variance, supports automated confidence interval calculation, and allows the commutation of sampling operators with core relational algebra elements—thereby streamlining integration of sampling-based analytics within query execution engines.
1. Generalized Uniform Sampling: Definition and Scope
Generalized Uniform Sampling (GUS) is defined as a class of randomized sampling methods operating over the cross product of base relations, $R_1 \times R_2 \times \cdots \times R_n$. A GUS operator $\Gamma_{a,B}$ samples a subset $S$ of tuples with the following core properties:
- The marginal inclusion probability of any tuple $t$, $P(t \in S) = a$, depends only on its lineage (i.e., tuple identifiers), not its contents.
- The joint inclusion probability for any pair of tuples $t$ and $t'$ depends only on the set $D$ of component relations on which $t$ and $t'$ share lineage: $P(t \in S \wedge t' \in S) = b_D$.
This formalism subsumes Bernoulli sampling, fixed-size sampling without replacement, block/chained sampling, and other complex schemes, each parameterizable through the appropriate $a$ and $\{b_D\}$. Under GUS, many conventional and advanced sampling algorithms found in SQL-based systems are unified, facilitating a data-agnostic approach to aggregate estimation.
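As a concrete illustration, the following Python sketch encodes the $(a, \{b_D\})$ parameterization and instantiates two common schemes as GUS members. All names here (`GUSParams`, `bernoulli_gus`, `srswor_gus`) are hypothetical helpers for exposition, not identifiers from the framework:

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet

@dataclass
class GUSParams:
    """GUS parameters: marginal inclusion probability `a`, and joint
    inclusion probabilities `b[D]` indexed by the set D of base
    relations on which a pair of tuples shares lineage."""
    a: float
    b: Dict[FrozenSet[str], float] = field(default_factory=dict)

def bernoulli_gus(rel: str, p: float) -> GUSParams:
    """Bernoulli(p) sampling of a single relation as a GUS instance:
    a tuple paired with itself shares lineage on {rel} (b = p), while
    two distinct tuples share lineage on the empty set and are
    included independently (b = p^2)."""
    return GUSParams(a=p, b={frozenset([rel]): p, frozenset(): p * p})

def srswor_gus(rel: str, n: int, k: int) -> GUSParams:
    """Fixed-size sampling of k out of n tuples without replacement:
    marginal a = k/n, and two distinct tuples co-occur with
    probability k(k-1) / (n(n-1))."""
    return GUSParams(
        a=k / n,
        b={frozenset([rel]): k / n,
           frozenset(): (k * (k - 1)) / (n * (n - 1))},
    )
```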
2. Algebraic Equivalence and Commutativity
The framework introduces Second Order Analytical (SOA) equivalence: two potentially randomized query plans, $\mathcal{P}_1$ and $\mathcal{P}_2$, are SOA-equivalent if, for every sum-like aggregate function $f$,
$$E[f(\mathcal{P}_1)] = E[f(\mathcal{P}_2)] \quad \text{and} \quad \mathrm{Var}(f(\mathcal{P}_1)) = \mathrm{Var}(f(\mathcal{P}_2)),$$
where $f(\mathcal{P})$ applies $f$ as an aggregate over the output of plan $\mathcal{P}$.
Main commutativity properties established:
- Selection-GUS Commutativity (Proposition 2): $\sigma_c(\Gamma_{a,B}(R)) \equiv_{\mathrm{SOA}} \Gamma_{a,B}(\sigma_c(R))$.
The selection predicate $\sigma_c$ can be reordered with a GUS sampling operator without affecting the analysis of aggregate estimator expectation or variance.
- Join-GUS Commutativity (Proposition 3): When relations are independently sampled by GUS operators with disjoint lineage,
$$\Gamma_{a_1,B_1}(R) \bowtie \Gamma_{a_2,B_2}(S) \equiv_{\mathrm{SOA}} \Gamma_{a,B}(R \bowtie S),$$
with combined parameters $a = a_1 a_2$ and $b_{D_1 \cup D_2} = b_{1,D_1} \cdot b_{2,D_2}$.
The net consequence: sampling operators in complex query plans can be “pushed up” to just before the aggregate, reducing the problem to analysis over a single GUS sample.
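A minimal sketch of the join-combination rule, reusing the hypothetical `GUSParams` container from the earlier sketch; the union-over-lineage-sets product below is exactly the $a = a_1 a_2$, $b_{D_1 \cup D_2} = b_{1,D_1} \cdot b_{2,D_2}$ rule stated above:

```python
def join_gus(g1: GUSParams, g2: GUSParams) -> GUSParams:
    """Combine two independent GUS operators over relations with
    disjoint lineage (join commutativity): marginals multiply, and
    every pair of lineage sets D1, D2 combines as
    b[D1 | D2] = b1[D1] * b2[D2]."""
    combined = {}
    for d1, b1 in g1.b.items():
        for d2, b2 in g2.b.items():
            combined[d1 | d2] = b1 * b2   # unions are unique: lineages disjoint
    return GUSParams(a=g1.a * g2.a, b=combined)
```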
3. Unbiased Aggregate Estimation and Variance Expressions
For an aggregate $Q = \sum_{t} f(t)$ over the full join/cross-product result and a GUS sample $S$ with parameters $(a, \{b_D\})$, the unbiased estimator is
$$\hat{X} = \frac{1}{a} \sum_{t \in S} f(t),$$
with mean $E[\hat{X}] = \sum_{t} f(t) = Q$.
Variance is given as a linear combination:
$$\mathrm{Var}(\hat{X}) = \sum_{D} c_D \, y_D, \qquad y_D = \sum_{(t,t')\ \text{sharing lineage exactly on}\ D} f(t)\,f(t'), \qquad c_D = \frac{b_D}{a^2} - 1,$$
with the $y_D$ denoting appropriate groupings of tuple pairs by lineage, and the $c_D$ functions of the GUS parameters. This concise algebraic structure enables automated, SQL-implementable computation of confidence intervals.
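The estimator and a variance estimate can be computed directly from these formulas. The sketch below (hypothetical names, continuing the `GUSParams` sketch above) estimates each population moment $y_D$ from the sample by dividing the observed pair sum by $b_D$, since a pair sharing lineage exactly on $D$ survives sampling with probability $b_D$:

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# A sampled row is (lineage, f_value), where lineage maps each base
# relation name to the tuple identifier that produced this row.
Sampled = Tuple[Dict[str, int], float]

def shared_lineage(t1: Sampled, t2: Sampled) -> FrozenSet[str]:
    """The set D of base relations on which two rows agree."""
    l1, l2 = t1[0], t2[0]
    return frozenset(r for r in l1 if r in l2 and l1[r] == l2[r])

def estimate(sample: List[Sampled], g: GUSParams) -> Tuple[float, float]:
    """Unbiased estimate X_hat = (1/a) * sum f(t), plus an estimate of
    Var(X_hat) = sum_D c_D * y_D with c_D = b_D / a^2 - 1."""
    x_hat = sum(f for _, f in sample) / g.a
    var_hat = 0.0
    # Distinct pairs: each y_D is estimated by the observed pair sum / b_D.
    for t1, t2 in combinations(sample, 2):
        d = shared_lineage(t1, t2)
        b_d = g.b.get(d, g.a ** 2)   # fully independent pairs: b_D = a^2
        c_d = b_d / g.a ** 2 - 1.0
        var_hat += 2.0 * c_d * t1[1] * t2[1] / b_d  # counts (t,t') and (t',t)
    # Self-pairs: a row paired with itself shares its full lineage,
    # and P(t in S twice) = P(t in S) = a.
    for lineage, f in sample:
        b_d = g.b.get(frozenset(lineage), g.a)
        var_hat += (b_d / g.a ** 2 - 1.0) * f * f / b_d
    return x_hat, var_hat
```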
Standard intervals:
- Normal-based (optimistic, 95%): $\hat{X} \pm 1.96 \sqrt{\mathrm{Var}(\hat{X})}$
- Chebyshev-based (conservative, coverage $1-\delta$): $\hat{X} \pm \sqrt{\mathrm{Var}(\hat{X})/\delta}$, roughly $\pm 4.47\sqrt{\mathrm{Var}(\hat{X})}$ at 95%
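Both intervals follow mechanically from the estimate and variance; a small helper (hypothetical, assuming the `estimate` sketch above) makes the contrast explicit:

```python
import math

def confidence_intervals(x_hat: float, var_hat: float, delta: float = 0.05):
    """Normal-based interval uses the z-score (1.96 at 95%); the
    Chebyshev interval uses P(|X - mu| >= k*sigma) <= 1/k^2, so
    k = 1/sqrt(delta) gives distribution-free coverage 1 - delta."""
    sigma = math.sqrt(max(var_hat, 0.0))
    normal = (x_hat - 1.96 * sigma, x_hat + 1.96 * sigma)
    k = 1.0 / math.sqrt(delta)   # about 4.47 when delta = 0.05
    chebyshev = (x_hat - k * sigma, x_hat + k * sigma)
    return normal, chebyshev
```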
4. System Integration and Practical Use
By expressing sampling behavior in terms of GUS, a database engine requires only lineage (tuple IDs) tracking and the evaluation of the GUS parameters for each sampled path. This minimal change allows integration into production database systems, as illustrated using an example on the TPC-H schema with per-table sampling. The process involves:
- Translating each TABLESAMPLE or similar sampling statement into GUS parameters $(a, \{b_D\})$.
- Rewriting query plans algebraically to move all sampling to a single GUS operator pre-aggregate (using SOA-equivalence and commutativity).
- Routing sampled tuples and their lineages to an “estimation component” (called “SBox”), which computes unbiased aggregate and confidence bounds via the equations above.
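Under the simplifying assumption of per-table Bernoulli sampling, the three steps can be strung together with the earlier sketches (all identifiers hypothetical; the rates and rows are illustrative only):

```python
# Hypothetical flow for a TPC-H-style query such as
#   SELECT SUM(l_extendedprice) FROM lineitem JOIN orders ON ...
# with lineitem sampled at 1% and orders at 10%.
g_lineitem = bernoulli_gus("lineitem", 0.01)   # step 1: per-table params
g_orders = bernoulli_gus("orders", 0.10)

g_plan = join_gus(g_lineitem, g_orders)        # step 2: push above the join

# Step 3: the estimation component ("SBox") consumes sampled join rows,
# each carrying its lineage, and returns the estimate with bounds.
sample = [({"lineitem": 7, "orders": 3}, 1250.00),   # illustrative rows
          ({"lineitem": 9, "orders": 3}, 830.50)]
x_hat, var_hat = estimate(sample, g_plan)
normal_ci, chebyshev_ci = confidence_intervals(x_hat, var_hat)
```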
Potential system-side limitations include expensive GROUP BY steps for computing the second-order moments (the $y_D$ terms) over intermediate datasets and the lack of direct support for non-linear aggregates.
5. Extensions and Limitations
The framework’s algebra is tailored to linear (SUM-like) aggregates. Extension to non-linear aggregates (e.g., AVERAGE, MIN, MAX, DISTINCT) is presently unresolved, though the use of the delta method or similar statistical tools is suggested for future research. Theoretical extension to duplicate-producing sampling, such as true sampling with replacement, would require an enhanced algebraic treatment allowing sampling operators to be viewed as multiset-valued rather than set-valued. Self-joins introduce challenges due to tuple dependency, and exact variance accounting would require joint probabilities involving higher-order tuple collections, not just pairs.
6. Research Trajectory and Open Problems
Highlighted directions for further development include:
- Integrating approximate methods (e.g., delta method) to treat nonlinear function aggregates,
- Systematic incorporation of random set theory to analyze more involved relational constructs (such as self-joins),
- Extension to handle duplicate-including sampling schemes,
- Practical dynamic tuning (“simulation”) of sampling parameters to meet query error or latency objectives in streaming and adaptive workloads,
- Evaluation of the framework’s computational bottlenecks (notably in terms of lineage-tracking and generalized moment computation) in very large-scale production scenarios.
7. Summary Table: Core Elements of the Resample-Aggregate Algebra
Concept | Formalism | Implementation Consequence |
---|---|---|
GUS Operator | $\Gamma_{a,B}$ over $R_1 \times \cdots \times R_n$, parameters $a$, $\{b_D\}$ | Unifies sampling strategies |
SOA-Equivalence | Equal mean/variance for sum-like aggregates | Safe operator commutation rules |
Variance Calc. | $\mathrm{Var}(\hat{X}) = \sum_D c_D \, y_D$ | Enables automated confidence bounds |
Commutativity | Algebraic equivalence for selection/join | Sampling "push-up" optimization |
The Resample-Aggregate Framework thereby provides a rigorous, algebraic, and system-compatible foundation for integrating advanced sampling-based aggregate estimation into relational query processing. Its generalization and compositionality pave the way for systems that combine performance and statistical rigor in large-scale data analytics.