Resample-Aggregate Framework

Updated 21 August 2025
  • The Resample-Aggregate Framework is a principled algebraic method that unifies diverse randomized sampling techniques using Generalized Uniform Sampling (GUS) for unbiased aggregate estimation.
  • It leverages SOA-equivalence to allow reordering of sampling operators with selection and join operations, optimizing query processing in relational databases.
  • The framework's design supports automated computation of variance and confidence intervals, though extending its methods to non-linear aggregates remains an open challenge.

The Resample-Aggregate Framework provides a principled algebraic foundation for approximate aggregate estimation in relational databases, centering on the abstraction of Generalized Uniform Sampling (GUS). By unifying a broad array of randomized relational sampling operators under a single theoretical umbrella, the framework enables systematic derivation of unbiased estimators and their variance, supports automated confidence interval calculation, and allows sampling operators to commute with core relational algebra operators. Together these properties streamline the integration of sampling-based analytics into query execution engines.

1. Generalized Uniform Sampling: Definition and Scope

Generalized Uniform Sampling (GUS) is defined as a class of randomized sampling methods operating over the cross product of base relations, $R = R_1 \times R_2 \times \ldots \times R_n$. A GUS operator samples a subset $\mathcal{R}$ of tuples with the following core properties:

  • The marginal inclusion probability $a = P[t \in \mathcal{R}]$ of any tuple $t$ depends on its lineage (i.e., tuple identifiers), not its contents.
  • The joint inclusion probability for any pair of tuples, $b_T = P[t, t' \in \mathcal{R} \mid \text{for all } i \in T,\ t_i = t'_i;\ \text{for } j \notin T,\ t_j \neq t'_j]$, depends only on the set $T$ of components (relations) on which $t$ and $t'$ share lineage.

This formalism subsumes Bernoulli sampling, fixed-size sampling without replacement, block/chained sampling, and other complex schemes, each parameterizable through the appropriate $a$ and $b_T$. Under GUS, many conventional and advanced sampling algorithms found in SQL-based systems are unified, facilitating a data-agnostic approach to aggregate estimation.
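As a minimal sketch of how two familiar schemes map to GUS parameters for a single relation, the snippet below returns $a$ and the $b_T$ values as a dictionary keyed by the set $T$ of relations on which two tuples share lineage. The function names, the relation label `"R"`, and the dictionary representation are illustrative assumptions, not part of the framework itself; the probabilities are the standard inclusion probabilities for each scheme.

```python
from fractions import Fraction

def bernoulli_gus(p, rel="R"):
    """GUS parameters for Bernoulli sampling of one relation at rate p."""
    p = Fraction(p)
    a = p
    b = {
        frozenset(): p * p,      # two distinct tuples: independent inclusion
        frozenset({rel}): p,     # same tuple: joint probability equals a
    }
    return a, b

def without_replacement_gus(k, n, rel="R"):
    """GUS parameters for drawing k of n tuples without replacement (n >= 2)."""
    a = Fraction(k, n)
    b = {
        frozenset(): Fraction(k * (k - 1), n * (n - 1)),  # two distinct tuples
        frozenset({rel}): a,                              # same tuple
    }
    return a, b

a, b = bernoulli_gus(Fraction(1, 10))
print(a, b[frozenset()], b[frozenset({"R"})])
```

Exact fractions are used here only so that the parameters can be compared without floating-point noise.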

2. Algebraic Equivalence and Commutativity

The framework introduces Second Order Analytical (SOA) equivalence: two potentially randomized query plans, $\mathcal{E}(R)$ and $\mathcal{F}(R)$, are SOA-equivalent if for all sum-like aggregate functions $f$,

$$\mathbb{E}[\mathcal{A}_f(\mathcal{E}(R))] = \mathbb{E}[\mathcal{A}_f(\mathcal{F}(R))] \quad\text{and}\quad \text{Var}[\mathcal{A}_f(\mathcal{E}(R))] = \text{Var}[\mathcal{A}_f(\mathcal{F}(R))]$$

where $\mathcal{A}_f(Q)$ applies $f$ as an aggregate over $Q$.

Main commutativity properties established:

  • Selection-GUS Commutativity (Proposition 2):

$$\sigma_C(\mathcal{G}(\cdot)) \equiv_{\text{SOA}} \mathcal{G}(\sigma_C(\cdot))$$

The selection predicate can be reordered with a GUS sampling operator without affecting the analysis of aggregate estimator expectation or variance.

  • Join-GUS Commutativity (Proposition 3):

When relations $R, S$ are independently sampled by GUS operators $\mathcal{G}_1, \mathcal{G}_2$ with disjoint lineage,

$$\mathcal{G}_1(R) \bowtie \mathcal{G}_2(S) \equiv_{\text{SOA}} \mathcal{G}(R \bowtie S)$$

with combined parameters $a = a_1 a_2$ and $b_T = b_{T_1}^{(1)} b_{T_2}^{(2)}$.

The net consequence: sampling operators in complex query plans can be “pushed up” to just before the aggregate, reducing the problem to analysis over a single GUS sample.
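The parameter combination in the join rule can be sketched in a few lines. The function name `join_gus` and the representation of $b_T$ as a dictionary keyed by frozensets of relation names are assumptions made for illustration; the arithmetic follows $a = a_1 a_2$ and $b_T = b_{T_1}^{(1)} b_{T_2}^{(2)}$ with $T = T_1 \cup T_2$, which relies on the two operators having disjoint lineage.

```python
from fractions import Fraction
from itertools import product

def join_gus(a1, b1, a2, b2):
    """Combine the parameters of two independent GUS operators.

    b1 and b2 map frozensets of relation names to joint inclusion
    probabilities; the two operators must not share any relation.
    """
    a = a1 * a2
    b = {
        t1 | t2: v1 * v2
        for (t1, v1), (t2, v2) in product(b1.items(), b2.items())
    }
    return a, b

# Bernoulli sampling at rate 1/10 on R and 1/2 on S (illustrative values).
p, q = Fraction(1, 10), Fraction(1, 2)
b_r = {frozenset(): p * p, frozenset({"R"}): p}
b_s = {frozenset(): q * q, frozenset({"S"}): q}

a, b = join_gus(p, b_r, q, b_s)
for lineage, prob in sorted(b.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(lineage), prob)
```

Applying the function repeatedly folds any number of independently sampled inputs into one parameter set, which is what lets the analysis work on a single GUS operator placed just before the aggregate.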

3. Unbiased Aggregate Estimation and Variance Expressions

For an aggregate $A = \sum_t f(t)$ and a GUS sample, the unbiased estimator is

$$X = \frac{1}{a} \sum_{t \in \mathcal{R}} f(t)$$

with mean $\mathbb{E}[X] = A$.

Variance is given as a linear combination:

$$\sigma^2(X) = \sum_{S \subseteq \{1:n\}} \frac{c_S}{a^2} y_S - y_\emptyset$$

where

$$y_S = \sum_{t \in T_S^*} \sum_{t' \in T_{S, t}^*} f(t) f(t')$$
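Unbiasedness can be checked exactly on a toy relation by enumerating every possible sample. The sketch below assumes Bernoulli sampling at rate $p$ (so $a = p$); the function name and the data values are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

def expected_estimate(values, p):
    """Exact E[X] for X = (1/a) * sum of sampled values under Bernoulli(p).

    Enumerates all 2^n samples, so it is only meant for tiny inputs.
    """
    n = len(values)
    a = p  # marginal inclusion probability for Bernoulli sampling
    total = Fraction(0)
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            prob = p ** k * (1 - p) ** (n - k)   # probability of this sample
            x = sum(values[i] for i in idx) / a  # scaled-up estimate
            total += prob * x
    return total

values = [3, 5, 8, 13]
print(expected_estimate(values, Fraction(1, 4)), sum(values))
```

The two printed numbers agree because each tuple contributes $f(t)/a$ with probability $a$, so its expected contribution is exactly $f(t)$.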

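with $T_S^*$ and $T_{S, t}^*$ denoting appropriate groupings by lineage. The coefficients $c_S$ are functions of the $b_T$ parameters:

$$c_S = \sum_{T \subseteq S} (-1)^{|T|+|S|} b_T$$

As a check, Bernoulli sampling of a single relation at rate $p$ has $a = p$, $b_\emptyset = p^2$, and $b_{\{1\}} = p$, so $c_\emptyset = p^2$ and $c_{\{1\}} = p - p^2$. With $y_\emptyset = (\sum_t f(t))^2$ and $y_{\{1\}} = \sum_t f(t)^2$, the variance reduces to $\frac{1-p}{p} \sum_t f(t)^2$, the familiar Bernoulli result. This concise algebraic structure enables automated, SQL-implementable computation of confidence intervals.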
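The variance expression can be checked numerically on a small cross product. The sketch below rests on two assumptions that are labeled in the code: the inner sum in $c_S$ ranges over subsets $T$ of $S$, and $y_S$ is computed by grouping tuples on the lineage components in $S$ and summing the squared group totals (equivalent to summing $f(t) f(t')$ over ordered pairs that agree on $S$). All names and data values are hypothetical; the closed form is compared against an exact enumeration of every Bernoulli sample.

```python
from fractions import Fraction
from itertools import combinations, product

def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def gus_variance(tuples, f, a, b, n):
    """Closed form: sum over S of (c_S / a^2) * y_S, minus y_empty."""
    total = Fraction(0)
    for S in subsets(range(n)):
        # Assumption: T ranges over the subsets of S.
        c_S = sum((-1) ** (len(T) + len(S)) * b[T] for T in subsets(S))
        # Assumption: y_S = sum over lineage groups on S of (group total)^2.
        groups = {}
        for t in tuples:
            key = tuple(t[i] for i in sorted(S))
            groups[key] = groups.get(key, 0) + f(t)
        y_S = sum(v * v for v in groups.values())
        total += c_S / (a * a) * y_S
    y_empty = sum(f(t) for t in tuples) ** 2
    return total - y_empty

def brute_force_variance(n_r, n_s, f, p, q):
    """Exact Var[X] by enumerating every Bernoulli sample of both relations."""
    mean = second = Fraction(0)
    for sr in subsets(range(n_r)):
        for ss in subsets(range(n_s)):
            prob = (p ** len(sr) * (1 - p) ** (n_r - len(sr))
                    * q ** len(ss) * (1 - q) ** (n_s - len(ss)))
            x = sum(f((i, j)) for i in sr for j in ss) / (p * q)
            mean += prob * x
            second += prob * x * x
    return second - mean * mean

# Toy data: f(t) is the product of one value from each relation.
R_vals, S_vals = [2, 3], [1, 4, 5]
p, q = Fraction(1, 2), Fraction(1, 3)
f = lambda t: R_vals[t[0]] * S_vals[t[1]]
tuples = list(product(range(len(R_vals)), range(len(S_vals))))
a = p * q
b = {frozenset(): p * p * q * q, frozenset({0}): p * q * q,
     frozenset({1}): p * p * q, frozenset({0, 1}): p * q}

closed_form = gus_variance(tuples, f, a, b, n=2)
exact = brute_force_variance(len(R_vals), len(S_vals), f, p, q)
print(closed_form, exact)
```

The grouping step is the part that maps to GROUP BY in SQL: one grouped pass per subset $S$, which is why the number of moments grows as $2^n$ in the number of sampled relations.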

Standard intervals:

  • Normal-based (optimistic, 95%): $[X - 1.96\sigma,\ X + 1.96\sigma]$
  • Chebyshev-based (conservative, 95%): $[X - 4.47\sigma,\ X + 4.47\sigma]$
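Both intervals follow directly from the estimate and its standard deviation. In the sketch below the function names are illustrative; the Chebyshev multiplier comes from the inequality $P(|X - \mu| \ge k\sigma) \le 1/k^2$, so a 95% bound needs $k = \sqrt{1/0.05} \approx 4.47$, matching the value quoted above.

```python
import math

def normal_interval(x, sigma, z=1.96):
    """Interval assuming the estimator is approximately normal (95% for z=1.96)."""
    return (x - z * sigma, x + z * sigma)

def chebyshev_interval(x, sigma, confidence=0.95):
    """Distribution-free interval from Chebyshev's inequality."""
    k = math.sqrt(1.0 / (1.0 - confidence))  # about 4.47 for 95%
    return (x - k * sigma, x + k * sigma)

estimate, sigma = 1000.0, 25.0  # illustrative values
print(normal_interval(estimate, sigma))
print(chebyshev_interval(estimate, sigma))
```

The Chebyshev interval is more than twice as wide because it makes no assumption about the shape of the estimator's distribution.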

4. System Integration and Practical Use

By expressing sampling behavior in terms of GUS, a database engine requires only lineage (tuple IDs) tracking and the evaluation of the GUS parameters for each sampled path. This minimal change allows integration into production database systems, as illustrated using an example on the TPC-H schema with per-table sampling. The process involves:

  • Translating each TABLESAMPLE or similar sampling statement into GUS parameters ($a$, $b_T$).
  • Rewriting query plans algebraically to move all sampling to a single GUS operator pre-aggregate (using SOA-equivalence and commutativity).
  • Routing sampled tuples and their lineages to an “estimation component” (called “SBox”), which computes unbiased aggregate and confidence bounds via the equations above.

Potential system-side limitations include expensive GROUP BY steps for computing the $y_S$ moments in intermediate datasets and lack of direct support for non-linear aggregates.

5. Extensions and Limitations

The framework’s algebra is tailored to linear (SUM-like) aggregates. Extension to non-linear aggregates (e.g., AVERAGE, MIN, MAX, DISTINCT) is presently unresolved, though the use of the delta method or similar statistical tools is suggested for future research. Theoretical extension to duplicate-producing sampling, such as true sampling with replacement, would require an enhanced algebraic treatment allowing sampling operators to be viewed as multiset-valued rather than set-valued. Self-joins introduce challenges due to tuple dependency, and exact variance accounting would require joint probabilities involving higher-order tuple collections, not just pairs.
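To make the delta-method suggestion concrete, one standard first-order approximation for a ratio such as AVERAGE = SUM / COUNT is written out below. This is a textbook form offered as an illustration, not a result taken from the framework; $X$ and $Y$ are assumed to be the GUS estimators of the sum and the count, with means $\mu_X$ and $\mu_Y$.

```latex
\widehat{\mathrm{AVG}} = \frac{X}{Y}, \qquad
\operatorname{Var}\!\left[\frac{X}{Y}\right] \approx
\frac{\operatorname{Var}[X]}{\mu_Y^{2}}
- \frac{2\,\mu_X}{\mu_Y^{3}}\,\operatorname{Cov}[X, Y]
+ \frac{\mu_X^{2}}{\mu_Y^{4}}\,\operatorname{Var}[Y]
```

Because $X + Y$ is itself a sum-like aggregate, the covariance term could in principle be obtained from three variance computations via $\operatorname{Cov}[X, Y] = \tfrac{1}{2}\left(\operatorname{Var}[X + Y] - \operatorname{Var}[X] - \operatorname{Var}[Y]\right)$.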

6. Research Trajectory and Open Problems

Highlighted directions for further development include:

  • Integrating approximate methods (e.g., delta method) to treat nonlinear function aggregates,
  • Systematic incorporation of random set theory to analyze more involved relational constructs (such as self-joins),
  • Extension to handle duplicate-including sampling schemes,
  • Practical dynamic tuning (“simulation”) of sampling parameters to meet query error or latency objectives in streaming and adaptive workloads,
  • Evaluation of the framework’s computational bottlenecks (notably in terms of lineage-tracking and generalized moment computation) in very large-scale production scenarios.

7. Summary Table: Core Elements of the Resample-Aggregate Algebra

| Concept | Formalism | Implementation Consequence |
|---|---|---|
| GUS Operator | $a$, $b_T$ parameters | Unifies sampling strategies |
| SOA-Equivalence | Equal mean/variance for sum-like aggregates | Safe operator commutation rules |
| Variance Calc. | $\sigma^2(X) = \sum_S (c_S/a^2) y_S - y_\emptyset$ | Enables automated confidence bounds |
| Commutativity | Algebraic equivalence for selection/join | Sampling "push-up" optimization |

The Resample-Aggregate Framework thereby provides a rigorous, algebraic, and system-compatible foundation for integrating advanced sampling-based aggregate estimation into relational query processing. Its generalization and compositionality pave the way for systems that combine performance and statistical rigor in large-scale data analytics.