Sampling-based Estimation of the Number of Distinct Values in Distributed Environment (2206.05476v1)

Published 11 Jun 2022 in cs.DB

Abstract: In data mining, estimating the number of distinct values (NDV) is a fundamental problem with various applications. Existing methods for estimating NDV can be broadly classified into two categories: i) scanning-based methods, which scan the entire data and maintain a sketch to approximate NDV; and ii) sampling-based methods, which estimate NDV using sampling data rather than accessing the entire data warehouse. Scanning-based methods achieve a lower approximation error at the cost of higher I/O and more time. Sampling-based estimation is preferable in applications with a large data volume and a permissible error restriction due to its higher scalability. However, while the sampling-based method is more effective on a single machine, it is less practical in a distributed environment with massive data volumes. For obtaining the final NDV estimators, the entire sample must be transferred throughout the distributed system, incurring a prohibitive communication cost when the sample rate is significant. This paper proposes a novel sketch-based distributed method that achieves sub-linear communication costs for distributed sampling-based NDV estimation under mild assumptions. Our method leverages a sketch-based algorithm to estimate the sample's frequency of frequency in the distributed streaming model, which is compatible with most classical sampling-based NDV estimators. Additionally, we provide theoretical evidence for our method's ability to minimize communication costs in the worst-case scenario. Extensive experiments show that our method saves orders of magnitude in communication costs compared to existing sampling- and sketch-based methods.
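
To make the abstract's central quantity concrete, the following is a minimal single-machine sketch of the frequency of frequency of a sample, where f[j] is the number of distinct values that occur exactly j times, together with one classical sampling-based estimator that consumes it. The estimator shown is GEE (the Guaranteed Error Estimator of Charikar et al.), chosen here only as an illustration; the abstract does not say which estimators the paper evaluates. This is not the paper's distributed sketch-based method, and the function names, the toy sample, and the population size of 700 are all assumptions made for this example.

```python
import math
from collections import Counter


def frequency_of_frequency(sample):
    """Return f, where f[j] is the number of distinct values seen exactly j times."""
    value_counts = Counter(sample)          # value -> occurrences in the sample
    return Counter(value_counts.values())   # occurrences -> number of such values


def gee_estimate(sample, population_size):
    """GEE estimate of NDV: sqrt(N / n) * f[1] + sum of f[j] for j >= 2.

    N is the (assumed known) population size and n is the sample size.
    """
    n = len(sample)
    if n == 0:
        return 0.0
    f = frequency_of_frequency(sample)
    singletons = f.get(1, 0)
    repeated = sum(count for j, count in f.items() if j >= 2)
    return math.sqrt(population_size / n) * singletons + repeated


# Hypothetical toy sample: "b" and "c" occur once, "a" twice, "d" three times.
sample = ["a", "b", "a", "c", "d", "d", "d"]
f = frequency_of_frequency(sample)
estimate = gee_estimate(sample, population_size=700)
print(dict(f), estimate)
```

The estimator only needs f, not the raw sample, which is why a method that approximates the frequency of frequency can plug into existing sampling-based estimators without shipping the whole sample between machines.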

