Clustering with Non-adaptive Subset Queries (2409.10908v1)

Published 17 Sep 2024 in cs.DS and cs.LG

Abstract: Recovering the underlying clustering of a set $U$ of $n$ points by asking pair-wise same-cluster queries has garnered significant interest in the last decade. Given a query $S \subset U$, $|S|=2$, the oracle returns yes if the points are in the same cluster and no otherwise. For adaptive algorithms with pair-wise queries, the number of required queries is known to be $\Theta(nk)$, where $k$ is the number of clusters. However, non-adaptive schemes require $\Omega(n^2)$ queries, which matches the trivial $O(n^2)$ upper bound attained by querying every pair of points. To break the quadratic barrier for non-adaptive queries, we study a generalization of this problem to subset queries for $|S|>2$, where the oracle returns the number of clusters intersecting $S$. Allowing for subset queries of unbounded size, $O(n)$ queries is possible with an adaptive scheme (Chakrabarty-Liao, 2024). However, the realm of non-adaptive algorithms is completely unknown. In this paper, we give the first non-adaptive algorithms for clustering with subset queries. Our main result is a non-adaptive algorithm making $O(n \log k \cdot (\log k + \log\log n)^2)$ queries, which improves to $O(n \log \log n)$ when $k$ is a constant. We also consider algorithms with a restricted query size of at most $s$. In this setting we prove that $\Omega(\max(n^2/s^2, n))$ queries are necessary and obtain algorithms making $\tilde{O}(n^2 k/s^2)$ queries for any $s \leq \sqrt{n}$ and $\tilde{O}(n^2/s)$ queries for any $s \leq n$. We also consider the natural special case when the clusters are balanced, obtaining non-adaptive algorithms which make $O(n \log k) + \tilde{O}(k)$ and $O(n \log^2 k)$ queries. Finally, allowing two rounds of adaptivity, we give an algorithm making $O(n \log k)$ queries in the general case and $O(n \log \log k)$ queries when the clusters are balanced.

Authors

Hadley Black
Euiwoong Lee
Arya Mazumdar
Barna Saha

Summary

Clustering with Non-Adaptive Subset Queries

In the paper "Clustering with Non-adaptive Subset Queries" by Hadley Black, Euiwoong Lee, Arya Mazumdar, and Barna Saha, the authors investigate the problem of recovering the underlying clustering of a set $U$ of $n$ points using non-adaptive subset queries. In the query model, an oracle takes a subset $S \subset U$ and returns the number of clusters intersecting $S$. The primary goal is to determine the minimum number of queries needed to exactly recover an arbitrary $k$-clustering.
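To make the query model concrete, here is a minimal Python sketch of the oracle together with the trivial non-adaptive baseline that asks every pair. The names `make_oracle` and `recover_with_pair_queries` are hypothetical and chosen for illustration; this is not one of the paper's algorithms, only the $O(n^2)$ pairwise baseline that the paper sets out to beat.

```python
from itertools import combinations

def make_oracle(labels):
    """Build a subset-query oracle for a hidden clustering.

    labels[i] is the hidden cluster id of point i. The oracle reveals
    only how many distinct clusters a queried subset intersects.
    """
    def query(subset):
        return len({labels[i] for i in subset})
    return query

def recover_with_pair_queries(n, query):
    """Baseline: fix all n*(n-1)/2 pair queries up front, then merge."""
    pairs = list(combinations(range(n), 2))   # chosen before any answer is seen
    answers = [query(p) for p in pairs]       # 1 = same cluster, 2 = different

    parent = list(range(n))                   # union-find over the points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    for (a, b), ans in zip(pairs, answers):
        if ans == 1:
            parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values()), len(pairs)

hidden = [0, 1, 0, 2, 1, 0]                   # illustrative hidden clustering
clusters, num_queries = recover_with_pair_queries(len(hidden), make_oracle(hidden))
print(clusters, num_queries)
```

Because every query is fixed before any answer arrives, the scheme is non-adaptive, but its query count grows quadratically in $n$.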

Summary of Results

The authors provide several results, contrasting scenarios with bounded and unbounded subset queries, adaptive and non-adaptive models, and balanced and unbalanced clusters.

Unbounded Queries:
- Main Result: A non-adaptive algorithm making $O(n \log k \cdot (\log k + \log \log n)^2)$ queries, which improves to $O(n \log \log n)$ when $k$ is a constant.
- Small $k$: An algorithm making $O(n \log \log n \cdot k \log k)$ queries, which is $O(n \log \log n)$ when $k$ is a constant; it relies on combinatorial group testing techniques.
Bounded Queries:
- Lower Bound: With queries of size at most $s$, $\Omega(\max(n^2/s^2, n))$ queries are necessary.
- Query Size $s \leq \sqrt{n}$: A non-adaptive algorithm making $\widetilde{O}(n^2 k / s^2)$ queries, which is near-optimal for constant $k$.
- Query Size $s \leq n$: A non-adaptive algorithm making $\widetilde{O}(n^2/s)$ queries.
Balanced Clusters:
- First Algorithm: For balanced clusters, a non-adaptive algorithm using $O(B^2 n \log k) + \widetilde{O}(k)$ queries for some constant $B$.
- Second Algorithm: A second non-adaptive algorithm using $O(B^2 n \log^2 k)$ queries.
Two Rounds of Adaptivity:
- General Case: A deterministic two-round algorithm making $O(n \log k)$ queries.
- Balanced Case: A randomized two-round algorithm making $O(n \log \log k)$ queries for balanced clusters.
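To get a rough feel for how these bounds compare, the sketch below evaluates their leading terms for concrete values of $n$, $k$ and $s$. It assumes all hidden constants and suppressed polylogarithmic factors equal 1 and uses base-2 logarithms, so the numbers indicate growth rates only and are not actual query counts. The function name `query_bounds` and the dictionary keys are hypothetical.

```python
import math

def query_bounds(n, k, s):
    """Leading terms of the stated bounds.

    Assumption: every hidden constant and polylog factor is set to 1,
    and logs are base 2. The bounded-size entry applies for s <= sqrt(n).
    """
    log_k = math.log2(k)
    loglog_n = math.log2(math.log2(n))
    return {
        "pairwise_nonadaptive": n ** 2,                        # Omega(n^2)
        "pairwise_adaptive": n * k,                            # Theta(nk)
        "subset_nonadaptive_main": n * log_k * (log_k + loglog_n) ** 2,
        "subset_nonadaptive_small_k": n * loglog_n * k * log_k,
        "subset_bounded_size": n ** 2 * k / s ** 2,            # ~O(n^2 k / s^2)
        "subset_two_round": n * log_k,                         # O(n log k)
    }

for name, value in query_bounds(n=2 ** 20, k=8, s=2 ** 10).items():
    print(f"{name:28s} {value:.3e}")
```

Under these assumptions the non-adaptive subset-query bounds sit orders of magnitude below the pairwise non-adaptive $n^2$ term, which is the gap the results above describe.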

Techniques and Approaches

The authors combine combinatorial search techniques with probabilistic methods. A key component is combinatorial group testing, particularly for unbounded query sizes. The transition from pairwise queries, which require $\Omega(n^2)$ queries in the non-adaptive model, to larger subset queries is what enables the improved bounds.
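One way to see why cluster counts carry more information than pairwise answers: two count queries suffice to test whether a point shares a cluster with any member of a whole group, because adding the point raises the count exactly when its cluster is new to the group. The Python sketch below illustrates this elementary observation only; it is not the paper's algorithm, and `make_oracle` and `shares_cluster_with` are hypothetical names.

```python
def make_oracle(labels):
    """Subset-query oracle: number of distinct hidden clusters touched."""
    def query(subset):
        return len({labels[i] for i in subset})
    return query

def shares_cluster_with(x, group, query):
    """True iff x is in the same cluster as at least one point of `group`.

    Both queries are determined by (x, group) alone, so they can be
    issued without seeing any earlier answers.
    """
    group = set(group) - {x}
    return query(group | {x}) == query(group)

hidden = [0, 1, 0, 2, 1, 0]          # illustrative hidden clustering
query = make_oracle(hidden)
print(shares_cluster_with(4, [0, 1, 2], query))  # point 4 shares a cluster with point 1
print(shares_cluster_with(3, [0, 1, 2], query))  # point 3 is alone in its cluster
```

A pairwise scheme would need one query per group member to answer the same question, so the saving grows with the group size.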

Practical Implications

The paper shows that subset queries, combined with randomly chosen subsets, sharply reduce the number of queries a non-adaptive scheme needs: from $\Omega(n^2)$ with pairwise queries to near-linear in $n$ with unbounded subset queries. Non-adaptivity is a desirable property in large-scale crowdsourcing and parallel computing, since all queries can be issued at once without waiting for earlier answers.

Theoretical Implications and Open Problems

The work opens several avenues for further research, particularly around:

- Achieving Linear Query Complexity: Whether a linear number of non-adaptive queries suffices for general $k$-clustering remains an open question.
- Bounded Query Sizes: Improving the query complexity for arbitrary $k$ with smaller subset sizes remains open.
- Adaptive Schemes: Determining how few rounds of adaptivity are needed to reach the minimum possible number of queries.

Conclusion

This paper gives the first non-adaptive algorithms for clustering with subset queries, showing that near-linear query complexity is attainable where pairwise queries require $\Omega(n^2)$. The results are relevant to both theory and practice in machine learning and data mining, where clustering is a fundamental task.
