
Distributed Bayesian clustering using finite mixture of mixtures (2003.13936v2)

Published 31 Mar 2020 in stat.CO and stat.ME

Abstract: In many modern applications, there is interest in analyzing enormous data sets that cannot be easily moved across computers or loaded into memory on a single computer. In such settings, it is very common to be interested in clustering. Existing distributed clustering algorithms are mostly distance or density based without a likelihood specification, precluding the possibility of formal statistical inference. Model-based clustering allows statistical inference, yet research on distributed inference has emphasized nonparametric Bayesian mixture models over finite mixture models. To fill this gap, we introduce a nearly embarrassingly parallel algorithm for clustering under a Bayesian overfitted finite mixture of Gaussian mixtures, which we term distributed Bayesian clustering (DIB-C). DIB-C can flexibly accommodate data sets with various shapes (e.g. skewed or multi-modal). With data randomly partitioned and distributed, we first run Markov chain Monte Carlo in an embarrassingly parallel manner to obtain local clustering draws and then refine across workers for a final clustering estimate based on any loss function on the space of partitions. DIB-C can also estimate cluster densities, quickly classify new subjects and provide a posterior predictive distribution. Both simulation studies and real data applications show superior performance of DIB-C in terms of robustness and computational efficiency.
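The abstract describes a divide-and-combine workflow: randomly partition the data across workers, run MCMC locally in an embarrassingly parallel fashion to obtain local clustering draws, then refine across workers into a final partition. The sketch below illustrates that pattern only under stated assumptions; it is not the authors' DIB-C implementation. Scikit-learn's variational BayesianGaussianMixture stands in for the paper's MCMC sampler over an overfitted mixture, and the cross-worker refinement is reduced to a nearest-mean label alignment rather than the paper's loss-function-based partition estimate. The helper names (fit_local, combine) are hypothetical.

```python
# Minimal divide-and-combine sketch of the workflow in the abstract.
# Assumptions: variational BayesianGaussianMixture replaces the paper's MCMC
# sampler, and "refinement" is a crude nearest-mean label alignment.
import numpy as np
from multiprocessing import Pool
from sklearn.mixture import BayesianGaussianMixture

N_COMPONENTS = 10  # deliberately overfitted; unused components get ~zero weight


def fit_local(shard):
    """Fit an overfitted Bayesian Gaussian mixture on one worker's shard."""
    model = BayesianGaussianMixture(
        n_components=N_COMPONENTS,
        weight_concentration_prior=1.0 / N_COMPONENTS,  # sparsity-inducing prior
        covariance_type="full",
        max_iter=500,
        random_state=0,
    )
    model.fit(shard)
    return model


def combine(models):
    """Placeholder cross-worker refinement: align component labels by mean.

    The paper refines local draws via a loss function on the space of
    partitions; here each worker's components are simply mapped to the
    nearest component means of worker 0.
    """
    reference = models[0].means_
    alignments = []
    for m in models:
        # Distance from each local component mean to each reference mean.
        dists = np.linalg.norm(m.means_[:, None, :] - reference[None, :, :], axis=2)
        alignments.append(dists.argmin(axis=1))
    return alignments


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: two well-separated Gaussian clusters, split across 4 workers.
    data = np.vstack([rng.normal(0, 1, (2000, 2)), rng.normal(6, 1, (2000, 2))])
    rng.shuffle(data)
    shards = np.array_split(data, 4)

    with Pool(4) as pool:              # embarrassingly parallel local fits
        local_models = pool.map(fit_local, shards)

    label_maps = combine(local_models)  # align labels across workers
    shard_labels = [lm[m.predict(s)]
                    for m, s, lm in zip(local_models, shards, label_maps)]
```

Because the local fits never communicate, the parallel stage scales with the number of workers; only the final alignment step touches all workers' outputs, which is what makes the overall scheme "nearly" embarrassingly parallel.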
