- The paper introduces the confidence sampler, a subsampling-based method that reduces the number of likelihood evaluations per MCMC iteration on tall data.
- It critiques existing divide-and-conquer and subsampling techniques, highlighting the trade-offs each makes between scalability and estimator bias.
- The findings offer practical insights for applying Bayesian inference to large-scale problems in areas like genomics and sensor networks.
Overview of "On Markov Chain Monte Carlo Methods for Tall Data"
This paper addresses the computational challenges of applying Markov chain Monte Carlo (MCMC) methods to tall data, i.e., datasets consisting of a very large number of individual data points. Standard MCMC methods, such as the Metropolis-Hastings (MH) algorithm, scale poorly to such datasets because each iteration requires evaluating the likelihood over the entire dataset; the sketch below makes this cost explicit. The paper focuses on two principal families of approaches for reducing this cost in a Bayesian inference context: divide-and-conquer methods and subsampling-based algorithms.
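To make the bottleneck concrete, here is a minimal random-walk Metropolis-Hastings sketch (not taken from the paper; `log_lik` and `log_prior` are hypothetical model-specific functions, with `log_lik` returning per-datum log-likelihoods). The two O(n) sweeps per iteration are exactly what both families of methods try to avoid.

```python
import numpy as np

def metropolis_hastings(data, log_lik, log_prior, theta0, n_iters=1000, step=0.1):
    """Minimal random-walk MH sketch; each iteration costs O(n) likelihood work."""
    rng = np.random.default_rng(0)
    theta = theta0
    # Full O(n) sweep over the data just to evaluate the current state.
    log_post = log_prior(theta) + log_lik(theta, data).sum()
    samples = []
    for _ in range(n_iters):
        # Symmetric Gaussian proposal, so the proposal ratio cancels in the MH test.
        proposal = theta + step * rng.standard_normal(np.shape(theta))
        # Another full O(n) sweep for the proposed state -- the tall-data bottleneck.
        log_post_prop = log_prior(proposal) + log_lik(proposal, data).sum()
        if np.log(rng.uniform()) < log_post_prop - log_post:
            theta, log_post = proposal, log_post_prop
        samples.append(theta)
    return np.array(samples)
```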
Divide-and-Conquer Approaches
Divide-and-conquer strategies partition the dataset into manageable batches, run MCMC independently on each batch, and then recombine the batch-level results to approximate the full posterior distribution. While intuitively appealing, accurate recombination is the hard part: many recombination rules rely on Gaussian assumptions and can be badly inaccurate when those assumptions fail. The authors provide a thorough critique of existing approaches, noting that current solutions often either lack scalability or suffer approximation error that grows exponentially with the number of batches. A representative recombination rule is sketched below.
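As one concrete illustration, consensus Monte Carlo (Scott et al.), one of the divide-and-conquer schemes the paper reviews, recombines subposterior draws by precision-weighted averaging, a rule that is exact only when every subposterior is Gaussian. The following is a minimal sketch, not the paper's code.

```python
import numpy as np

def consensus_combine(batch_samples):
    """Consensus-Monte-Carlo-style recombination (minimal sketch).

    batch_samples: list of (n_samples, d) arrays, one per batch, each drawn
    from that batch's subposterior. Exact only in the Gaussian case.
    """
    # Weight each batch by the inverse sample covariance of its draws.
    weights = [np.linalg.inv(np.atleast_2d(np.cov(s, rowvar=False)))
               for s in batch_samples]
    total = np.linalg.inv(sum(weights))  # normalizer for the weighted average
    n_samples = len(batch_samples[0])
    combined = np.array([
        total @ sum(w @ s[i] for w, s in zip(weights, batch_samples))
        for i in range(n_samples)
    ])
    return combined  # (n_samples, d) draws approximating the full posterior
```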
Subsampling-Based Approaches
Subsampling approaches instead reduce the per-iteration cost by evaluating only a portion of the dataset at each step. Key developments include pseudo-marginal methods, which substitute subsample-based likelihood estimates for the full likelihood, thereby sidestepping complete data sweeps. Two difficulties recur: controlling the variance of the estimators, and the potential loss of exactness relative to full evaluations. Pseudo-marginal variants do remain exact, in the sense of targeting the true posterior, provided the likelihood estimator is unbiased and nonnegative; yet keeping the estimator's variance, and hence the chain's mixing penalty, under control often requires evaluating large data subsets, limiting the practical gains. The sketch below shows the basic estimator and why exactness is delicate.
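A minimal sketch of the basic building block, assuming `data` is a NumPy array and `log_lik` is a hypothetical vectorized per-datum log-likelihood: uniform subsampling gives an unbiased estimate of the log-likelihood sum, but exponentiating it yields a biased likelihood estimate, which is precisely why building exact pseudo-marginal samplers on top of naive subsampling is hard.

```python
import numpy as np

def subsampled_log_lik(theta, data, log_lik, m, rng):
    """Unbiased subsampled estimate of sum_i log p(x_i | theta).

    Note: exp(subsampled_log_lik(...)) is a *biased* estimate of the
    likelihood itself (Jensen's inequality), the core obstacle to exact
    pseudo-marginal samplers based on naive subsampling.
    """
    n = len(data)
    idx = rng.choice(n, size=m, replace=True)    # uniform subsample, m << n
    return n * log_lik(theta, data[idx]).mean()  # rescaled to estimate the sum
```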
Performance Analysis and New Methods
This paper proposes an original subsampling-based approach, the confidence sampler, which uses concentration inequalities to bound, with user-chosen confidence, the error incurred by deciding each accept/reject step from a subsample. The authors establish theoretical guarantees showing that the approach can sharply reduce the number of likelihood evaluations required, particularly in the favorable regime where the Bernstein-von Mises approximation is accurate. Experiments illustrate that, with good proxies for the log-likelihood, the per-iteration cost can fall well below a full sweep over the data; a simplified sketch of the decision rule follows.
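Below is a simplified sketch of the confidence sampler's stopping rule (after Bardenet, Doucet, and Holmes; details such as the per-stage confidence budget are omitted). The caller precomputes the threshold `psi` from the MH uniform draw and the prior/proposal ratio; `log_ratio` is assumed to be a vectorized closure returning per-datum log-likelihood ratios between the proposed and current parameters, and `R` is an assumed bound on their range.

```python
import numpy as np

def confidence_accept(data, log_ratio, psi, R, delta=0.01, m0=100, rng=None):
    """Decide an MH accept/reject from a growing subsample.

    Compares the mean per-datum log-likelihood ratio to the threshold psi,
    enlarging the subsample until an empirical Bernstein bound (Audibert et
    al.) certifies which side of psi the full-data mean lies on.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(data)
    perm = rng.permutation(n)  # subsample without replacement via a permutation
    m = min(m0, n)
    while True:
        vals = log_ratio(data[perm[:m]])  # per-datum log ratios, subsample of size m
        mean, std = vals.mean(), vals.std()
        # Empirical Bernstein bound on |subsample mean - full-data mean|.
        c = std * np.sqrt(2 * np.log(3 / delta) / m) + 3 * R * np.log(3 / delta) / m
        if abs(mean - psi) > c or m == n:
            return mean > psi  # decision certified (or full sweep reached)
        m = min(2 * m, n)      # geometric growth of the subsample
```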
Implications and Future Prospects
The implications of these findings are substantial: MCMC may now be feasible for certain tall-data applications where it was previously impractical. The method's reliance on proxies and concentration inequalities also suggests a promising avenue for future research, potentially extending to settings where the target distribution is not well approximated by standard asymptotics, and thereby broadening the scope of real-world applications in fields such as genomics and large-scale sensor networks. Further work may focus on relaxing strong ergodicity assumptions and extending these techniques to non-Gaussian regimes, increasing their utility within the AI community.
In summary, the paper offers valuable insight into the scaling behavior of MCMC methods and a promising framework for overcoming key challenges in Bayesian inference for tall data, strengthening the place of these methods in modern computational statistics.