- The paper presents a novel subsampling MCMC framework that integrates control variates and a correlated pseudo-marginal scheme to enhance computational efficiency in Bayesian analysis.
- It employs unbiased log-likelihood estimators built from Taylor expansions around a reference parameter or around data centroids, keeping the error in the targeted posterior small.
- Empirical results show significantly improved sampling efficiency over standard MCMC, making the approach well suited to large-scale and computationally intensive applications.
Efficient Data Subsampling for Accelerating MCMC: A Detailed Analysis
The paper "Speeding Up MCMC by Efficient Data Subsampling" by Matias Quiroz, Robert Kohn, Mattias Villani, and Minh-Ngoc Tran presents a novel framework for implementing Markov Chain Monte Carlo (MCMC) methods with improved computational efficiency via data subsampling. This approach is particularly relevant for Bayesian inference in large data contexts where traditional MCMC methods struggle with scalability due to the computational burden of evaluating likelihoods based on the full dataset.
The core contribution is a Subsampling MCMC framework that approximates the full-data likelihood using only a small random subset of the observations at each iteration. Two key innovations underlie the framework: a highly efficient unbiased estimator of the log-likelihood based on control variates, and a correlated pseudo-marginal scheme that stabilizes the likelihood ratio in the Metropolis-Hastings acceptance probability.
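As a rough sketch of the estimator (notation ours, following the paper's general setup), write the full-data log-likelihood as a sum of per-observation contributions, $\ell(\theta) = \sum_{k=1}^{n} \ell_k(\theta)$, and let $q_k(\theta)$ be a cheap approximation of $\ell_k(\theta)$. With a subsample $u_1, \dots, u_m$ drawn uniformly with replacement, the difference estimator is

$$
\hat{\ell}_m(\theta) \;=\; \sum_{k=1}^{n} q_k(\theta) \;+\; \frac{n}{m} \sum_{i=1}^{m} \left( \ell_{u_i}(\theta) - q_{u_i}(\theta) \right),
$$

which is unbiased for $\ell(\theta)$ and has low variance whenever $q_k \approx \ell_k$. The control variates described next are precisely these $q_k$.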
Main Contributions and Methodology
The authors leverage control variates to substantially reduce the variance of the log-likelihood estimator. Two types of control variates are presented:
- Parameter Expanded Control Variates: These employ a Taylor expansion of each log-likelihood contribution around a reference parameter value; this variant is illustrated in the code sketch after this list.
- Data Expanded Control Variates: These expand each log-likelihood contribution around the nearest data centroid, so that after clustering the data, the expensive quantities need only be evaluated once per centroid.
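Below is a minimal, self-contained sketch of the parameter-expanded variant on a Gaussian toy model; the names (`theta_star`, `q_k`, `loglik_estimate`) are illustrative and not taken from the authors' code. For this quadratic log-likelihood the second-order expansion is exact, so the estimator has zero variance; in general models the $q_k$ only approximate the contributions near the reference point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_k ~ N(theta, 1), so ell_k(theta) = -0.5*(y_k - theta)**2 + const.
n = 100_000
y = rng.normal(1.5, 1.0, size=n)

def ell_k(theta, yk):
    return -0.5 * (yk - theta) ** 2 - 0.5 * np.log(2 * np.pi)

# Parameter-expanded control variates: a second-order Taylor expansion of each
# ell_k around a fixed reference value theta_star (e.g. a preliminary mode
# estimate). Per-observation derivatives are computed once, before sampling.
theta_star = y.mean()
ell_star = ell_k(theta_star, y)      # ell_k(theta_star) for all k
grad_star = y - theta_star           # d ell_k / d theta at theta_star
hess_star = -np.ones(n)              # second derivative (constant for this model)

# The O(n) sums are computed once; each MCMC iteration then costs O(m).
sums = ell_star.sum(), grad_star.sum(), hess_star.sum()

def q_sum(theta):
    """Sum of all n control variates q_k(theta) at O(1) cost per iteration."""
    d = theta - theta_star
    return sums[0] + d * sums[1] + 0.5 * d**2 * sums[2]

def q_k(theta, idx):
    """Control variates for the sampled indices only."""
    d = theta - theta_star
    return ell_star[idx] + d * grad_star[idx] + 0.5 * d**2 * hess_star[idx]

def loglik_estimate(theta, m=300):
    """Unbiased difference estimator of the full-data log-likelihood."""
    idx = rng.integers(0, n, size=m)                  # uniform, with replacement
    return q_sum(theta) + (n / m) * (ell_k(theta, y[idx]) - q_k(theta, idx)).sum()

print(ell_k(1.6, y).sum(), loglik_estimate(1.6))      # estimate tracks the exact sum
```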
By using these control variates, the Subsampling MCMC framework targets a perturbed posterior whose error relative to the true posterior is small. The paper rigorously establishes that the total variation distance between the two vanishes rapidly as the dataset grows, even when the subsample size stays small, particularly when the Maximum Likelihood Estimate (MLE) from the full dataset is known and used as the expansion point.
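Concretely, the perturbation arises when the unbiased log-likelihood estimator is exponentiated to form the likelihood estimate used in the pseudo-marginal acceptance ratio. As we recall the paper's construction (notation ours), the estimator is bias-corrected as

$$
\hat{L}_m(\theta) \;=\; \exp\!\left( \hat{\ell}_m(\theta) - \tfrac{1}{2}\,\hat{\sigma}_m^2(\theta) \right),
$$

where $\hat{\sigma}_m^2(\theta)$ estimates the variance of $\hat{\ell}_m(\theta)$. If $\hat{\ell}_m(\theta)$ were exactly Gaussian with known variance, this correction would make $\hat{L}_m$ unbiased; the residual bias is what produces the small posterior perturbation that the paper bounds in total variation norm.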
Key Findings and Results
Empirical evaluations demonstrate substantial improvements in computational efficiency: Subsampling MCMC attains significantly higher sampling efficiency than standard MCMC. The gains are most pronounced on large datasets, because the control variates keep the variance of the log-likelihood estimator bounded, which is crucial for pseudo-marginal MCMC to mix well.
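One way to induce this correlation, in the spirit of the paper's correlated pseudo-marginal scheme, is to refresh only a small fraction of the subsample indices at each iteration, so the log-likelihood estimates at the current and proposed parameters share most of their terms. The following is a hypothetical minimal sketch (control variates are omitted to keep the mechanism in focus, and the chain targets a slightly perturbed posterior because the exponentiated estimator is not exactly unbiased):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data and per-observation log-likelihood, as in the earlier sketch.
n = 50_000
y = rng.normal(1.5, 1.0, size=n)

def ell_hat(theta, idx):
    """Plain subsampling estimator of the log-likelihood (constants dropped)."""
    return (n / idx.size) * (-0.5 * (y[idx] - theta) ** 2).sum()

def refresh(idx, frac=0.05):
    """Correlated pseudo-marginal move on the subsample: redraw only a small
    fraction of the indices, keeping consecutive estimates highly correlated."""
    out = idx.copy()
    k = max(1, int(frac * idx.size))
    pos = rng.choice(idx.size, size=k, replace=False)
    out[pos] = rng.integers(0, n, size=k)
    return out

# Metropolis-Hastings over (theta, idx) jointly: random-walk proposal on theta,
# correlated refresh on idx; both are accepted or rejected together.
theta, idx = 0.0, rng.integers(0, n, size=500)
cur = ell_hat(theta, idx)
for _ in range(2000):
    theta_p, idx_p = theta + 0.02 * rng.normal(), refresh(idx)
    prop = ell_hat(theta_p, idx_p)
    if np.log(rng.random()) < prop - cur:   # flat prior for simplicity
        theta, idx, cur = theta_p, idx_p, prop

print(theta)   # should end up near the sample mean of y
```

Because redrawing uniformly chosen entries of the index vector is a symmetric proposal, it cancels from the acceptance ratio, and the subsample can be treated as an auxiliary variable updated jointly with the parameter.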
Interestingly, the framework not only surpasses traditional MCMC in performance but also outperforms existing subsampling methods. The theoretical analysis corroborates these empirical findings by quantifying how the error shrinks: for instance, with a subsample size m = O(n^{1/2}), where n is the full data size, the framework achieves a sharply reduced perturbation error in the perturbed posterior.
Implications and Future Directions
The implications of this work are notable for fields that require Bayesian inference on large datasets, such as genetics, rare event simulation, and financial econometrics. By making MCMC computationally manageable at scale, the method can substantially reduce the cost of practical applications that require rapid, repeated sampling from posterior distributions.
Moving forward, the paper's insights into correlating the pseudo-marginal auxiliary variables suggest further avenues for improving MCMC convergence rates. The authors also point toward applying these methods to models with expensive individual likelihood evaluations, broadening the applicability of Subsampling MCMC.
This work is a significant step toward enabling scalable Bayesian computation, offering robust theoretical underpinnings and demonstrating practical improvements across various empirical benchmarks. Future research might explore adapting this framework for even broader classes of models and consider its integration with advanced proposal mechanisms such as those found in Gibbs sampling and Langevin dynamics.