- The paper provides an in-depth survey of random features, exploring algorithms, theoretical foundations, and applications for accelerating kernel methods in large-scale problems.
- It categorizes random feature algorithms into data-independent methods such as RFF and ORF, and data-dependent approaches that leverage training data for improved feature selection.
- The work presents theoretical analysis of the number of features required for kernel approximation and generalization, complemented by rigorous empirical benchmarks on a range of datasets.
A Survey on Random Features for Kernel Approximation
The paper "Random Features for Kernel Approximation: A Survey on Algorithms, Theory, and Beyond" provides an in-depth exploration of random features, a widely used approach to accelerate kernel methods in large-scale problems. The work is thoroughly organized into segments highlighting the algorithms, theoretical foundations, and applications of random features.
Algorithms
The paper categorizes random feature algorithms into data-independent and data-dependent methods:
- Data-independent approaches: These include the classic Random Fourier Features (RFF) method, which samples frequencies from the spectral distribution of a shift-invariant kernel given by Bochner's theorem. The survey then covers enhancements of standard RFF, such as orthogonalization (Orthogonal Random Features, ORF) and structured matrices (Fastfood and SORF) that reduce variance or computational complexity, as well as quasi-Monte Carlo techniques for more efficient sampling. A minimal sketch of RFF and ORF appears right after this list.
- Data-dependent approaches: These methods leverage the training data to improve feature selection. Techniques such as Leverage Score Sampling, Kernel Alignment, and Kernel Polarization are discussed, all of which tailor the choice of features to the data distribution. The authors also explore approaches that learn the kernel's spectral distribution directly, potentially yielding greater adaptability and robustness across diverse datasets; a sketch of a simple leverage-score variant follows the RFF example below.
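To make the data-independent constructions concrete, here is a minimal NumPy sketch (an illustration written for this summary, not code from the survey) of RFF for the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2), with an optional block-orthogonalization step in the spirit of ORF; the helper name `rff_features` and its parameters are assumptions of this sketch.

```python
import numpy as np

def rff_features(X, n_features, gamma=1.0, orthogonal=False, seed=0):
    """Random Fourier features for the Gaussian kernel
    k(x, y) = exp(-gamma * ||x - y||^2); with orthogonal=True the
    frequencies are block-orthogonalized as in Orthogonal Random Features."""
    rng = np.random.default_rng(seed)
    _, d = X.shape

    # RFF: draw frequencies from the kernel's spectral distribution N(0, 2*gamma*I).
    W = rng.normal(size=(d, n_features)) * np.sqrt(2.0 * gamma)

    if orthogonal:
        # ORF: build d x d orthogonal blocks and rescale each frequency by a
        # chi(d)-distributed norm so the marginal distribution matches RFF.
        W = np.empty((d, n_features))
        for start in range(0, n_features, d):
            Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
            norms = np.sqrt(rng.chisquare(d, size=d))
            block = Q * norms * np.sqrt(2.0 * gamma)       # scale columns
            cols = min(d, n_features - start)
            W[:, start:start + cols] = block[:, :cols]

    # Random phases and cosine features; Z @ Z.T approximates the kernel matrix.
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

The orthogonal variant keeps the estimate unbiased for the Gaussian kernel while typically lowering its variance, which is the main motivation behind ORF.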
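For the data-dependent family, one simple flavor of leverage-score sampling can be sketched as follows (again an illustrative construction under the Gaussian-kernel assumption, not the survey's pseudocode; `leverage_score_resample`, `n_pool`, and `lam` are hypothetical names): draw a pool of candidate frequencies, score the corresponding features by their empirical ridge leverage scores, and resample frequencies in proportion to those scores with importance-weight corrections.

```python
import numpy as np

def leverage_score_resample(X, n_pool, n_features, gamma=1.0, lam=1e-3, seed=0):
    """Resample random Fourier frequencies for the Gaussian kernel according
    to approximate ridge leverage scores computed on the data X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # 1. Candidate pool drawn from the kernel's spectral distribution.
    W = rng.normal(size=(d, n_pool)) * np.sqrt(2.0 * gamma)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_pool)
    Z = np.sqrt(2.0 / n_pool) * np.cos(X @ W + b)          # n x n_pool

    # 2. Ridge leverage score of candidate j:
    #    l_j = z_j^T (Z Z^T + n*lam*I)^{-1} z_j, which by the push-through
    #    identity equals the j-th diagonal entry of (M + n*lam*I)^{-1} M,
    #    where M = Z^T Z is the smaller n_pool x n_pool Gram matrix.
    M = Z.T @ Z
    scores = np.diag(np.linalg.solve(M + n * lam * np.eye(n_pool), M))
    probs = np.clip(scores, 1e-12, None)
    probs /= probs.sum()

    # 3. Importance-resample frequencies and reweight the features so that
    #    Z_new @ Z_new.T stays an (approximately) unbiased kernel estimate.
    idx = rng.choice(n_pool, size=n_features, replace=True, p=probs)
    weights = 1.0 / np.sqrt(n_features * n_pool * probs[idx])
    Z_new = np.sqrt(2.0) * np.cos(X @ W[:, idx] + b[idx]) * weights
    return W[:, idx], b[idx], Z_new
```

The idea is that frequencies with large leverage scores matter most for the regularized learning problem, so spending the feature budget on them can reduce the number of features needed.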
Theoretical Analysis
The theoretical contribution of the paper is substantial, offering a nuanced discussion of how many random features are needed to guarantee high-quality kernel approximation and good generalization. The key findings for data-independent methods show that, under suitable conditions, the kernel matrix and the downstream learning performance can be approximated well with a number of features that can be substantially smaller than the number of training samples; a classical bound of this kind is recalled below.
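As a concrete point of reference (paraphrased from the classical uniform convergence result of Rahimi and Recht rather than quoted from the survey), for a shift-invariant kernel on a compact set $\mathcal{X} \subset \mathbb{R}^d$ of diameter $\ell$, with spectral distribution $p$ and an $s$-dimensional feature map $z(\cdot)$, one has, up to absolute constants,

$$
\Pr\Big[\sup_{x,y\in\mathcal{X}} \big|k(x,y) - z(x)^{\top} z(y)\big| \ge \varepsilon\Big] \;\lesssim\; \Big(\frac{\sigma_p\,\ell}{\varepsilon}\Big)^{2} \exp\!\Big(-\frac{s\,\varepsilon^{2}}{4(d+2)}\Big),
$$

where $\sigma_p^2 = \mathbb{E}_{\omega \sim p}\|\omega\|^2$; hence $s = \Omega\big(\tfrac{d}{\varepsilon^2}\log\tfrac{\sigma_p \ell}{\varepsilon}\big)$ features suffice for a uniform $\varepsilon$-approximation with high probability.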
The paper also extends this discussion to the generalization properties of learning algorithms built on random features, examining how the required number of features depends on the loss function and on eigenvalue-decay assumptions on the kernel matrix; a representative result is recalled below.
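A frequently cited result of this kind (due to Rudi and Rosasco, paraphrased here as a representative example rather than taken verbatim from the survey) states that for kernel ridge regression with the squared loss, under standard assumptions,

$$
s = \Omega\big(\sqrt{n}\,\log n\big) \quad\Longrightarrow\quad \mathbb{E}\,\mathcal{E}(\hat f_{s}) - \min_{f\in\mathcal{H}} \mathcal{E}(f) = O\!\big(n^{-1/2}\big),
$$

i.e., on the order of $\sqrt{n}\log n$ random features are enough to recover the learning rate of the exact kernel method, and the requirement drops further under fast eigenvalue decay or with leverage-score (data-dependent) sampling.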
Empirical Evaluation
The empirical evaluation section benchmarks a broad set of random feature algorithms on several datasets, including image benchmarks such as MNIST and CIFAR-10. The experiments compare kernel approximation error, training time, and predictive performance, providing a comprehensive picture of the practical trade-offs of the different enhancements. A simple version of the approximation-error metric used in such comparisons is sketched below.
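For reference, the kernel approximation error reported in such benchmarks is typically a relative Frobenius-norm discrepancy between the exact kernel matrix and its random-feature surrogate; the snippet below (an illustrative helper for the Gaussian kernel, not code released with the paper) shows one way to compute it on a small synthetic sample.

```python
import numpy as np
from scipy.spatial.distance import cdist

def relative_approx_error(X, Z, gamma=1.0):
    """||K - Z Z^T||_F / ||K||_F for the Gaussian kernel exp(-gamma * ||x - y||^2)."""
    K = np.exp(-gamma * cdist(X, X, metric="sqeuclidean"))
    return np.linalg.norm(K - Z @ Z.T, "fro") / np.linalg.norm(K, "fro")

# Tiny synthetic check with plain RFF features (see the earlier sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
gamma, s = 0.1, 200
W = rng.normal(size=(10, s)) * np.sqrt(2.0 * gamma)
b = rng.uniform(0.0, 2.0 * np.pi, size=s)
Z = np.sqrt(2.0 / s) * np.cos(X @ W + b)
print(relative_approx_error(X, Z, gamma=gamma))
```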
Implications and Future Directions
Random features are significant because they bridge the gap between computational efficiency and the expressive power of kernel methods, particularly on large-scale datasets where exact kernel methods become intractable. The survey's thoroughness in capturing the breadth of advancements in random feature techniques makes it both a guide for practitioners and a detailed roadmap for researchers.
Notably, while alternative approximations such as the Nyström method have been studied more extensively in some contexts, this paper positions random features as a compelling alternative, especially as enhancements like orthogonalization and data-dependent sampling broaden their applicability.
The exploration of random features in the context of over-parameterized models and their connections to deep learning indicates fertile ground for further research. As practical applications move into increasingly high-dimensional regimes, the theoretical foundations laid out here on the efficiency and quality of the approximation become all the more relevant.
In conclusion, the paper captures a decade's worth of progress on random features for kernel approximation and points to promising future directions, encouraging research that bridges theory with practice in emerging areas of machine learning and AI.