Scalable Kernel Methods via Doubly Stochastic Gradients (1407.5599v4)

Published 21 Jul 2014 in cs.LG and stat.ML

Abstract: The general perception is that kernel methods are not scalable, and neural nets are the methods of choice for nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve the problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. We show that a function produced by this procedure after $t$ iterations converges to the optimal function in the reproducing kernel Hilbert space in rate $O(1/t)$, and achieves a generalization performance of $O(1/\sqrt{t})$. This doubly stochasticity also allows us to avoid keeping the support vectors and to implement the algorithm in a small memory footprint, which is linear in number of iterations and independent of data dimension. Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show that our method can achieve competitive performance to neural nets in datasets such as 8 million handwritten digits from MNIST, 2.3 million energy materials from MolecularSpace, and 1 million photos from ImageNet.

Authors (7)
  1. Bo Dai (245 papers)
  2. Bo Xie (23 papers)
  3. Niao He (91 papers)
  4. Yingyu Liang (107 papers)
  5. Anant Raj (38 papers)
  6. Maria-Florina Balcan (87 papers)
  7. Le Song (140 papers)
Citations (226)

Summary

The paper presents a novel approach to scaling kernel methods through the development and application of doubly stochastic functional gradients. Traditional kernel methods face computational challenges when applied to large datasets due to the dense nature of kernel matrices, which require significant storage and computation resources. This work introduces a method for overcoming these challenges, effectively enabling kernel methods to compete with neural networks in performance on large-scale data.

Approach

The proposed method hinges on expressing kernel methods as convex optimization problems over a reproducing kernel Hilbert space (RKHS). A doubly stochastic approximation to the functional gradient then scales these methods by sampling at two levels: random training points and random features associated with the kernel. This dual approximation involves:

  1. Stochastic Data Sampling: The first level of approximation samples random training points at each iteration, yielding an unbiased estimate of the functional gradient.
  2. Random Feature Sampling: The second level samples random features associated with the kernel to approximate the kernel function. Because this auxiliary randomness is regenerated from pseudo-random seeds rather than stored, the function class can grow dynamically with incoming data (see the sketch after this list).
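
A minimal sketch of the idea in Python, assuming squared loss and a Gaussian RBF kernel with random Fourier features; the function names, step-size schedule, and hyperparameters are illustrative rather than the paper's reference implementation:

```python
import numpy as np

def phi(x, seed, sigma=1.0):
    """Random Fourier feature for a Gaussian RBF kernel; regenerated from its
    seed at every call so the feature itself never has to be stored."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / sigma, size=x.shape[0])  # spectral sample
    b = rng.uniform(0.0, 2.0 * np.pi)
    return np.sqrt(2.0) * np.cos(omega @ x + b)

def predict(x, alphas, sigma=1.0):
    """f(x) = sum_i alpha_i * phi_{omega_i}(x), with omega_i re-drawn from seed i."""
    return sum(a * phi(x, i, sigma) for i, a in enumerate(alphas))

def train(X, y, T, step=1.0, reg=1e-4, sigma=1.0, seed=0):
    """Doubly stochastic functional gradient descent for kernel ridge regression
    (squared loss). Only scalar coefficients are kept, so memory grows linearly
    in the number of iterations and is independent of the data dimension."""
    rng = np.random.default_rng(seed)
    alphas = []
    for i in range(T):
        j = rng.integers(len(X))              # first source of randomness: a data point
        x_i, y_i = X[j], y[j]
        f_xi = predict(x_i, alphas, sigma)    # evaluate using the same feature seeds 0..i-1
        gamma = step / (1.0 + i)              # decaying step size
        alphas = [(1.0 - gamma * reg) * a for a in alphas]           # shrink old coefficients (regularizer)
        alphas.append(-gamma * (f_xi - y_i) * phi(x_i, i, sigma))    # second source: random feature i
    return alphas
```

Each iteration appends a single scalar coefficient, and the corresponding random feature is reconstructed from its seed whenever `predict` is called; this is what keeps the memory footprint linear in the number of iterations and independent of the data dimension.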

Theoretical Contributions

  • Convergence Guarantees: The paper establishes that the proposed algorithm converges to the optimal solution in the RKHS at a rate of $O(1/t)$ in the number of iterations and achieves a generalization performance of $O(1/\sqrt{t})$. The analysis uses martingale arguments in Hilbert spaces, underscoring the robustness and reliability of the approach.
  • Approximation Bounds: The methodology provides novel bounds on how well the learned function approximates the desired function in the RKHS, showing that the variance introduced by the random feature approximation enters the error bound only as an additive constant-factor term and does not slow the overall convergence rate.
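
As a schematic summary (in our own notation, not the paper's exact theorem statement), writing $f_*$ for the optimal function in the RKHS and $R$ for the expected risk, the two guarantees read:

$$
\mathbb{E}\big[\,|f_t(x) - f_*(x)|^2\,\big] = O(1/t),
\qquad
\mathbb{E}\big[R(f_t)\big] - R(f_*) = O(1/\sqrt{t}).
$$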

Empirical Validation

The doubly stochastic methodology shows strong empirical results, achieving performance competitive with neural networks on large-scale problems where kernel methods have traditionally lagged because of scalability constraints: classification on MNIST (8 million handwritten digits) and ImageNet (1 million photos), and property prediction on MolecularSpace (2.3 million energy materials).

Implications and Future Directions

This work suggests that kernel methods, with their firm theoretical grounding, can rival deep learning models under circumstances where nonparametric inference is essential. The doubly stochastic approach particularly benefits environments where data arrives in a streaming fashion, as it flexibly adapts to novel inputs without significant computational overhead or memory burden.
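
Concretely, because the learner's entire state is the list of scalar coefficients (plus the seeds they imply), a streaming update can reuse the loop from the earlier sketch. The `update` helper below is a hypothetical extension of that sketch, not an API from the paper:

```python
def update(alphas, x_new, y_new, i, step=1.0, reg=1e-4, sigma=1.0):
    """Fold one freshly arrived example into the model by running a single
    doubly stochastic step; `i` is the global iteration count, which also
    serves as the seed of the new random feature."""
    f_x = predict(x_new, alphas, sigma)
    gamma = step / (1.0 + i)
    alphas = [(1.0 - gamma * reg) * a for a in alphas]
    alphas.append(-gamma * (f_x - y_new) * phi(x_new, i, sigma))
    return alphas
```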

Future work may leverage this framework to explore adaptive sampling strategies or integrate domain-specific knowledge to optimize the selection of random features further. Moreover, applications across varying domains, such as computational chemistry (demonstrated with datasets like MolecularSpace), highlight the broader implications for scientific data modeling.

This paper marks an advance in making kernel methods feasible for large-scale learning tasks, broadening their applicability while maintaining a balance between statistical efficiency and computational scalability.