- The paper introduces ORF, a novel method that replaces Gaussian matrices with scaled random orthogonal matrices to significantly reduce kernel approximation variance.
- The paper further presents SORF, a structured variant that reduces computational complexity from O(d²) to O(d log d) while maintaining approximation quality.
- These methods enable scalable kernel computations in high dimensions and open avenues for efficient kernel-based machine learning applications.
Orthogonal Random Features: Enhancing Kernel Approximation
This paper introduces Orthogonal Random Features (ORF), an advance in kernel approximation techniques, particularly for the Gaussian kernel. The authors propose a method that significantly reduces kernel approximation error by substituting the conventional random Gaussian matrix in Random Fourier Features (RFF) with an appropriately scaled random orthogonal matrix. This claim is substantiated through both theoretical proofs and empirical validation.
Methodological Innovation
Orthogonal Random Features (ORF) replace the traditional Gaussian transformation matrix W_RFF = (1/σ) G, where G is a random Gaussian matrix, with W_ORF = (1/σ) S Q, where Q is a uniformly distributed random orthogonal matrix and S is a diagonal matrix whose entries are sampled from the chi distribution with d degrees of freedom, which restores the row-norm distribution of a Gaussian matrix and thus ensures correct scaling. This modification considerably suppresses the variance of the approximation while preserving unbiasedness. A minimal sketch of the construction follows.
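The following NumPy sketch illustrates this construction under stated assumptions; it is not the authors' code. The function name orf_features and the cos/sin feature map are our choices, and d_out is assumed to be a multiple of the input dimension d.

```python
import numpy as np

def orf_features(X, d_out, sigma=1.0, seed=0):
    """Map X (n, d) to ORF features approximating the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 * sigma**2)).

    Assumes d_out is a multiple of the input dimension d.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(d_out // d):
        # Haar-distributed orthogonal Q: QR-decompose a Gaussian matrix,
        # then fix column signs so Q is uniform over the orthogonal group.
        G = rng.standard_normal((d, d))
        Q, R = np.linalg.qr(G)
        Q *= np.sign(np.diag(R))
        # Diagonal S with chi(d)-distributed entries (norms of Gaussian
        # vectors) restores the row-norm distribution of a Gaussian matrix.
        s = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
        blocks.append(s[:, None] * Q / sigma)   # one block of (1/sigma) S Q
    W = np.vstack(blocks)                        # shape (d_out, d)
    Z = X @ W.T
    # Real feature map phi(x) = (1/sqrt(d_out)) [cos(Wx); sin(Wx)], so that
    # phi(x) . phi(y) is an unbiased estimate of k(x, y).
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(d_out)
```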
The second contribution, Structured Orthogonal Random Features (SORF), further improves computational efficiency by replacing the orthogonal matrix with a product of structured matrices, W_SORF = (√d/σ) H D1 H D2 H D3, where H is the normalized Walsh-Hadamard matrix and each D_i is a diagonal matrix of random ±1 signs. Because multiplication by H admits a fast transform, this reduces the computational complexity from O(d²) to O(d log d), an improvement shown to come at negligible cost in approximation quality. A sketch of this transform appears below.
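Below is a hedged sketch of this Hadamard-diagonal structure; fwht and sorf_transform are our own illustrative names, and d is assumed to be a power of two. The rows of the implied W play the same role as the frequency vectors in the ORF feature map above.

```python
import numpy as np

def fwht(X):
    """Unnormalized fast Walsh-Hadamard transform of each row of X.

    X: array of shape (n, d), d a power of two; O(d log d) per row.
    """
    X = X.copy()
    d = X.shape[1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = X[:, i:i + h].copy()
            X[:, i:i + h] = a + X[:, i + h:i + 2 * h]
            X[:, i + h:i + 2 * h] = a - X[:, i + h:i + 2 * h]
        h *= 2
    return X

def sorf_transform(X, sigma=1.0, seed=0):
    """Apply W x with W = (sqrt(d)/sigma) H D1 H D2 H D3, where H is the
    normalized Walsh-Hadamard matrix and each D_i is a diagonal matrix of
    random +/-1 signs. Assumes the input dimension d is a power of two."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Z = X
    # The D_i are i.i.d., so the order in which they are sampled is immaterial.
    for _ in range(3):
        signs = rng.choice([-1.0, 1.0], size=d)  # one diagonal sign matrix D_i
        Z = fwht(Z * signs) / np.sqrt(d)         # normalized Hadamard: H D_i z
    return np.sqrt(d) / sigma * Z                # overall sqrt(d)/sigma scaling
```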
Empirical and Theoretical Insights
Empirically, ORF and SORF demonstrate superior performance across various datasets in terms of the mean squared error of the kernel approximation. The analysis of ORF's variance reduction as a function of the input norm z (normalized by the kernel bandwidth σ) confirms that enforcing orthogonality yields notably lower errors, especially in the range of z/σ where data density is typically highest in applications.
The paper further shows that while ORF already achieves reduced variance compared to RFF, SORF not only matches ORF's variance-reduction properties but also keeps its small bias within bounds acceptable for practical applications. These enhancements enable more scalable and efficient kernel methods without significant trade-offs in accuracy. An illustrative comparison appears below.
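As an illustrative check of the variance reduction (our own synthetic experiment, not a result from the paper), one can compare the kernel-approximation MSE of a plain RFF baseline against the orf_features sketch above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 200, 32, 1.0
X = rng.standard_normal((n, d))

# Exact Gaussian kernel matrix as the ground truth.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

def rff_features(X, d_out, sigma=1.0, seed=0):
    # Baseline RFF: unstructured Gaussian projection.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d_out, X.shape[1])) / sigma
    Z = X @ W.T
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(d_out)

for name, feat in [("RFF", rff_features), ("ORF", orf_features)]:
    errs = []
    for s in range(20):
        P = feat(X, d, sigma, seed=s)
        errs.append(np.mean((P @ P.T - K) ** 2))
    print(name, np.mean(errs))  # ORF should show a visibly lower MSE
```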
Implications and Broader Use-Cases
The methodological advancements proposed here have broader implications due to the structured nature of SORF, which lends itself to a variety of applications beyond kernel approximation, including scenarios requiring dimensionality reduction or binary embeddings. The structured orthogonal matrices, with their computational efficiency, represent a step toward reducing the often prohibitive costs associated with high-dimensional kernel approximations.
The paper further postulates that the Hadamard-diagonal scheme can serve as a general replacement for random Gaussian matrices in many applications, as illustrated below. This points to future exploration of more constrained matrices that retain the orthogonality benefits introduced here.
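As a hedged illustration of this claim (our own example, not from the paper), the same Hadamard-diagonal block can stand in for a dense Gaussian projection in a sign-based binary embedding, reusing the sorf_transform sketch above:

```python
import numpy as np

def binary_embedding(X, seed=0):
    """Hypothetical binary code: sign of the Hadamard-diagonal projection,
    used in place of sign(G x) with a dense Gaussian G, reducing the cost
    of the projection from O(d^2) to O(d log d)."""
    return np.sign(sorf_transform(X, sigma=1.0, seed=seed))
```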
Speculation on Future Developments
The proposed ORF and SORF substantially improve kernel methods that are widely employed in machine learning models, marking a fruitful area for future research. As data dimensionality and computational demands grow, preserving the theoretical properties of kernel methods while improving their tractability remains paramount. Further work could generalize these insights to a wider class of kernels and investigate their applicability in settings involving large-scale datasets and real-time processing.
In conclusion, the paper presents a substantial methodological advancement in kernel approximation methods via ORF and SORF, offering improved scalability and efficiency without loss of approximation integrity. This furnishes new avenues for researchers and practitioners leveraging kernel-based methods across various data science disciplines.