
Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning (0809.1493v1)

Published 9 Sep 2008 in cs.LG and stat.ML

Abstract: For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the l1-norm or the block l1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non-linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

Citations (250)

Summary

  • The paper introduces hierarchical multiple kernel learning (HKL) utilizing sparsity-inducing norms to efficiently explore large, structured feature spaces.
  • Numerical results show that this HKL approach achieves state-of-the-art predictive performance and variable selection accuracy on high-dimensional datasets.
  • The HKL method offers computational efficiency and theoretical consistency guarantees, and yields interpretable models with practical applications in various technical fields.

Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

The paper, authored by Francis Bach, focuses on the development and implementation of hierarchical multiple kernel learning (HKL) for large feature spaces. The research addresses the computational challenges of exploring positive definite kernels that span vast and potentially infinite-dimensional feature spaces by utilizing sparsity-inducing norms such as the l1-norm and the block l1-norm.
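To make the contrast with Hilbertian penalization concrete, the following is a finite-dimensional analogy only (synthetic data, not the paper's kernel setting): an l2 (ridge) penalty shrinks all coefficients but leaves them nonzero, whereas an l1 (lasso) penalty drives irrelevant coefficients exactly to zero, which is what makes sparsity-inducing norms usable for selection.

```python
# Finite-dimensional analogy (illustrative only, not the paper's kernel setting):
# l2 (ridge) shrinks coefficients toward zero; l1 (lasso) sets irrelevant ones
# exactly to zero, which is the selection effect exploited by HKL.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)  # only 2 relevant features

print(np.round(Ridge(alpha=1.0).fit(X, y).coef_, 2))  # typically all 10 coefficients nonzero
print(np.round(Lasso(alpha=0.1).fit(X, y).coef_, 2))  # irrelevant coefficients driven to exactly 0
```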

Overview

The core idea is to penalize predictors with sparsity-inducing norms rather than the more common Euclidean or Hilbertian norms. By structuring these norms hierarchically, the paper shows that it is possible to select a small subset of kernels from a large sum of individual basis kernels. The basis kernels are related through a directed acyclic graph (DAG) structure, which allows kernel selection to be performed efficiently, namely in polynomial time in the number of selected kernels.
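To make the structured penalty concrete, here is a minimal sketch (node names, weights, and the tiny DAG are illustrative, not the paper's code) of a hierarchical block l1-norm over a DAG of basis kernels: each node contributes the Euclidean norm of the coefficients attached to itself and its descendants, so a kernel can only be selected once all of its ancestors are selected.

```python
# Minimal sketch of a hierarchical block l1-norm over a DAG of basis kernels.
# Node names, weights, and the tiny DAG are illustrative, not from the paper.
import numpy as np
import networkx as nx

def hierarchical_block_l1(beta, dag, weights):
    """beta: dict node -> coefficient vector; dag: nx.DiGraph; weights: dict node -> d_v."""
    penalty = 0.0
    for v in dag.nodes:
        desc = {v} | nx.descendants(dag, v)                    # D(v): v and all of its descendants
        block = np.concatenate([np.ravel(beta[u]) for u in desc])
        penalty += weights[v] * np.linalg.norm(block)          # d_v * ||beta_{D(v)}||_2
    return penalty

# Toy DAG: two single-variable kernels and their product kernel as a common descendant.
dag = nx.DiGraph([("x1", "x1*x2"), ("x2", "x1*x2")])
beta = {"x1": np.array([0.5]), "x2": np.array([0.0]), "x1*x2": np.array([0.0])}
weights = {v: 1.0 for v in dag.nodes}
print(hierarchical_block_l1(beta, dag, weights))  # 0.5: only the block rooted at x1 is active
```

Because each block covers a node together with its descendants, zeroing a block zeroes the entire sub-DAG below that node, which produces exactly the hull-shaped sparsity patterns the paper exploits.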

Numerical Results

The paper reports extensive simulations on synthetic datasets and on datasets drawn from the UCI repository. The numerical results indicate that employing sparsity-inducing norms within the HKL framework often yields state-of-the-art predictive performance, in terms of both variable selection and prediction accuracy, suggesting that the HKL approach is robust on high-dimensional datasets.

Theoretical Contributions

The paper makes several contributions to the theoretical landscape of kernel learning:

  1. Enhanced Efficiency: The method transforms the problem of handling feature spaces with an exponential number of basis kernels into a tractable one, solvable in polynomial time in the number of selected kernels (a simplified sketch of the underlying active-set idea follows this list).
  2. Model Consistency: By embedding the basis kernels in a DAG, the paper derives conditions under which the set of kernels selected by HKL consistently recovers the hull of the relevant kernels, and hence the relevant variables.
  3. Regularization Framework: It extends known consistency results for the Lasso to the HKL framework, providing broader insights into model selection properties and predictive consistency.
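The sketch below illustrates, under heavy simplification, the active-set idea behind the polynomial-time claim: because the selected kernels must form a hull in the DAG, only kernels whose parents are already active are ever candidates, so the search stays on a small frontier rather than over exponentially many kernels. The scoring rule and refit here are crude stand-ins (residual alignment and a ridge-like solve), not the paper's optimality check; all names and the toy data are illustrative.

```python
# Simplified stand-in for the DAG-constrained active-set idea (illustrative only,
# not the paper's algorithm): a kernel may enter the active set only once all of
# its parents are active, keeping the search polynomial in the selected kernels.
import numpy as np
import networkx as nx

def active_set_selection(kernels, y, dag, max_active=5, tol=1e-3):
    """kernels: dict node -> (n, n) Gram matrix; y: (n,) targets; dag: nx.DiGraph."""
    active, residual = set(), y.astype(float).copy()
    for _ in range(max_active):
        # Frontier: inactive nodes whose parents are all active (sources have no parents).
        frontier = [v for v in dag.nodes
                    if v not in active and all(p in active for p in dag.predecessors(v))]
        if not frontier:
            break
        # Stand-in scoring rule: alignment of each candidate kernel with the residual.
        scores = {v: float(abs(residual @ kernels[v] @ residual)) for v in frontier}
        best = max(scores, key=scores.get)
        if scores[best] < tol:
            break
        active.add(best)
        # Crude refit on the active set (ridge-like solve), only to update the residual.
        K = sum(kernels[v] for v in active)
        alpha = np.linalg.solve(K + 1e-2 * np.eye(len(y)), y)
        residual = y - K @ alpha
    return active

# Toy usage: two input variables and their product, with rank-one "basis kernels".
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
y = X[:, 0] * X[:, 1]
kernels = {"x1": np.outer(X[:, 0], X[:, 0]),
           "x2": np.outer(X[:, 1], X[:, 1]),
           "x1*x2": np.outer(X[:, 0] * X[:, 1], X[:, 0] * X[:, 1])}
dag = nx.DiGraph([("x1", "x1*x2"), ("x2", "x1*x2")])
print(active_set_selection(kernels, y, dag))
```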

Practical Implications

The author highlights the practical applications of the proposed HKL methodology, particularly in fields requiring efficient exploration of large, non-linear feature spaces. The adoption of a sparsity-inducing norm structure can lead to more interpretable models that consistently identify relevant features, which is critical for applications such as bioinformatics, image recognition, and natural language processing.

Future Directions

The paper opens several avenues for future research:

  • Kernel Extensions: An exploration into different types of kernels, such as string and graph kernels, could extend the applicability of the HKL framework.
  • Scalability to Larger Datasets: Further optimization of the algorithms could lead to even greater efficiencies, particularly in massively parallel computing environments.
  • Non-parametric Extensions: Extending the consistency results to non-parametric settings may unlock new potential for adaptive model building.

In conclusion, the paper presents a significant advancement in the field of multiple kernel learning by offering a computationally efficient, theoretically grounded method for dealing with large feature spaces. These innovations hold promise for enhancing both the scale and interpretability of statistical models across a variety of technical domains.