
L2 Regularization for Learning Kernels (1205.2653v1)

Published 9 May 2012 in cs.LG and stat.ML

Abstract: The choice of the kernel is critical to the success of many learning algorithms but it is typically left to the user. Instead, the training data can be used to learn the kernel by selecting it out of a given family, such as that of non-negative linear combinations of p base kernels, constrained by a trace or L1 regularization. This paper studies the problem of learning kernels with the same family of kernels but with an L2 regularization instead, and for regression problems. We analyze the problem of learning kernels with ridge regression. We derive the form of the solution of the optimization problem and give an efficient iterative algorithm for computing that solution. We present a novel theoretical analysis of the problem based on stability and give learning bounds for orthogonal kernels that contain only an additive term O(√(p/m)) when compared to the standard kernel ridge regression stability bound. We also report the results of experiments indicating that L1 regularization can lead to modest improvements for a small number of kernels, but to performance degradations in larger-scale cases. In contrast, L2 regularization never degrades performance and in fact achieves significant improvements with a large number of kernels.

Citations (419)

Summary

  • The paper's main contribution is the derivation of tighter stability bounds for kernel ridge regression using L2 regularization, adding an O(√(p/m)) term.
  • The empirical analysis shows that L2 regularization consistently outperforms L1, especially in large-scale, multi-kernel regression tasks.
  • The study underscores the practical benefits of L2-regularized kernel learning, which delivers stable performance and reduces overfitting as the number of base kernels grows.

An Analysis of L2 Regularization for Learning Kernels in Regression Tasks

The paper by Cortes et al. presents an in-depth analysis of the use of L2 regularization in the context of learning kernels for regression problems. The focus on kernel methods, particularly in the framework of kernel ridge regression (KRR), provides insights into how L2 regularization can improve model performance compared to traditional L1 regularization. The analysis is grounded in both theoretical constructs and empirical evaluation, reflecting a comprehensive approach to understanding kernel learning.
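
For reference, since the later sections build on it, recall the standard kernel ridge regression solution for a fixed kernel K with Gram matrix \mathbf{K} over a training sample (x_1, y_1), ..., (x_m, y_m) and ridge parameter λ > 0:

\[
\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1}\,\mathbf{y},
\qquad
h(x) = \sum_{i=1}^{m} \alpha_i\, K(x_i, x).
\]

The kernel-learning problem studied in the paper asks how to choose K itself when it is only known to belong to a parameterized family of combinations of base kernels.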

Theoretical Contributions

A significant theoretical contribution of the paper is the derivation of new stability bounds for KRR when L2 regularization is employed. The authors argue that the choice of kernel is crucial for the success of kernel-based learning algorithms, yet this choice is typically left to the practitioner's discretion. By employing L2 regularization, they propose a methodology where the kernel is learned from data, selected from a family of kernels defined as non-negative linear combinations of base kernels. The stability analysis leads to a novel bound that includes an additive term O(√(p/m)), where p is the number of kernels and m is the sample size. This contrasts favorably with the multiplicative complexity factors seen in previous bounds using L1 regularization.
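
Writing μ for the combination weights and Λ for the radius of the L2 constraint (shorthand for the setup described here, not necessarily the paper's exact formulation), the kernel family and the joint learning problem can be sketched as

\[
K_\mu = \sum_{k=1}^{p} \mu_k K_k,
\qquad \mu_k \ge 0,
\qquad \|\mu\|_2 \le \Lambda,
\]
\[
\min_{\mu}\; \min_{h \in \mathbb{H}_{K_\mu}}\;
\lambda \|h\|_{K_\mu}^2 + \sum_{i=1}^{m} \big(h(x_i) - y_i\big)^2,
\]

where the inner minimization is ordinary kernel ridge regression with the combined kernel K_μ and the outer minimization selects the mixture weights within the nonnegative part of an L2 ball of radius Λ.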

A key assumption of the theoretical framework is that the base kernels are orthogonal. Under this assumption, the paper derives a generalization bound whose complexity term is augmented only by an additive factor, avoiding the logarithmic dependence on the number of kernels that appears in earlier analyses. The resulting uniform stability bound is tighter than in prior work, suggesting that L2 regularization offers more reliable guarantees on the estimation error, especially as the number of kernels grows.
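
Schematically, with constants, norm factors, and confidence terms omitted (the precise statement is in the paper), the resulting bound can be read as the standard kernel ridge regression stability bound plus a single additive term in the number of kernels:

\[
R(h) \;\le\; \widehat{R}(h) + O\!\left(\frac{1}{\sqrt{m}}\right) + O\!\left(\sqrt{\frac{p}{m}}\right),
\]

where R and \widehat{R} denote the generalization and empirical errors, p the number of orthogonal base kernels, and m the sample size. The additive, rather than multiplicative, dependence on p is the point of comparison with L1-based analyses.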

Empirical Analysis

The experimental results presented in the paper provide empirical evidence supporting the theoretical claims. The authors conduct experiments on a range of datasets, including several from the UCI Machine Learning Repository as well as domain-specific tasks. Their findings indicate that L2 regularization consistently outperforms L1 regularization, particularly as the number of kernels increases. In large-scale settings, L2 regularization not only avoids performance degradation but also achieves substantial improvements over baseline methods. These results hold across the tasks considered, supporting the robustness of the proposed approach.

Practical Implications and Future Directions

The demonstration that L2 regularization for learning kernels can significantly enhance model performance has important practical implications. In scenarios where computational resources and data availability allow a large number of base kernels to be explored, L2 regularization offers a compelling advantage: it keeps performance stable as kernels are added and avoids the degradation observed with L1 regularization at larger scales.

Looking forward, the results suggest several avenues for further research. Extending the analysis to non-orthogonal kernel sets would broaden the applicability of the theoretical findings. Integrating L2-regularized kernel learning into other machine learning paradigms, such as deep architectures, could also leverage its stability benefits. Finally, the iterative algorithm proposed for solving the L2-regularized kernel learning problem leaves room for further optimization and efficiency improvements.
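
As a concrete illustration of what such an iterative scheme might look like, the sketch below alternates between a standard KRR solve for fixed kernel weights and a re-weighting of the base kernels for a fixed dual vector. It is a minimal sketch written against the setup described in this summary: the specific update rule, the damping factor eta, and the projection onto the nonnegative L2 ball are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def l2_kernel_learning_krr(base_kernels, y, lam=1.0, Lambda=1.0,
                           n_iter=50, eta=0.5, tol=1e-6):
    """Alternating sketch for L2-regularized kernel learning with KRR.

    base_kernels : array of shape (p, m, m) holding the p base Gram matrices
    y            : array of shape (m,) with the regression targets
    lam          : KRR ridge parameter
    Lambda       : radius of the L2 ball constraining the kernel weights
    eta          : damping factor for the weight update (illustrative choice)
    """
    p, m, _ = base_kernels.shape
    mu = np.full(p, Lambda / np.sqrt(p))          # feasible uniform start
    alpha = np.zeros(m)
    for _ in range(n_iter):
        # Fixed weights: ordinary KRR with the combined kernel sum_k mu_k K_k.
        K_mu = np.tensordot(mu, base_kernels, axes=1)
        alpha = np.linalg.solve(K_mu + lam * np.eye(m), y)
        # Fixed alpha: score each base kernel by alpha^T K_k alpha, then
        # rescale the scores onto the L2 ball and keep the weights nonnegative.
        v = np.array([alpha @ K_k @ alpha for K_k in base_kernels])
        v = np.maximum(v, 0.0)
        norm_v = np.linalg.norm(v)
        if norm_v > 0:
            v *= Lambda / norm_v
        mu_next = (1.0 - eta) * mu + eta * v      # damped interpolation
        if np.linalg.norm(mu_next - mu) < tol:
            mu = mu_next
            break
        mu = mu_next
    return mu, alpha
```

The damped interpolation is only there to keep this toy iteration well behaved; a practical implementation would follow the update rule and convergence analysis given in the paper.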

In conclusion, the paper establishes a firm foundation for applying L2 regularization to learning kernels and opens promising directions for research on effective kernel selection methods across machine learning contexts.