- The paper establishes clear conditions for group Lasso consistency by distinguishing strict and weak requirements in both well-specified and misspecified models.
- It extends the analysis to multiple kernel learning using covariance operators in RKHS to address infinite-dimensional and nonlinear variable selection.
- Adaptive schemes are introduced to adjust regularization weights, enhancing robustness and accuracy in high-dimensional learning applications.
Consistency of the Group Lasso and Multiple Kernel Learning
The paper "Consistency of the Group Lasso and Multiple Kernel Learning" by Francis R. Bach presents a rigorous analysis of the consistency of group Lasso and its extension to multiple kernel learning (MKL) under various practical conditions. This work methodically extends classical results from the Lasso to more complex settings involving groups of variables and reproducing kernel Hilbert spaces (RKHS).
Problem Formulation and Background
The paper begins by addressing the problem of least-squares regression with regularization by a block ℓ1-norm, referred to as the group Lasso. This form of regularization promotes sparsity at the group level, driving entire groups of coefficients to zero rather than individual coefficients. This method is particularly relevant in situations where predictors naturally cluster into groups, such as different frequency bands in signal processing or various data sources in multi-modal data analysis.
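As a concrete illustration, the minimal sketch below implements the weighted block ℓ1-norm penalty and a simple proximal-gradient (ISTA) solver for the group Lasso in NumPy. The group structure, weights $d_j$, and regularization level are illustrative choices, not quantities from the paper.

```python
import numpy as np

def block_l1_norm(w, groups, d):
    """Weighted block l1-norm: sum_j d_j * ||w_j||_2 (the group-Lasso penalty)."""
    return sum(dj * np.linalg.norm(w[g]) for g, dj in zip(groups, d))

def group_soft_threshold(w, groups, thresholds):
    """Proximal operator of the block l1-norm: shrinks each group toward zero
    and sets it exactly to zero when its norm falls below the threshold."""
    out = w.copy()
    for g, t in zip(groups, thresholds):
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= t else (1.0 - t / norm) * w[g]
    return out

def group_lasso(X, y, groups, d, lam, n_iter=500):
    """Minimize (1/2n)||y - Xw||^2 + lam * sum_j d_j ||w_j|| by proximal gradient."""
    n, p = X.shape
    w = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the loss
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n              # gradient of the quadratic loss
        w = group_soft_threshold(w - step * grad, groups,
                                 [step * lam * dj for dj in d])
    return w
```

For example, with `groups = [np.arange(0, 3), np.arange(3, 6)]` and unit weights `d = [1.0, 1.0]`, entire three-coefficient blocks are zeroed out as `lam` grows, which is exactly the group-level sparsity described above.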
Consistency Results for Group Lasso
The paper establishes necessary and sufficient conditions for the consistency of the group Lasso under both well-specified and misspecified models. Writing $J$ for the set of groups with nonzero population coefficients (and $J^c$ for its complement), $\Sigma$ for the covariance matrix of the predictors, and $d_j$ for the group weights, the strict condition
$\max_{i \in J^c} \frac{1}{d_i} \left\| \Sigma_{X_i X_J} \Sigma_{X_J X_J}^{-1} \operatorname{Diag}(d_j / \|\mathbf{w}_j\|) \, \mathbf{w}_J \right\| < 1$
is shown to be sufficient for path consistency. In contrast, the weak condition
$\max_{i \in J^c} \frac{1}{d_i} \left\| \Sigma_{X_i X_J} \Sigma_{X_J X_J}^{-1} \operatorname{Diag}(d_j / \|\mathbf{w}_j\|) \, \mathbf{w}_J \right\| \leq 1$
is necessary. This distinction is crucial, as it highlights scenarios where the group Lasso may fail to identify the correct sparsity pattern, particularly under strong variable correlations.
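The condition can be checked numerically once a covariance matrix and the population (or estimated) coefficient vector are available. The sketch below is a direct translation of the formulas above; the group structure, active set `J`, and weights `d` are placeholders to be supplied by the user.

```python
import numpy as np

def group_lasso_condition(Sigma, w, groups, J, d):
    """Evaluate max_{i in J^c} (1/d_i) || Sigma_{X_i X_J} Sigma_{X_J X_J}^{-1}
    Diag(d_j / ||w_j||) w_J ||; a value < 1 matches the strict (sufficient)
    condition, <= 1 the weak (necessary) one."""
    idx_J = np.concatenate([groups[j] for j in J])
    # Diag(d_j / ||w_j||) w_J, stacked over the active groups
    scaled = np.concatenate([(d[j] / np.linalg.norm(w[groups[j]])) * w[groups[j]]
                             for j in J])
    rhs = np.linalg.solve(Sigma[np.ix_(idx_J, idx_J)], scaled)
    scores = []
    for i in range(len(groups)):
        if i in J:
            continue
        cross = Sigma[np.ix_(groups[i], idx_J)]   # Sigma_{X_i X_J}
        scores.append(np.linalg.norm(cross @ rhs) / d[i])
    return max(scores)
```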
Extensions to Multiple Kernel Learning
The paper extends the discussion to multiple kernel learning (MKL), a framework that generalizes the group Lasso to RKHS. Here, the consistency conditions are analogous but involve the covariance operators of the kernels. The strict condition becomes
$\max_{i \in J^c} \frac{1}{d_i} \left\| \Sigma_{X_i X_i}^{1/2} C_{X_i X_J} C_{X_J X_J}^{-1} \operatorname{Diag}(d_j / \|f_j\|_{\mathcal{F}_j}) \, g_J \right\|_{\mathcal{F}_i} < 1,$
with the weak condition being the same inequality with ≤1. The use of covariance operators allows the results to be applicable to infinite-dimensional spaces, making the findings relevant for a wider range of applications, including non-linear variable selection and learning from heterogeneous data sources.
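The operator-valued condition is not directly computable in general, but a common finite-dimensional surrogate is to approximate each RKHS with an explicit feature map (for example, random Fourier features for Gaussian kernels), after which the covariance operators reduce to ordinary covariance matrices and the linear-case check sketched earlier can be reused on the stacked features. The sketch below illustrates this surrogate only; the kernels, bandwidths, and feature counts are assumptions, not choices made in the paper.

```python
import numpy as np

def random_fourier_features(X, n_features=200, bandwidth=1.0, seed=0):
    """Explicit feature map approximating a Gaussian kernel (Rahimi-Recht features)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / bandwidth, size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Two candidate kernels on two input blocks, stacked into a single design matrix.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(500, 3)), rng.normal(size=(500, 3))
Phi = [random_fourier_features(Xb, seed=s) for Xb, s in [(X1, 1), (X2, 2)]]
Phi = [P - P.mean(axis=0) for P in Phi]            # center each block of features
Z = np.hstack(Phi)                                 # stacked feature design
groups = [np.arange(0, 200), np.arange(200, 400)]  # one group per kernel
Sigma = Z.T @ Z / Z.shape[0]                       # empirical covariance matrix
# Given an estimate w of the coefficients on the active groups J, the earlier
# group_lasso_condition(Sigma, w, groups, J, d) check can now be applied.
```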
Adaptive Schemes
To address the limitations of the strict and weak conditions, the paper introduces adaptive versions of the group Lasso and MKL. These adaptive methods set the regularization weights from initial estimates, ensuring consistency even when the non-adaptive method fails. This matters in practice: it provides a data-driven rule for choosing the weights, improving the robustness and accuracy of model selection.
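A minimal sketch of such a data-driven weighting rule is shown below: fit a pilot ridge estimate, then set each group weight inversely proportional to the estimated group norm before re-running the group Lasso. The exponent `gamma` and the ridge level are illustrative, not values prescribed by the paper.

```python
import numpy as np

def adaptive_group_weights(X, y, groups, gamma=1.0, ridge=1e-3):
    """Data-driven weights d_j = ||w_hat_j||^{-gamma} from a ridge pilot fit."""
    n, p = X.shape
    w_hat = np.linalg.solve(X.T @ X / n + ridge * np.eye(p), X.T @ y / n)
    return [np.linalg.norm(w_hat[g]) ** (-gamma) for g in groups]

# Usage with the earlier sketch: groups that look small in the pilot fit receive
# large weights and are penalized more heavily in the second pass.
# d_adapt = adaptive_group_weights(X, y, groups)
# w_adaptive = group_lasso(X, y, groups, d_adapt, lam=0.1)
```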
Practical Implications and Future Directions
The results have significant implications for the design of machine learning algorithms in high-dimensional settings. By providing clear conditions under which the group Lasso and MKL are consistent, this work aids practitioners in choosing appropriate regularization schemes and understanding the limitations of their models. The introduction of adaptive methods offers a practical solution to ensure consistency, which is crucial in real-world applications where model robustness is paramount.
Future research directions could involve extending these findings to other types of regularization norms and exploring the consistency of generalized linear models. Additionally, the treatment of high-dimensional instances where the number of groups grows with sample size remains a fertile ground for theoretical exploration, potentially impacting the development of scalable algorithms for multi-modal data integration and high-dimensional learning tasks.
In conclusion, this paper makes a substantial contribution to the theoretical understanding of the group Lasso and MKL, providing foundational results that inform both the theory and practice of high-dimensional statistical learning models.