
Multiple Kernel Learning with Group Lasso

Updated 10 January 2026
  • Multiple Kernel Learning with Group Lasso is a framework that employs group lasso regularization to induce structured sparsity across multiple reproducing kernel Hilbert spaces.
  • It utilizes a mixed ℓ2,1-norm penalty to automatically select informative kernels, ensuring computational scalability and achieving provable support recovery.
  • The approach is applied in multi-modal medical diagnostics, computer vision, and natural language processing, leveraging both batch and online optimization techniques.

Multiple kernel learning (MKL) with group lasso regularization refers to a family of convex optimization techniques that combine multiple reproducing kernel Hilbert spaces (RKHS) or feature modalities, inducing sparsity at the kernel or modality level. By formulating the learning problem with a mixed-norm penalty, specifically the $\ell_{2,1}$ norm, this method selects a parsimonious set of active kernels or feature groups while maintaining statistical and computational tractability. The approach unifies classical group lasso in finite dimensions with its infinite-dimensional RKHS (multiple kernel) analogues, supporting efficient large-scale and online implementations, with provable guarantees for support recovery and consistency.

1. Mathematical Formulation and Equivalence

The general finite-dimensional group lasso problem is defined as minimizing a loss, typically least squares, penalized by a block $\ell_1$ sum of norms:

$$\min_{w \in \mathbb{R}^p} \; \frac{1}{2n}\|Y - X w\|_2^2 + \lambda \sum_{j=1}^m d_j \|w_j\|_2,$$

where $w_j$ is the sub-vector of coefficients for group $j$ and $d_j > 0$ are group weights (0707.3390). This objective extends naturally to nonparametric settings such as MKL, where each group corresponds to a function $f_j$ in an RKHS $\mathcal{H}_j$ with kernel $K_j$. The primal MKL problem is then

$$\min_{f_j \in \mathcal{H}_j,\; \beta \ge 0} \; \sum_{i=1}^n \Big(y_i - \sum_{j=1}^m f_j(x_i)\Big)^2 + \lambda \sum_{j=1}^m \frac{\|f_j\|_{\mathcal{H}_j}^2}{\beta_j}$$

subject to $\sum_j \beta_j \leq 1$. By variable substitution and eliminating $\beta$, one can write the equivalent unconstrained problem as a group lasso with the group norm acting on function coefficients:

$$\min_{w_1,\ldots,w_m} \; \Big\|y - \sum_j \Phi_j w_j\Big\|_2^2 + \gamma \sum_j \|w_j\|_2,$$

where $\Phi_j$ is a feature map for $K_j$ (Aravkin et al., 2013).
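
The equivalence is easiest to see with explicit feature matrices. The following is a minimal sketch of the group-lasso form of the objective, assuming precomputed feature blocks `Phi_blocks` (one per kernel) and a squared loss; the names are illustrative, not taken from the cited papers.

```python
import numpy as np

def mkl_group_lasso_objective(y, Phi_blocks, w_blocks, gamma):
    """Group-lasso form of the MKL primal with explicit feature maps.

    Phi_blocks[j] is an (n, D_j) feature matrix for kernel j and
    w_blocks[j] the matching coefficient vector; gamma weights the
    sum of group norms.
    """
    residual = y - sum(Phi @ w for Phi, w in zip(Phi_blocks, w_blocks))
    data_fit = np.sum(residual ** 2)                            # ||y - sum_j Phi_j w_j||_2^2
    penalty = gamma * sum(np.linalg.norm(w) for w in w_blocks)  # sum_j ||w_j||_2
    return data_fit + penalty
```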

In the explicit feature setting, such as with random Fourier feature (RFF) embeddings for shift-invariant kernels, the concatenated RFF blocks for each kernel are penalized by the $\ell_{2,1}$ norm. This leads to group-wise block sparsity and automatic kernel or modality selection (Băzăvan et al., 2012, Liu et al., 2013).

2. Optimization Algorithms and Computational Complexity

The optimization problems arising from MKL with group lasso are convex and, crucially, can be solved efficiently through various algorithms:

  • Batch solvers: Forward-backward splitting (proximal gradient methods) alternates a gradient step on the smooth loss with a group-wise soft-thresholding operator (see the sketch after this list). The per-iteration cost scales with the number of kernels, sample size, and feature dimensionality (Garrigos et al., 2018, Băzăvan et al., 2012).
  • Primal formulations with finite features: For RFF-based MKL, the problem reduces to explicit finite-dimensional optimization, sidestepping the $O(N^2)$ scaling of kernel-matrix-based methods and yielding $O(NDm)$ cost per iteration, with $N$ samples, $D$ features per kernel, and $m$ kernels (Liu et al., 2013, Băzăvan et al., 2012).
  • Online and large-scale algorithms: Proximal stochastic gradient algorithms and randomized mirror descent permit MKL with group lasso for extremely large kernel sets (up to $d \sim 10^{12}$ kernels in polynomial kernel families) by exploiting efficient sampling and variance-bounded stochastic gradients (Martins et al., 2010, Afkanpour et al., 2012).
  • Dual formulations: For square loss, the Fenchel dual admits constraints enforcing group sparsity by bounding the RKHS norm of dual variables per group (Aravkin et al., 2013).
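
As a concrete illustration of the batch route, here is a minimal forward-backward sketch for the squared-loss, explicit-feature formulation above. The function names, fixed iteration count, and step-size choice are illustrative assumptions, not the exact algorithms of the cited works.

```python
import numpy as np

def group_soft_threshold(w, tau):
    """Proximal operator of tau * ||w||_2 for one group (block soft-thresholding)."""
    norm = np.linalg.norm(w)
    if norm <= tau:
        return np.zeros_like(w)          # the whole group is zeroed out
    return (1.0 - tau / norm) * w

def proximal_gradient_mkl(y, Phi_blocks, gamma, n_iter=500):
    """Forward-backward splitting on min_w 0.5*||y - Phi w||^2 + gamma * sum_j ||w_j||_2.

    Phi_blocks[j] is the (n, D_j) explicit feature matrix of kernel j (e.g. RFF).
    Returns one coefficient block per kernel; all-zero blocks are de-selected kernels.
    """
    Phi = np.hstack(Phi_blocks)                       # (n, sum_j D_j)
    idx = np.cumsum([0] + [B.shape[1] for B in Phi_blocks])
    w = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2          # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - y)                  # gradient of the smooth part
        z = w - step * grad                           # forward (gradient) step
        for j in range(len(Phi_blocks)):              # backward (proximal) step, group by group
            w[idx[j]:idx[j + 1]] = group_soft_threshold(z[idx[j]:idx[j + 1]], step * gamma)
    return [w[idx[j]:idx[j + 1]] for j in range(len(Phi_blocks))]
```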

The favorable computational scaling and convexity allow deployment to high-dimensional, large-sample, and high-kernel-cardinality settings with guarantees of global optimality.

3. Statistical Guarantees: Consistency and Support Recovery

MKL with group lasso inherits the statistical properties of block $\ell_1$ regularization. The necessary and sufficient conditions for exact group selection and asymptotic consistency are formalized via "irrepresentable conditions" on population (or empirical) covariance operators:

$$\max_{i \notin \mathcal{J}} \; \frac{1}{d_i}\Big\|\Sigma_{X_i X_\mathcal{J}}\,\Sigma_{X_\mathcal{J} X_\mathcal{J}}^{-1}\,\operatorname{Diag}(d_j/\|w_j\|_2)\, w_\mathcal{J}\Big\|_2 < 1,$$

where $\mathcal{J}$ is the active set of groups, $w_\mathcal{J}$ the concatenation of the true group coefficients, and $\operatorname{Diag}(d_j/\|w_j\|_2)$ the block-diagonal rescaling over active groups. In the RKHS/MKL setting, analogous conditions are formulated using Hilbert-space covariance operators and their correlation analogues (0707.3390).
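
For intuition, the empirical version of this quantity can be evaluated directly on a finite design. The sketch below assumes centered design blocks `X_blocks`, true group coefficients `w_blocks`, and per-group weights `d`; these names are hypothetical.

```python
import numpy as np

def irrepresentable_scores(X_blocks, w_blocks, active, d):
    """Empirical group-lasso irrepresentable quantities, one per inactive group.

    Exact support recovery requires every returned score to be strictly below 1.
    """
    n = X_blocks[0].shape[0]
    X_J = np.hstack([X_blocks[j] for j in active])
    Sigma_JJ = X_J.T @ X_J / n                                  # covariance of the active blocks
    scaled_wJ = np.concatenate([d[j] * w_blocks[j] / np.linalg.norm(w_blocks[j])
                                for j in active])               # Diag(d_j / ||w_j||) w_J
    v = np.linalg.solve(Sigma_JJ, scaled_wJ)
    scores = {}
    for i in range(len(X_blocks)):
        if i in active:
            continue
        Sigma_iJ = X_blocks[i].T @ X_J / n                      # cross-covariance with group i
        scores[i] = np.linalg.norm(Sigma_iJ @ v) / d[i]
    return scores
```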

Under these conditions, and suitable asymptotic scaling of the regularization parameter, group lasso MKL solutions converge in norm to the true function, recover the correct group/model support with probability tending to one, and yield estimators whose risk converges to the minimax rate (0707.3390, Garrigos et al., 2018). Adaptive reweighting schemes—where group penalties are chosen inversely proportional to initial (ridge) group norms—guarantee support recovery without the strong incoherence assumption (0707.3390).

4. Inducing Group Sparsity and Automatic Kernel/Modality Selection

Group lasso induces structured sparsity by penalizing the sum of the Euclidean norms of grouped parameters, promoting block-wise (group) zeros rather than element-wise sparsity as in the standard lasso. In the MKL framework, this mechanism operates at the kernel, group, or feature-modality level. The mixed-norm constraint, for example

$$\sum_{l=1}^p \|\beta_l\|_2 \leq 1,$$

where $\beta_l$ is the vector of kernel weights for group $l$, forces entire groups to have zero weight unless their collective contribution is sufficiently strong. This enables both automatic selection of informative kernels and the synergistic use of complementary information across groups (Liu et al., 2013, Băzăvan et al., 2012).
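
The group-zeroing mechanism can be made explicit in a worked special case (an illustrative assumption: mutually orthonormal feature blocks, $\Phi^\top \Phi = I$, with objective $\tfrac{1}{2}\|y - \Phi w\|_2^2 + \lambda \sum_j \|w_j\|_2$). The problem then separates over groups, and with $\tilde{w}_j = \Phi_j^\top y$ the solution is block soft-thresholding:

$$\hat{w}_j = \Big(1 - \frac{\lambda}{\|\tilde{w}_j\|_2}\Big)_{+} \tilde{w}_j, \qquad \hat{w}_j = 0 \iff \|\Phi_j^\top y\|_2 \leq \lambda,$$

so a kernel or modality is discarded exactly when the norm of its collective correlation with the response falls below the threshold $\lambda$.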

The approach is effective in multi-modal applications, such as brain imaging, computer vision, and structured prediction, where multi-source representations must be integrated, and irrelevant sources pruned automatically (Liu et al., 2013, Martins et al., 2010).

5. Practical Implementation and Empirical Results

For high-dimensional and large-sample domains, explicit random feature approximations, such as RFF for Gaussian kernels, enable scalable primal MKL formulations:

  • Gaussian kernels: $k(x,y) \approx \langle \Psi(x), \Psi(y) \rangle$, with $\Psi(x)$ constructed from sampled frequencies (Fourier basis), yielding $O(1/\sqrt{D})$ approximation error with $D$ features (Băzăvan et al., 2012, Liu et al., 2013); a construction sketch follows this list.
  • Experiments (e.g., on ADNI data) demonstrate that RFF + $\ell_{2,1}$-norm MKL achieves higher accuracy (ACC), Matthews correlation coefficient (MCC), and AUC than $\ell_1$- or $\ell_2$-based alternatives:
    • RFF + $\ell_{2,1}$: ACC $87.12 \pm 3.37\%$, MCC $73.30$, AUC $0.952$ (20 random splits, $N = 120$) (Liu et al., 2013).
    • In large-scale vision tasks, RFF-GroupLasso delivers accuracy competitive with classical kernel machines at orders-of-magnitude lower runtime (Băzăvan et al., 2012).
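
The RFF construction referenced above can be sketched as follows; the bandwidths, feature counts, and data here are placeholder assumptions. Each RFF block would be handed to a group-lasso solver as one group.

```python
import numpy as np

def gaussian_rff(X, D, sigma, rng):
    """Random Fourier features for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Returns an (n, D) matrix Psi such that k(x, y) ~ <Psi(x), Psi(y)>,
    with O(1/sqrt(D)) approximation error.
    """
    n, d = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # frequencies drawn from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)        # random phase offsets
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# One RFF block per candidate bandwidth; their concatenation defines the groups
# penalized by the l2,1 norm (one l2 norm per block).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # placeholder data
Phi_blocks = [gaussian_rff(X, D=100, sigma=s, rng=rng) for s in (0.5, 1.0, 2.0)]
```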

In online and structured prediction scenarios, proximal algorithms yield low regret and provably sparse, interpretable models, efficiently pruning non-informative feature-group templates from hundreds or thousands of possible kernels (Martins et al., 2010).
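
A minimal online variant, assuming a plain squared loss and a decaying step size (the structured losses and regret analysis of Martins et al., 2010 are not reproduced here), applies the same group-wise proximal step after each stochastic gradient:

```python
import numpy as np

def group_soft_threshold(w, tau):
    """Block soft-thresholding: proximal operator of tau * ||w||_2."""
    norm = np.linalg.norm(w)
    return np.zeros_like(w) if norm <= tau else (1.0 - tau / norm) * w

def online_proximal_mkl(stream, block_sizes, gamma, eta=0.1):
    """Online proximal stochastic gradient for the group-lasso MKL objective.

    stream yields (phi, y) pairs, where phi is the concatenated feature vector of
    one example, split into blocks of the given sizes (one block per kernel).
    """
    idx = np.cumsum([0] + list(block_sizes))
    w = np.zeros(idx[-1])
    for t, (phi, y) in enumerate(stream, start=1):
        step = eta / np.sqrt(t)                        # decaying step size
        grad = (phi @ w - y) * phi                     # stochastic gradient of the squared loss
        z = w - step * grad
        for j in range(len(block_sizes)):              # group-wise proximal update
            w[idx[j]:idx[j + 1]] = group_soft_threshold(z[idx[j]:idx[j + 1]], step * gamma)
    return w
```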

6. Extensions, Limitations, and Theoretical Insights

Beyond standard convex frameworks, nonconvex generalizations—such as Hyperparameter Group Lasso (HGLasso)—address bias induced by over-shrinkage in convex group penalties, potentially achieving sparser and less biased solutions at the expense of nonconvex optimization (Aravkin et al., 2013). However, convex MKL/group-lasso retains global optimality and scalability advantages.

The main theoretical limitation is the tendency of strict group lasso penalties to exclude modalities with weak but nontrivial signal, which motivates careful group design, possible overlap, or hierarchical variants. The group penalty's effectiveness depends on the degree of complementarity vs. redundancy across kernels.

Support identification under iterative thresholding is formally explained via the notion of mirror stratifiability—the stratification of regularizer subdifferentials guaranteeing that the correct partition of active/inactive groups is discovered in finitely many optimization steps under a qualification condition (Garrigos et al., 2018).

7. Applications and Impact

MKL with group lasso regularization has proven effective in a diverse array of domains:

  • Multi-modal medical diagnosis: Integration of biomarkers, shape, and regional features in Alzheimer’s classification (Liu et al., 2013).
  • Computer vision: Automatic kernel selection and large-scale object recognition with RFF+Group Lasso (Băzăvan et al., 2012).
  • Natural language processing and structured prediction: Sparse selection of informative structure templates in dependency parsing and sequence labeling (Martins et al., 2010).
  • Large-scale polynomial kernel regression: Efficient learning with exponentially large kernel sets using randomized mirror descent (Afkanpour et al., 2012).

This approach unifies theoretical rigor (statistical consistency, support recovery, and convexity) with scalable computational methodologies, substantially broadening the applicability of kernel-based representations in machine learning.
