
Multi-Task Linear Regression Insights

Updated 5 January 2026
  • Multi-task linear regression is a framework that fits multiple related linear models using shared regularization to leverage inter-task relationships.
  • It employs diverse regularizers—including graph-based, block-sparsity, and optimal transport—to enforce similarity, select features, and enable clear structural discoveries.
  • Applications span time series forecasting, neuroimaging, and distributed networks, demonstrating improved prediction accuracy, sample efficiency, and scalability.

Multi-task linear regression is a paradigm that simultaneously fits multiple related linear predictive models by coupling their parameter estimation through shared regularization or structural constraints. The goal is to leverage relationships among tasks—be they known a priori or inferred from data—to improve generalization, sample efficiency, and interpretability relative to single-task approaches. The multi-task objective is typically formulated as a sum of task-specific prediction losses augmented by penalties or constraints that encode inter-task similarity, feature selection, shared support, or other structural information. Recent advances formalize these couplings via graph-based regularizers, convex clustering, block-sparsity, geometric transport, tensor decompositions, and manifold constraints, enabling broad applicability across regression, classification, online learning, time series forecasting, federated/distributed networks, and functional data analysis.

1. Core Formulations and Regularization Strategies

Canonical multi-task linear regression models parameterize each task $t$ with a vector $w_t$ and design matrix $X_t$, seeking to minimize

$$J(w) = \sum_{t=1}^T \|X_t w_t - y_t\|_2^2 + \lambda R(w)$$

where $R(w)$ couples the $w_t$ according to prior knowledge or learned relationships.
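As a concrete baseline, the objective decouples and admits a closed-form per-task solution when $R(w)$ is taken as the plain ridge penalty $\sum_t \|w_t\|_2^2$; the sketch below (synthetic data, illustrative only) evaluates $J(w)$ for that simplest choice, before the coupled regularizers discussed next:

```python
import numpy as np

# Minimal sketch: J(w) with the simplest decoupled choice R(w) = sum_t ||w_t||^2,
# in which case each task reduces to an independent ridge regression.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
Xs = [rng.normal(size=(20, 2)) for _ in range(2)]   # one design matrix per task
ys = [X @ w_true for X in Xs]                       # two related (here identical) tasks

lam = 0.1
# Per-task ridge: w_t = (X_t^T X_t + lam I)^{-1} X_t^T y_t
Ws = [np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y) for X, y in zip(Xs, ys)]

def J(W_list):
    """The multi-task objective: summed squared losses plus lam * R(w)."""
    fit = sum(np.sum((X @ w - y) ** 2) for X, w, y in zip(Xs, W_list, ys))
    return fit + lam * sum(np.sum(w ** 2) for w in W_list)
```

Coupled regularizers replace the last term of `J` with penalties that tie the $w_t$ together, which is where the multi-task gains come from.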

Graph-based regularization. In "Online Multi-Task Learning with Recursive Least Squares and Recursive Kernel Methods" (Lencione et al., 2023), $R(w)$ is a quadratic penalty

$$R(w) = \sum_{t=1}^T \sum_{j \in E_t} \| w_t \cdot \mathrm{sim}(t,j) - w_j \cdot \mathrm{sim}(j,t) \|_2^2 + \gamma \sum_t \|w_t\|_2^2$$

which is equivalently written as $w^\top(A \otimes I_d)w$ for a graph Laplacian $A$, enforcing smoothness of parameter vectors over the task-relationship graph.
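The equivalence between the pairwise penalty and the Kronecker quadratic form can be checked numerically; this sketch assumes unit similarity weights on a complete task graph, so $A$ is the complete-graph Laplacian:

```python
import numpy as np

T, d = 3, 2
rng = np.random.default_rng(1)
W = rng.normal(size=(T, d))   # row t holds w_t
w = W.reshape(-1)             # tasks stacked into one long vector

# Complete-graph Laplacian over T tasks (unit weights): degree T-1, off-diagonal -1.
A = T * np.eye(T) - np.ones((T, T))

# Quadratic form w^T (A kron I_d) w ...
quad = w @ np.kron(A, np.eye(d)) @ w
# ... equals the sum of pairwise squared differences over all task pairs.
pairwise = sum(np.sum((W[t] - W[j]) ** 2)
               for t in range(T) for j in range(t + 1, T))
```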

Task-feature prior and group sparsity. The formulation in "Multi-Task Learning with Prior Information" (Zhang et al., 2023) unifies data fit, group-sparse feature selection via the $\ell_{2,1}$ norm, prior-encoded feature constraints $DW$, and temporal or spatial task-smoothness:

$$\min_W \frac{1}{2} \sum_t \|X^{(t)} w^{(t)} - y^{(t)}\|_2^2 + \lambda \|W\|_{2,1} + \frac{\theta}{2} \|D W\|_F^2 + \frac{\epsilon}{2} \sum_{t=1}^{T-1} \| w^{(t)} - w^{(t+1)} \|_2^2$$

yielding a convex but nonsmooth program.

Block-$\ell_1/\ell_2$ regularization. Multi-task Lasso (multi-response regression) enforces joint support recovery using

$$\min_B \frac{1}{2n} \sum_{k=1}^K \| y^{(k)} - X^{(k)} \beta^{(k)} \|_2^2 + \lambda \sum_{j=1}^p \| B_{j,\cdot} \|_2$$

which promotes row-sparsity in $B=[\beta^{(1)},\ldots,\beta^{(K)}]$—i.e., shared selection of features across tasks (Wang et al., 2013).
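The row-sparsity mechanism can be made concrete with a short proximal-gradient loop: the prox of the block penalty is row-wise group soft-thresholding, which zeroes entire rows of $B$ at once. This is an illustrative sketch on synthetic data with a shared design $X$, not the authors' solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 50, 10, 4
X = rng.normal(size=(n, p))
B_true = np.zeros((p, K))
B_true[:3] = 1.0                          # rows 0-2 active in every task
Y = X @ B_true + 0.01 * rng.normal(size=(n, K))

def prox_l21(B, tau):
    """Prox of tau * sum_j ||B_{j,.}||_2: row-wise group soft-thresholding."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    return B * np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)

lam = 0.1
L_const = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
step = 1.0 / L_const
B = np.zeros((p, K))
for _ in range(500):
    grad = X.T @ (X @ B - Y) / n          # gradient of the smooth data-fit term
    B = prox_l21(B - step * grad, step * lam)
```

At convergence, rows outside the shared support are exactly zero, which is the "shared selection of features across tasks" behavior described above.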

Cross-learning and pairwise constraints. Cerviño et al. (Cervino et al., 2020) introduce

$$J(W) = \sum_{t=1}^T \|y_t - X_t w_t\|^2 + \lambda \sum_{t < s} \|w_t - w_s\|^2$$

where the pairwise penalty corresponds to a complete-graph Laplacian, admitting closed-form solutions and efficient projected SGD/dual projection algorithms.
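The closed form follows by stacking the tasks: the quadratic system couples the blocks through the complete-graph Laplacian, and as $\lambda$ grows the per-task solutions collapse to a common vector. A small numerical sketch (synthetic tasks, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, n = 3, 2, 30
Xs = [rng.normal(size=(n, d)) for _ in range(T)]
ys = [X @ rng.normal(size=d) for X in Xs]     # each task has its own true w_t

def solve_cross_learning(lam):
    """Closed-form minimizer of sum_t ||y_t - X_t w_t||^2
    + lam * sum_{t<s} ||w_t - w_s||^2 (complete-graph Laplacian coupling)."""
    Lap = T * np.eye(T) - np.ones((T, T))     # complete-graph Laplacian
    H = lam * np.kron(Lap, np.eye(d))
    g = np.zeros(T * d)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        H[t*d:(t+1)*d, t*d:(t+1)*d] += X.T @ X
        g[t*d:(t+1)*d] = X.T @ y
    return np.linalg.solve(H, g).reshape(T, d)

W_small = solve_cross_learning(1e-6)   # ~ independent per-task least squares
W_big = solve_cross_learning(1e6)      # heavy coupling: tasks nearly identical
```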

2. Structural Learning: Task Relationships, Clustering, and Precision Graphs

Sparse precision estimation. The Multi-task Sparse Structure Learning (MSSL) model (Goncalves et al., 2014) jointly estimates the parameter matrix $W$ and the task precision matrix $\Theta$:

$$\min_{W,\ \Theta \succ 0} \frac{1}{2} \sum_t \| y_t - X_t w_t \|_2^2 + \mathrm{Tr}(W \Theta W^\top) - \frac{T}{2} \log \det \Theta + \lambda \|\Theta\|_1 + \gamma \|W\|_1$$

recovering both task clusters/structure and parameter sparsity via alternating minimization (proximal gradient or ADMM for $W$, graphical lasso for $\Theta$).

Convex hierarchical clustering. In (Yu et al., 2017) and (Okazaki et al., 2023), convex clustering penalties such as $\sum_{(t,t')} r_{tt'} \| u^{(t)} - u^{(t')} \|_2$ (where the $u^{(t)}$ are cluster-centroid parameters) induce hierarchical partitions of tasks. Solutions track cluster fusion as the penalty parameter increases, achieving interpretable grouping and flexible sharing.
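The fusion behavior is visible already in the two-task case, which has a closed-form solution: the centroids merge exactly once the penalty weight reaches half the distance between the task-wise targets. This worked derivation is illustrative, not taken from the cited papers:

```python
import numpy as np

def two_task_convex_clustering(a, b, r):
    """Closed-form minimizer of 0.5*||u1-a||^2 + 0.5*||u2-b||^2 + r*||u1-u2||_2.
    Centroids fuse (u1 == u2) exactly when r >= ||a-b||/2; otherwise each
    centroid moves distance r toward the other along the line joining a and b."""
    gap = np.linalg.norm(a - b)
    if r >= gap / 2:                      # fused: both centroids at the midpoint
        m = (a + b) / 2
        return m, m.copy()
    d = (a - b) / gap                     # unit direction from b to a
    return a - r * d, b + r * d

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
u1, u2 = two_task_convex_clustering(a, b, r=0.2)   # below threshold: distinct
v1, v2 = two_task_convex_clustering(a, b, r=1.0)   # above threshold: fused
```

Sweeping `r` from 0 upward traces exactly the kind of fusion path that yields the hierarchical task partitions described above.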

3. Advanced Geometric and Tensor Regularization

Geometry-aware regularization. "Wasserstein regularization for sparse multi-task regression" (Janati et al., 2018) implements coupling via optimal transport between the (absolute values of) regression coefficients, allowing supports to align according to arbitrary feature geometry (distance matrix $M$), without requiring overlap:

$$\Omega(\beta^{(1)},\ldots,\beta^{(T)}) = \sum_{t < t'} W_\epsilon^\tau ( |\beta^{(t)}|, |\beta^{(t')}| )$$

where $W_\epsilon^\tau$ is the unbalanced entropic optimal transport divergence.
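A minimal balanced Sinkhorn iteration conveys the mechanism; note this is a sketch of the *balanced* entropic problem with hard marginal constraints, whereas the paper's $W_\epsilon^\tau$ is the unbalanced variant that relaxes the marginals with KL terms:

```python
import numpy as np

def sinkhorn(a, b, M, eps, n_iter=500):
    """Balanced entropic OT via Sinkhorn scaling (sketch; the unbalanced
    divergence used in the paper replaces the hard marginal constraints)."""
    K = np.exp(-M / eps)                 # Gibbs kernel from the ground cost
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # rescale to match column marginals
        u = a / (K @ v)                  # rescale to match row marginals
    P = u[:, None] * K * v[None, :]      # transport plan
    return P, float(np.sum(P * M))

# Toy feature geometry: 4 features on a line, cost = pairwise distance.
M = np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0)))
a = np.array([0.7, 0.3, 0.0, 0.0])       # |beta^(t)|, normalized, task 1
b = np.array([0.0, 0.0, 0.4, 0.6])       # |beta^(t')|, normalized, task 2
P, cost = sinkhorn(a, b, M, eps=0.1)
```

The resulting cost is small when the two supports are close in the feature geometry, even with zero overlap, which is precisely why it works as a support-alignment penalty.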

Tensorized multi-modal regression. The tLSSVM-MTL framework (Liu et al., 2023) leverages a CP decomposition over a high-order weight tensor, coupling via shared and task-specific latent factors across indexing modes. Each block update solves a linear system enforcing multilinear relationships, accommodating general multimodal data.

4. Online, Distributed, and Asynchronous Multi-Task Regression

Online recursive multi-task RLS. The MT-WRLS algorithm (Lencione et al., 2023) achieves exact and immediate updates for

$$w^* = [ X^\top X + \lambda (A \otimes I_d) ]^{-1} X^\top y$$

with per-instance complexity $\mathcal{O}(d^2 T^2)$, outperforming cubic-cost ADMM or suboptimal OGD schemes.
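The recursion behind such exact online updates is standard recursive least squares via the Sherman-Morrison identity. The sketch below shows the single-task version; the multi-task algorithm applies the same idea to the stacked system, with the $\lambda(A \otimes I_d)$ prior that this simplified code replaces by $\lambda I$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, lam = 3, 1e-3
w_true = rng.normal(size=d)

P = np.eye(d) / lam              # running inverse of (X^T X + lam I)
w = np.zeros(d)
rows, targets = [], []
for _ in range(100):
    x = rng.normal(size=d)
    y = float(w_true @ x)
    rows.append(x); targets.append(y)
    Px = P @ x
    k = Px / (1.0 + x @ Px)      # gain vector
    w = w + k * (y - x @ w)      # exact regularized LS after every sample
    P = P - np.outer(k, Px)      # Sherman-Morrison rank-one update

# Batch solution for comparison: the recursion tracks it exactly.
X_all, y_all = np.array(rows), np.array(targets)
w_batch = np.linalg.solve(X_all.T @ X_all + lam * np.eye(d), X_all.T @ y_all)
```

Each update costs $\mathcal{O}(d^2)$ here; on the stacked $Td$-dimensional multi-task system the same recursion gives the quoted $\mathcal{O}(d^2 T^2)$ per instance.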

Distributed networked multi-task learning. In (Hong et al., 2024), nodes estimate local linear models wiw_i, subject to in-group consensus penalties and cross-group precision coupling:

$$L(W) = \sum_i \ell_i(w_i) + \lambda_1 \sum_i \rho_i(w) + \lambda_2 \sum_\ell \rho_r(W^{(\ell)})$$

Convergence holds under bounded noise and strong convexity; asynchronous two-timescale updates minimize communication overhead and scale to heterogeneous, federated settings.

5. Statistical Guarantees, Sample Complexity, and Hyperparameter Selection

Sharp threshold for support recovery. (Wang et al., 2013) establishes that, under block-$\ell_1/\ell_2$ multi-task Lasso, the necessary and sufficient sample size for recovery of the support union $S$ is

$$n^*(s,p,K) \asymp \psi(B^*, \Sigma^{(1:K)}) \log(p-s)$$

with $\psi$ quantifying the joint sparsity and design structure. When support is shared, sample complexity per task drops by a factor $\sim 1/K$ compared to single-task Lasso.

Consistent regularization parameter estimation. Random matrix theory analysis (Ilbert et al., 2024) gives closed-form expressions for asymptotic training and test errors. The optimal regularization parameter $\lambda^*$ depends on the signal-to-noise ratio and can be rigorously estimated from training data under high-dimensional limits.

6. Algorithmic Solutions and Convergence Properties

Optimization schemes include:

  • Recursive updates (MT-WRLS, block-coordinate descent).
  • Proximal-gradient and accelerated ISTA/FISTA variants for nonsmooth convex programs (Zhang et al., 2023).
  • Alternating minimization for joint estimation of parameter and precision/task-structure matrices (Goncalves et al., 2014).
  • Block-coordinate descent with ADMM/proximal subsolvers for convex clustering (Okazaki et al., 2023).
  • Efficient Sinkhorn iterations for OT-based couplings (Janati et al., 2018).
  • Primal-dual projection for large-task constrained SGD (Cervino et al., 2020).

Convergence guarantees depend on convexity, regularity, and strong convexity of the underlying loss and penalty structure. Empirical and theoretical analyses confirm monotonic decrease of objectives, recovery of sparsity, and, in most cases, stationarity or global optimality when the problem is jointly convex.
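As a template for the accelerated proximal methods listed above, here is a FISTA loop for an $\ell_1$-penalized least-squares problem; the cited papers apply the same pattern (gradient step on the smooth part, prox step on the nonsmooth penalty, Nesterov extrapolation) to their respective multi-task penalties. This is a generic sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 60
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:5] = 1.0
y = X @ beta_true

lam = 0.05
L = np.linalg.norm(X, 2) ** 2 / n         # Lipschitz constant of the gradient

def soft(z, tau):
    """Soft-thresholding: prox of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def obj(v):
    return 0.5 / n * np.sum((X @ v - y) ** 2) + lam * np.sum(np.abs(v))

b, z, t = np.zeros(p), np.zeros(p), 1.0
for _ in range(300):
    b_next = soft(z - X.T @ (X @ z - y) / (n * L), lam / L)   # prox-gradient step
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0         # momentum schedule
    z = b_next + (t - 1.0) / t_next * (b_next - b)            # extrapolation
    b, t = b_next, t_next
```

Swapping `soft` for a group prox (rows of a matrix, fused differences, etc.) recovers the ISTA/FISTA variants used for the $\ell_{2,1}$ and clustering penalties above.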

7. Applications and Empirical Performance

Examined use cases include:

  • Wind speed time series forecasting—MT-WRLS and MT-OSLSSVR achieve ≈16% reduction in relative RMSE compared to single-task RLS and outperform cubic-cost batch solvers (Lencione et al., 2023).
  • School exam scores and Sarcos datasets—prior-encoded and smoothness-augmented models significantly improve explained variance and normalized MSE (Zhang et al., 2023).
  • Climate model combination—task-structure learning yields block-diagonal dependency graphs and better RMSE than Laplacian smoothing (Goncalves et al., 2014).
  • Remote-sensing-based trait prediction and GWAS—tree-structured MTL outperforms pre-grouped and no-group baselines, recovers meaningful hierarchy, and links genomic loci to structured outputs (Yu et al., 2017).
  • High-dimensional neuroimaging—Wasserstein regularization adapts to anatomical geometry, yielding interpretable, spatially-coherent supports (Janati et al., 2018).
  • Distributed student performance modeling—DAMTL converges efficiently on real district data, robust to asynchronous updates (Hong et al., 2024).
  • In-context multi-task learning with transformers—multi-head architectures implement debiased gradient descent in superposition, generalizing empirically to longer contexts and overlapping task supports (He et al., 17 Mar 2025).

In summary, multi-task linear regression constitutes a mature and rapidly evolving class of models capable of exploiting diverse forms of inter-task dependency: explicit graphs, latent clusters, block-sparsity, feature geometry, tensor factorization, and network topology. These approaches offer provable statistical and computational efficiency, interpretable structural discovery, and flexibility in online, distributed, and high-dimensional regimes.
