Unlinked Linear Regression (ULR)
- ULR is a regression framework where covariates and responses are unpaired, leading to unique challenges in identifiability and statistical inference.
- The methodology employs deconvolution, moment-based loss minimization, and techniques like concave minimization and algebraic-geometric initialization to estimate parameters.
- ULR is pivotal in applications such as privacy-preserving analysis, ecological regression, sensor fusion, and multi-target tracking, highlighting both theoretical and practical complexities.
Unlinked Linear Regression (ULR) refers to a regression setting where the fundamental pairing between covariates and responses is missing, such that instead of observing tuples (X, Y), one only observes the marginal distribution of X and the marginal distribution of Y (Balabdaoui et al., 20 Jul 2025). This paradigm encompasses problems commonly called “unlinked,” “unmatched,” or “regression without correspondence,” and arises in numerous areas including privacy-preserving statistical analysis, ecological regression, sensor fusion, multi-target tracking, and more. The methodological and theoretical challenges in ULR stem principally from issues of non-identifiability, increased statistical hardness, and algorithmic intractability relative to classical regression models.
1. Model Formulation and Fundamental Problem Structure
The canonical ULR model assumes that the observed data consist of two independently collected samples: X_1, …, X_m from a distribution P_X and Y_1, …, Y_n from a distribution P_Y, with P_Y the distribution of θ₀ᵀX + ε under some true but unknown parameter vector θ₀ (Balabdaoui et al., 20 Jul 2025, Azadkia et al., 2022, Durot et al., 14 Apr 2024). The pairing of each Y_j to each X_i is unobserved, eliminating any information about joint samples. As a result, ULR is formally set up as a marginal problem:
Y =_d θ₀ᵀX + ε, with X and Y unpaired, ε independent noise, and interest lying in identifying or estimating θ₀.
A central question is identifiability: for which distributions of X and ε is θ₀ uniquely determined (up to equivalence) by knowledge of the marginal distributions of X and Y alone (Balabdaoui et al., 20 Jul 2025)?
Compared to “shuffled regression”—in which the (X, Y) pairs are observed up to an unknown permutation—ULR is strictly harder: information about the pairing is completely erased (Durot et al., 14 Apr 2024, Azadkia et al., 2022).
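As a concrete picture of this setup, here is a minimal simulation sketch; the Gaussian design, noise level, and all variable names are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 3, 500, 400
theta0 = np.array([1.0, -2.0, 0.5])   # true but unknown parameter

# Two independently collected samples; no pairing between them exists.
X = rng.standard_normal((m, d))           # covariate sample
latent = rng.standard_normal((n, d))      # independent draws generating the Y's
Y = latent @ theta0 + 0.3 * rng.standard_normal(n)

# Only the two marginal samples reach the statistician; the rows of
# `latent` that produced each Y_j are never observed.
print(X.shape, Y.shape)
```

Note that the X-rows generating the responses are drawn independently of the observed X-sample, which is what makes the problem strictly harder than shuffled regression.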
2. Identifiability Theory
Identifiability in ULR is typically characterized by the solution set
Θ(θ₀) = {θ : θᵀX + ε =_d θ₀ᵀX + ε},
i.e., the collection of all parameter vectors θ that, when projected onto X, generate the same marginal law for Y as θ₀ (Balabdaoui et al., 20 Jul 2025, Azadkia et al., 2022).
When the components of X are i.i.d. standard normal (or, more generally, X is spherically symmetric), the solution set forms a sphere,
Θ(θ₀) = {θ : ‖θ‖ = ‖θ₀‖},
so only the norm of θ₀ is identifiable, not its direction (Balabdaoui et al., 20 Jul 2025, Azadkia et al., 2022). For elliptically symmetric X with mean μ and covariance Σ, one obtains
Θ(θ₀) = {θ : θᵀμ = θ₀ᵀμ and θᵀΣθ = θ₀ᵀΣθ₀}.
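The spherical non-identifiability can be checked analytically: under a standard normal design, θᵀX + ε ~ N(0, ‖θ‖² + σ²), so two parameter vectors with equal norms induce exactly the same Y-marginal. A minimal sketch assuming this Gaussian setting:

```python
import numpy as np

# Under X ~ N(0, I_d), theta^T X + eps ~ N(0, ||theta||^2 + sigma^2):
# the marginal law of Y depends on theta only through its norm.
sigma = 0.5
theta_a = np.array([3.0, 0.0, 0.0])
theta_b = np.array([1.0, 2.0, 2.0])   # different direction, same norm 3

var_a = theta_a @ theta_a + sigma**2
var_b = theta_b @ theta_b + sigma**2
assert var_a == var_b   # identical Y-marginals: theta_b lies in the solution set
print(var_a)
```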
When the components of X are independent but not identically distributed, finer identifiability obtains, often up to sign changes and/or coordinate permutations, especially if the components belong to certain scale families or possess distinct higher moments (Balabdaoui et al., 20 Jul 2025). In low dimensions, fourth-moment methods can (under nondegeneracy conditions) restrict the solution set to finitely many points (e.g., at most 8) distinguished by sign flips.
A critical general result is that non-Gaussian, non-exchangeable designs tend to admit stronger identifiability—sometimes uniquely up to signed permutation—whereas i.i.d. or Gaussian designs yield large, non-identifiable solution sets. However, identifiability can also be destroyed by more intricate symmetries, such as when one component is a sum/convolution of others (Balabdaoui et al., 20 Jul 2025).
The connection with Independent Component Analysis (ICA) is direct: in the multi-response ULR with independent X components, the model is mathematically equivalent to ICA with additive noise, and identifiability is thus tightly linked to the classic ICA ambiguity (sign and permutation of sources) (Balabdaoui et al., 20 Jul 2025).
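The role of non-Gaussianity can be illustrated with cumulants, which add across independent components. The following sketch assumes an i.i.d. Laplace(0, 1) design (variance 2, fourth cumulant 12 per coordinate); comparing cumulants is a necessary-condition check, not a full identifiability proof:

```python
import numpy as np

# For X with i.i.d. Laplace(0, 1) coordinates, cumulants are additive:
#   kappa2(theta^T X) = 2 * sum(theta**2),  kappa4(theta^T X) = 12 * sum(theta**4).
def cumulants(theta):
    t = np.asarray(theta, dtype=float)
    return 2 * np.sum(t**2), 12 * np.sum(t**4)

theta0 = np.array([1.0, 2.0, 2.0])
signed_perm = np.array([-2.0, 1.0, -2.0])   # sign flips + a permutation of theta0
same_norm = np.array([0.0, 0.0, 3.0])       # same norm, different direction

k2_0, k4_0 = cumulants(theta0)

# Signed permutations leave both cumulants (indeed the whole law) unchanged...
assert cumulants(signed_perm) == (k2_0, k4_0)
# ...but a same-norm vector in another direction changes kappa4, so it is
# excluded: unlike the Gaussian design, the full sphere is not a solution set.
k2_s, k4_s = cumulants(same_norm)
assert k2_s == k2_0 and k4_s != k4_0
print(k4_0, k4_s)
```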
3. Statistical and Computational Rates
Statistical estimation in ULR is typically pursued via deconvolution-based methods or estimators that minimize a moment-based or minimum-contrast loss (e.g., matching the law of Y to the convolution of the law of the transformed covariates θᵀX with the noise law) (Azadkia et al., 2022, Balabdaoui et al., 2020).
Minimax optimal rates for ULR estimation (in, e.g., Wasserstein or L2 risk, for monotone or affine link functions) critically depend on the noise smoothness and the sample size:
- For supersmooth error distributions (e.g., Gaussian, whose Fourier transform decays exponentially): the minimax risk behaves like the maximum of a logarithmic deconvolution term, driven by the noise level σ, and the parametric n^{-1/2} term. The n^{-1/2} rate is achieved only when the noise standard deviation σ lies below a threshold; otherwise (high or moderately decaying noise) the logarithmic deconvolution term dominates (Durot et al., 14 Apr 2024). This phase transition is absent in ordinary regression.
- Shuffled regression vs. ULR: In the small-noise regime, shuffled regression (where only the correspondence is hidden) enjoys strictly better minimax risk than ULR, demonstrating that estimation is fundamentally harder in the fully unlinked case (Durot et al., 14 Apr 2024).
Furthermore, signal-to-noise lower bounds demonstrate inherent statistical hardness in the unlinked scenario. For Gaussian covariates, the signal-to-noise ratio must be bounded below by a constant to permit approximate recovery, which is significantly higher than the classical regression threshold, where an SNR of order d/n already suffices (Hsu et al., 2017).
4. Algorithmic Approaches
A diverse range of algorithmic frameworks has been developed for ULR and related settings:
- Deconvolution Least Squares Estimator (DLSE): Estimates θ₀ by minimizing a squared Wasserstein (or L2) discrepancy between the empirical distribution of Y and the convolution of the law of θᵀX with the noise distribution (Azadkia et al., 2022). This approach achieves consistency and asymptotic normality up to identifiability equivalence; when identification is partial (e.g., up to norm or permutation), the estimator reliably recovers that identified feature.
- Semi-supervised enhancement: Incorporates a small set of paired (X, Y) observations to resolve indeterminacy in direction by using the (matched) OLS estimator for orientation and the (unmatched) DLSE for norm calibration (Azadkia et al., 2022).
- Mixture and Optimal Transport Approaches: For multivariate or monotone unlinked regression, methods combining the Kiefer–Wolfowitz NPMLE for deconvolving the mixture distribution with optimal transport (to couple the deconvolved mixing distribution with observed covariates) are effective, yielding explicit mean-squared error bounds and computational scalability (Slawski et al., 2022).
- Concave minimization (branch and bound): Reformulation of the correspondence-free maximum likelihood estimation into a concave minimization problem over a compact, computable, low-dimensional space, enabling tractable global solutions for moderate dimensions (up to 8) (Peng et al., 2020).
- Algebraic-geometric initialization: For small dimensions, approaches exploiting symmetric polynomials and algebraic geometry construct permutation-invariant constraints, leading to well-conditioned polynomial systems whose real roots are candidate regression vectors (Tsakiris et al., 2018).
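To make the DLSE idea concrete, a toy one-dimensional version (Gaussian design, so only |θ₀| is identifiable) can match empirical quantiles over a grid of candidates. The grid search, seeds, and names below are illustrative simplifications, not the estimator of Azadkia et al.:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, theta0 = 20000, 0.5, 1.2

# Unlinked data: the X-sample and the Y-sample come from independent draws.
X = rng.standard_normal(n)
Y = theta0 * rng.standard_normal(n) + sigma * rng.standard_normal(n)

def loss(theta):
    """Empirical 1-Wasserstein distance between the Y-sample and a sample
    from the candidate model theta*X + eps (common random numbers in eps)."""
    eps = sigma * np.random.default_rng(2).standard_normal(n)
    return np.mean(np.abs(np.sort(Y) - np.sort(theta * X + eps)))

grid = np.arange(0.0, 2.01, 0.05)
theta_hat = grid[int(np.argmin([loss(t) for t in grid]))]
print(theta_hat)   # close to |theta0| = 1.2 (only the norm is identifiable)
```

Sorting both samples computes the one-dimensional Wasserstein coupling exactly, which is why no transport solver is needed in this sketch.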
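The semi-supervised enhancement can likewise be sketched in a simplified form: direction from OLS on a few paired points, norm from the unmatched Y-marginal via the identity Var(Y) = ‖θ₀‖² + σ² for a standard normal design. This moment-matching calibration is a stand-in for the DLSE norm estimate, assumed here for brevity:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_unmatched, n_paired, sigma = 5, 50000, 60, 0.5
theta0 = np.array([2.0, -1.0, 0.0, 1.0, 2.0])

# Large unmatched response sample: reveals only the marginal law of Y.
Y = rng.standard_normal((n_unmatched, d)) @ theta0 + sigma * rng.standard_normal(n_unmatched)
# Small matched sample: enough to orient theta, too small to calibrate it well.
Xp = rng.standard_normal((n_paired, d))
Yp = Xp @ theta0 + sigma * rng.standard_normal(n_paired)

direction = np.linalg.lstsq(Xp, Yp, rcond=None)[0]
direction /= np.linalg.norm(direction)

# Norm calibration from the unmatched marginal: Var(Y) = ||theta0||^2 + sigma^2.
norm_hat = np.sqrt(max(Y.var() - sigma**2, 0.0))
theta_hat = norm_hat * direction
print(np.round(theta_hat, 2))
```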
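For the optimal-transport component, one dimension is especially transparent: the optimal coupling between two empirical laws on the line is the monotone (quantile) coupling. The following low-noise sketch skips the NPMLE deconvolution step entirely, an assumption made purely for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 2000, 0.05
m = np.exp   # true monotone link: m(x) = exp(x)

# Unlinked samples: the Y's are generated from independent uniform draws.
X = rng.uniform(-1, 1, n)
Y = m(rng.uniform(-1, 1, n)) + sigma * rng.standard_normal(n)

# On the real line, the optimal-transport coupling of two empirical laws is
# the monotone (quantile) coupling: i-th smallest X with i-th smallest Y.
x_sorted, y_sorted = np.sort(X), np.sort(Y)
def m_hat(x):
    return np.interp(x, x_sorted, y_sorted)

xs = np.linspace(-0.9, 0.9, 50)
err = np.max(np.abs(m_hat(xs) - m(xs)))
print(err)   # small: the quantile coupling recovers the monotone link
```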
Algorithmic tractability is generally possible only in low dimensions or under additional structure (e.g., monotonicity, positive definiteness, or independence across features). Most general ULR problems are NP-hard even to approximate well (Hsu et al., 2017, Peng et al., 2020).
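The symmetric-polynomial idea is most transparent in the noiseless one-dimensional shuffled setting: power sums of the responses are permutation-invariant, so each power sum yields a univariate polynomial constraint on θ. A minimal sketch with toy data (the method of Tsakiris et al. handles multivariate parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
theta0 = -1.7
x = rng.standard_normal(10)
y = theta0 * rng.permutation(x)   # shuffled, noiseless responses

# Power sums p_k(z) = sum(z**k) are invariant under permutations, so
# p_k(y) = theta^k * p_k(x): each k yields a polynomial constraint on theta.
p1 = lambda z: float(np.sum(z))
p2 = lambda z: float(np.sum(z**2))

theta_from_p1 = p1(y) / p1(x)          # exact here (requires p1(x) != 0)
theta_abs = np.sqrt(p2(y) / p2(x))     # identifies theta up to sign
print(theta_from_p1, theta_abs)
```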
5. Connections to Related Models and Broader Context
ULR is closely related to several statistical models:
- Shuffled regression: Observes all (X, Y) pairs up to an unknown permutation; intermediate in difficulty between classical and unlinked regression. Under small noise, identification is nearly as strong as with linked pairs, yielding faster rates (Durot et al., 14 Apr 2024).
- Deconvolution: Seeks to recover the law of a latent signal from observations contaminated by additive noise ε. ULR reduces to a deconvolution problem for the law of θᵀX once the unobserved pairing is marginalized out (Azadkia et al., 2022, Durot et al., 14 Apr 2024).
- Independent Component Analysis (ICA): As discussed, ULR with multi-response or vector outputs and independent X components coincides structurally with ICA, with robust identifiability guarantees (up to sign and permutation) if components are non-Gaussian (Balabdaoui et al., 20 Jul 2025).
The identification, estimation, and algorithmic principles of ULR can thus inform approaches in blind source separation, privacy-preserving data fusion, and unsupervised or semi-supervised learning regimes where direct correspondence is not available.
6. Open Problems and Research Directions
ULR presents a number of open theoretical and practical challenges:
- General identifiability characterization: While specialized identifiability results exist for non-Gaussian, independent components (often up to signed permutations), comprehensive theory for the general non-i.i.d., non-symmetric case remains elusive. Utilization of higher-order moments or cumulant tensors, in the spirit of modern ICA, is a promising direction (Balabdaoui et al., 20 Jul 2025).
- Statistical-computational tradeoffs: While NP-hardness holds in the worst case, average-case and conditional tractability (for example, under Gaussian designs or separation conditions) are only partially understood.
- Rates in non-supersmooth noise: Existing minimax rate results focus on supersmooth errors; extension to ordinary smooth or discrete error laws remains open (Durot et al., 14 Apr 2024).
- Robustness and practical algorithms: Algorithm development for high-dimensional, nonparametric, or semiparametric ULR settings with realistic noise, missing data, and outliers (beyond the standard Gaussian or fully observed settings) is ongoing.
- Extension to multivariate and generalized nonlinear link functions: While affine and monotone links have been studied, more general convex/monotone operator settings (e.g., optimal transport–identified maps) are only beginning to be explored (Slawski et al., 2022).
- Le Cam equivalence: Whether deconvolution, unlinked regression, and shuffled regression are asymptotically equivalent statistical experiments remains unresolved (Durot et al., 14 Apr 2024).
7. Practical Implications and Applications
The ULR framework is essential in multiple modern applications:
- Ecological and aggregate regression: Used when linking between group-level covariates and outcome summaries is fundamentally missing (Durot et al., 14 Apr 2024).
- Privacy-preserving and federated estimation: Enables learning when data custodians only provide marginal summaries for privacy reasons.
- Multi-target tracking and signal registration: Occurs when sensor and target data lack synchrony or reliable indexing (Durot et al., 14 Apr 2024).
- Computational neuroscience: Estimating transformations or matches between two sets of spatial or functional data without known pairings (e.g., neuron matching in C. elegans) (Nejatbakhsh et al., 2019).
In these settings, understanding identifiability, reliable estimation, and inherent hardness is critical for both the design and theoretical justification of analytic pipelines.
Unlinked Linear Regression is thus an archetype of modern statistical problems marked by missing pairwise linkage, challenging both the theory and practice of inference. The current literature provides partial solutions—often up to a set of equivalent parameters—under a tapestry of assumptions on design, noise, and structure. Further research, drawing on ideas from deconvolution, convex geometry, algebraic geometry, and ICA, continues to expand the scope and tractability of statistical inference in this setting.