
BayesLineFit: Max-Likelihood Orthogonal Regression

Updated 30 December 2025
  • BayesLineFit is a rigorous method for maximum-likelihood orthogonal regression that achieves unbiased parameter estimation by modeling measurement errors in both axes and accounting for latent scatter.
  • It employs advanced normalizing flows and variational empirical Bayes to flexibly estimate intrinsic distributions and error correlations, enhancing robustness in under- and overdetermined cases.
  • The algorithm leverages GPU-accelerated gradient-based optimization with iterative quadrature, delivering superior performance for large noisy datasets and low SNR scenarios.

Maximum-likelihood Orthogonal Regression (BayesLineFit) is a rigorous statistical technique for line fitting where measurement errors occur in both dependent and independent variables. It achieves unbiased parameter estimation by maximizing the likelihood under a joint model that accounts for measurement uncertainties, intrinsic population distributions, and potential latent scatter orthogonal to the regression line. The framework generalizes classic orthogonal regression through maximum-likelihood estimation (MLE), and introduces computational advances via normalizing flows for empirical Bayes modeling of latent variable distributions, robust gradient-based optimization, and efficient iterative solutions for both overdetermined and underdetermined cases (Jing et al., 13 Nov 2024, Guo et al., 15 Jul 2025).

1. Model Formulation and Likelihood Structure

Maximum-likelihood orthogonal regression posits that each observed data point $(x_{{\rm obs},i}, \sigma_{x,i}, y_{{\rm obs},i}, \sigma_{y,i})$ arises from latent true values $(x_i, y_i)$ subject to Gaussian measurement noise

$$x_{{\rm obs},i} \sim \mathcal{N}(x_i, \sigma_{x,i}^2), \qquad y_{{\rm obs},i} \sim \mathcal{N}(y_i, \sigma_{y,i}^2).$$

A regression law with orthogonal scatter is imposed:
$$d_i = \frac{y_i - k x_i - b}{\sqrt{1 + k^2}}, \qquad d_i \sim \mathcal{N}(0, \sigma_{\rm int}^2),$$
where $k$ is the slope, $b$ the intercept, and $\sigma_{\rm int}$ the intrinsic orthogonal scatter. To account for the unobservable intrinsic distributions and correlations between measurement errors and values, population priors $P(x_i, \sigma_{x,i} \mid \phi_x)$ and $P(y_i, \sigma_{y,i} \mid \phi_y)$ are introduced, each parametrized by normalizing flows.

The joint likelihood for point $i$ is:
$$P_i = \int dx_i \, dy_i \; \mathcal{N}(x_{{\rm obs},i} \mid x_i,\sigma_{x,i}^2)\, \mathcal{N}(y_{{\rm obs},i} \mid y_i, \sigma_{y,i}^2) \left[\sqrt{1 + k^2}\, \mathcal{N}\!\left( \frac{y_i - k x_i - b}{\sqrt{1 + k^2}} \;\Big|\; 0, \sigma_{\rm int}^2 \right) \right] P(\sigma_{x,i} \mid x_i; \phi_{x,1})\, P(x_i; \phi_{x,2})\, P(\sigma_{y,i} \mid y_i; \phi_{y,1})\, P(y_i; \phi_{y,2}).$$
Under independence, the total likelihood is $L(\theta, \phi) = \prod_{i=1}^N P_i$, to be maximized with respect to the regression parameters $\theta = (k, b, \sigma_{\rm int})$ and the population parameters $\phi$ (Jing et al., 13 Nov 2024).
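As a concrete illustration, $P_i$ can be approximated with nested Gauss–Hermite quadrature over the latent $(x_i, y_i)$. The NumPy sketch below uses hypothetical flat priors in place of the normalizing-flow population terms; the function and argument names are illustrative, not the released implementation:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def point_likelihood(x_obs, sx, y_obs, sy, k, b, sigma_int, n_nodes=32,
                     prior_x=lambda x: 1.0, prior_y=lambda y: 1.0):
    """Approximate P_i by nested Gauss-Hermite quadrature over the latent (x_i, y_i).

    prior_x / prior_y stand in for the normalizing-flow population terms;
    the flat defaults are a simplification of the full model.
    """
    t, w = hermgauss(n_nodes)                   # nodes/weights for integral of e^{-t^2} f(t) dt
    x = x_obs + np.sqrt(2.0) * sx * t           # substitution absorbs N(x_obs | x, sx^2)
    y = y_obs + np.sqrt(2.0) * sy * t
    X, Y = np.meshgrid(x, y, indexing="ij")     # grid of latent (x_i, y_i) values
    Wx, Wy = np.meshgrid(w, w, indexing="ij")
    d = (Y - k * X - b) / np.sqrt(1.0 + k**2)   # orthogonal distance to the line
    line = (np.sqrt(1.0 + k**2) * np.exp(-0.5 * (d / sigma_int)**2)
            / (np.sqrt(2.0 * np.pi) * sigma_int))
    integrand = line * prior_x(X) * prior_y(Y)
    return np.sum(Wx * Wy * integrand) / np.pi  # 1/pi from the two Gaussian measurement kernels
```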

2. Variational Empirical Bayes and Latent Population Modeling

Given the latent nature of the population distributions $P(x)$, $P(\sigma_x \mid x)$ and their analogues for $y$, the methodology implements normalizing flows (NFs) for flexible, nonparametric density estimation. The flows establish a bijection $f_\phi$ from $(x, \sigma_x)$ to a latent space $z$ with a Gaussian base density, so that
$$P_\phi(x, \sigma_x) = P_z(f_\phi(x, \sigma_x)) \cdot \left|\det \partial f_\phi / \partial (x, \sigma_x)\right|.$$
Parameter estimation for the NF components $\phi_x$ and $\phi_y$ proceeds by marginal-likelihood maximization over the observed $(x_{{\rm obs},i}, \sigma_{x,i})$ and $(y_{{\rm obs},i}, \sigma_{y,i})$ using variational inference (e.g., EM) and 1D Gauss–Hermite quadrature. The pzflow library employing RollingSplineCoupling layers—two layers, 16 bins, and a log-rescaling input transformation—serves as the reference implementation (Jing et al., 13 Nov 2024).

This empirical Bayes procedure allows the regression model to accommodate arbitrary intrinsic distributions and error-value correlations, enhancing robustness beyond classical parametric assumptions.
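A minimal JAX sketch of the change-of-variables evaluation above, with a toy affine bijection standing in for a trained spline-coupling flow (all names illustrative):

```python
import jax.numpy as jnp
from jax import jacfwd
from jax.scipy.stats import multivariate_normal

def flow_density(f, u):
    """P_phi(u) = P_z(f(u)) * |det df/du| for a bijection f mapping (x, sigma_x) to z."""
    z = f(u)
    jac = jacfwd(f)(u)                                    # 2x2 Jacobian of the bijection at u
    log_det = jnp.log(jnp.abs(jnp.linalg.det(jac)))
    log_pz = multivariate_normal.logpdf(z, jnp.zeros(2), jnp.eye(2))  # Gaussian base density
    return jnp.exp(log_pz + log_det)

# Toy affine bijection; a real flow would be the trained spline-coupling model.
A = jnp.array([[1.5, 0.0], [0.3, 0.8]])
density = flow_density(lambda u: A @ u, jnp.array([0.2, -0.1]))
```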

3. Numerical Optimization Workflow

Optimization follows a staged procedure:

  • Preprocessing: Compute medians, rescale, and optionally shrink $x$ values for initialization.
  • NF Training: Pretrain flows for 200 epochs on rescaled data, then maximize the marginal likelihood using Adam with a learning rate of $3\times 10^{-3}$, full-batch training, and early stopping (patience of 10 epochs).
  • MLE Fit of $\theta$: With flows fixed, initialize $(k, b, \sigma_{\rm int})$ from alternative estimators (ODR, GMM). For each step, numerically compute $P_i(\theta)$ using nested Gauss–Hermite quadrature, sum the log-likelihood, backpropagate gradients, and update via Adam until convergence (parameter change threshold $<10^{-5}$).
  • Posterior Sampling (optional): Importance sampling on GPU, drawing from priors and re-weighting by likelihood.

All major computations (NF transforms, quadrature, gradient steps) are GPU-accelerated (Jing et al., 13 Nov 2024). This enables application to large datasets ($N>1000$) with high computational efficiency.
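A simplified, self-contained sketch of the MLE stage is given below in JAX with the optax optimizer (an assumption; the source only states Adam and JAX). Flat population priors stand in for the trained flows, $\sigma_{\rm int}$ is parametrized on a log scale for positivity, and all names are illustrative rather than the released API:

```python
import jax
import jax.numpy as jnp
import optax
from numpy.polynomial.hermite import hermgauss

t, w = map(jnp.asarray, hermgauss(32))          # fixed Gauss-Hermite nodes and weights

def log_lik(theta, x_obs, sx, y_obs, sy):
    """Total log-likelihood sum_i log P_i(theta), with flat population priors
    (a simplification; the full model multiplies in the trained flow densities)."""
    k, b, log_sig = theta
    sig = jnp.exp(log_sig)                                      # keep sigma_int positive
    X = x_obs[:, None, None] + jnp.sqrt(2.0) * sx[:, None, None] * t[None, :, None]
    Y = y_obs[:, None, None] + jnp.sqrt(2.0) * sy[:, None, None] * t[None, None, :]
    d = (Y - k * X - b) / jnp.sqrt(1.0 + k**2)                  # orthogonal residual
    line = (jnp.sqrt(1.0 + k**2) * jnp.exp(-0.5 * (d / sig)**2)
            / (jnp.sqrt(2.0 * jnp.pi) * sig))
    P = jnp.einsum("j,l,njl->n", w, w, line) / jnp.pi           # nested quadrature per point
    return jnp.sum(jnp.log(P))

def fit(theta0, data, lr=3e-3, tol=1e-5, max_steps=5000):
    """Adam ascent on the log-likelihood, stopping when the parameter change is < tol."""
    opt = optax.adam(lr)
    state = opt.init(theta0)
    grad = jax.jit(jax.grad(lambda th: -log_lik(th, *data)))    # minimize negative log-lik
    theta = theta0
    for _ in range(max_steps):
        updates, state = opt.update(grad(theta), state)
        new_theta = optax.apply_updates(theta, updates)
        if jnp.max(jnp.abs(new_theta - theta)) < tol:           # convergence threshold
            return new_theta
        theta = new_theta
    return theta
```

Here `data = (x_obs, sx, y_obs, sy)` as JAX arrays, and `theta0` would come from an ODR or GMM initialization as described in the workflow above.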

4. Generalized RV-ML Framework and Convexity Theory

Maximum-likelihood orthogonal regression is a specialization of the more general errors-in-variables problem with Gaussian model uncertainty. The underlying model is $y = (H + E)x + \epsilon$, where $H$ is a nominal matrix, $E$ is Gaussian noise independent across entries, and $\epsilon$ is output noise. Marginalizing over $E$ and $\epsilon$ leads to the effective noise covariance

$$\mathrm{Cov}[\epsilon_{\rm eq}] = \sigma_e^2 \|x\|_2^2 \, I_M + \sigma_\epsilon^2 I_M,$$

yielding a log-likelihood function for $x$:
$$\ell(x) = -\frac{1}{2} (y-Hx)^T \left[( \sigma_e^2 \|x\|_2^2 + \sigma_\epsilon^2 )I\right]^{-1} (y-Hx) - \frac{1}{2} M \ln( \sigma_e^2 \|x\|_2^2 + \sigma_\epsilon^2 ).$$
A convex reformulation (via SVD and lifting) facilitates the development of the generalized RV-ML (GRV-ML) algorithm, which handles both underdetermined and overdetermined systems with strong duality, improved complexity, and guaranteed global optima (Guo et al., 15 Jul 2025).
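Because the effective covariance is a scaled identity, $\ell(x)$ can be evaluated directly; a brief NumPy sketch (function name illustrative, additive constants dropped):

```python
import numpy as np

def grv_ml_loglik(x, y, H, sigma_e, sigma_eps):
    """Log-likelihood l(x) for y = (H + E)x + eps, up to an additive constant."""
    M = y.shape[0]
    var = sigma_e**2 * np.dot(x, x) + sigma_eps**2   # effective per-component noise variance
    r = y - H @ x                                    # residual under the nominal matrix H
    return -0.5 * np.dot(r, r) / var - 0.5 * M * np.log(var)
```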

5. Closed-form Characterization and Solution Regimes

By examining the KKT conditions of the lifted convex program, all solutions classify into four distinct regimes (A–D) depending on system rank and a scalar $S$ derived from the data, measurement matrix, and noise parameters. Each regime has explicit formulas for solution existence, uniqueness, and multiplicity.

  • Case A: $R \leq N-1$, $S \geq 0$ — infinitely many ML solutions.
  • Case B/D: $R \leq N-1$ or $R=N$, $S \neq 0$ — unique solution via bisection for the scalar $\nu^*$.
  • Case C: $R=N$, $S=0$ — unique closed-form solution.

See the full specification in (Guo et al., 15 Jul 2025) for precise formulas and interval choices. Bisection typically converges within $O(\log(T/\epsilon))$ steps.

6. Specialization to 2D Orthogonal Regression: BayesLineFit Algorithm

For the 2D orthogonal errors-in-variables line fit, the GRV-ML formalism reduces to parameter estimation for the line slope $m$ and intercept $c$ from a noisy set $(x_i, y_i)$:
$$m(\nu) = \frac{\sum_{i=1}^M x_i y_i / [1 + 2 \nu \sigma_e^2 x_i^2]}{\sum_{i=1}^M x_i^2 / [1 + 2 \nu \sigma_e^2 x_i^2]}, \qquad c(\nu) = \frac{\sum_{i=1}^M (y_i - m(\nu) x_i) / [1 + 2 \nu \sigma_e^2 x_i^2]}{\sum_{i=1}^M 1 / [1 + 2 \nu \sigma_e^2 x_i^2]}.$$
The scalar parameter $\nu$ is solved by bisection to satisfy $g(\nu)=0$, where $g(\nu)$ is derived from the KKT constraints. Iteration proceeds as: select $\nu$, compute $(m, c)$, evaluate $g(\nu)$, and update $\nu$, typically converging in $10$–$20$ steps. Per-iteration complexity is $O(M)$ (Guo et al., 15 Jul 2025). This method is referred to as BayesLineFit.
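The per-iteration update is a few lines of code. The NumPy sketch below implements $m(\nu)$, $c(\nu)$ and a generic bisection driver; the root function $g(\nu)$ comes from the KKT conditions in (Guo et al., 15 Jul 2025) and is not reproduced here, so it is left as a user-supplied callable. The bracket endpoints and sign convention are illustrative assumptions:

```python
import numpy as np

def m_c_of_nu(nu, x, y, sigma_e):
    """Slope and intercept as functions of the scalar dual variable nu."""
    wgt = 1.0 / (1.0 + 2.0 * nu * sigma_e**2 * x**2)
    m = np.sum(x * y * wgt) / np.sum(x**2 * wgt)
    c = np.sum((y - m * x) * wgt) / np.sum(wgt)
    return m, c

def bayeslinefit_2d(x, y, sigma_e, g, nu_lo=0.0, nu_hi=1e3, tol=1e-10):
    """Bisection on g(nu) = 0; g must encode the paper's KKT condition
    and is supplied by the caller (not derived here)."""
    while nu_hi - nu_lo > tol:
        nu = 0.5 * (nu_lo + nu_hi)
        m, c = m_c_of_nu(nu, x, y, sigma_e)
        if g(nu, m, c) > 0.0:          # move the bracket according to the sign of g
            nu_hi = nu
        else:
            nu_lo = nu
    return m_c_of_nu(0.5 * (nu_lo + nu_hi), x, y, sigma_e)
```

Each bisection step costs $O(M)$, consistent with the complexity stated above.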

7. Statistical Properties, Performance, and Implementation

The method achieves unbiased, self-consistent orthogonal regression that:

  • Accounts for noise in both axes,
  • Models arbitrary intrinsic distributions via normalizing flows,
  • Empirically infers latent error–value correlations $P(\sigma_x \mid x)$ and $P(\sigma_y \mid y)$,
  • Leverages GPU acceleration for scalability.

Performance benchmarks (Jing et al., 13 Nov 2024) indicate that with sample sizes $N > 1000$, both ML-based and KS-test-based methods significantly outperform classical regression in low-SNR scenarios, while for $300 < N < 1000$, ML-based approaches provide the best results. For $N < 300$, performance is comparable to existing state-of-the-art methods. The Python implementation "raddest", utilizing JAX and pzflow, will be released. The Cramér–Rao bound for model uncertainty is derived in closed form and reveals that input-matrix randomness can be beneficial, especially in underdetermined systems (Guo et al., 15 Jul 2025).

This suggests BayesLineFit is especially suited for modern astronomical datasets with complex distributions and strongly heterogeneous measurement errors, and may generalize to broader inverse problems with errors-in-variables and latent structure.
