Expectation Maximization (EM) Converges for General Agnostic Mixtures

Published 7 Apr 2026 in cs.LG, cs.IT, and stat.ML | (2604.05842v1)

Abstract: Mixture of linear regression is well studied in statistics and machine learning, where the data points are generated probabilistically using $k$ linear models. Algorithms like Expectation Maximization (EM) may be used to recover the ground truth regressors for this problem. Recently, in \cite{pal2022learning,ghosh_agnostic}, the mixed linear regression problem was studied in the agnostic setting, where no generative model on data is assumed. Rather, given a set of data points, the objective is to \emph{fit} $k$ lines by minimizing a suitable loss function. It is shown that a modification of EM, namely gradient EM, converges exponentially to an appropriately defined loss minimizer even in the agnostic setting. In this paper, we study the problem of \emph{fitting} $k$ parametric functions to a given set of data points. We adhere to the agnostic setup. However, instead of fitting lines equipped with quadratic loss, we consider any arbitrary parametric function fitting equipped with a strongly convex and smooth loss. This framework encompasses a large class of problems including mixed linear regression (regularized), mixed linear classifiers (mixed logistic regression, mixed Support Vector Machines) and mixed generalized linear regression. We propose and analyze gradient EM for this problem and show that with proper initialization and a separation condition, the iterates of gradient EM converge exponentially to appropriately defined population loss minimizers with high probability. This shows the effectiveness of EM-type algorithms, which converge to the \emph{optimal} solution in the non-generative setup beyond mixture of linear regression.

Authors (1)

Avishek Ghosh

Summary

The paper demonstrates exponential convergence of the gradient EM algorithm in agnostic mixture models, given suitable initialization and separation, without standard generative assumptions.
The analysis uses a resampling (sample-splitting) scheme to compute weighted gradient steps on a soft-min loss, and it applies to any strongly convex and smooth base loss.
The analysis broadens prior mixed linear regression results, offering insights for robust unsupervised learning on heterogeneous data.

Exponential Convergence of Expectation Maximization for General Agnostic Mixtures

Introduction

The paper "Expectation Maximization (EM) Converges for General Agnostic Mixtures" (2604.05842) rigorously analyzes the convergence behavior of the gradient EM algorithm for fitting general mixtures of parametric models in fully agnostic settings. Unlike traditional mixture modeling, where data is assumed to be generated from a mixture of known generative models (e.g., Gaussian Mixture Models, Mixed Linear Regression), the agnostic setting places no assumptions on the data-generating process. Instead, the goal is to fit $k$ functions by minimizing a suitable empirical (population) loss. This work extends prior agnostic results on mixed linear regression to general parametric models with strongly convex and smooth loss functions, removing restrictive distributional assumptions and broadening the applicability of EM-type algorithms in unsupervised and robust machine learning contexts.

Problem Formulation

The task is to fit $k$ parameterized functions $f_{\theta_j}(x)$, $j = 1, \ldots, k$, to a dataset $\{(x_i, y_i)\}_{i=1}^n$, without a generative assumption tying $y_i$ to $x_i$ and a set of true parameters. The fitting objective is the minimization of a soft-min loss over the $k$ parameters: $\ell(\theta_1, \ldots, \theta_k; x, y) = \sum_{j=1}^k p_{\theta_1, \ldots, \theta_k}(x, y; \theta_j)\, F(x, y; \theta_j),$ where

$p_{\theta_1, \ldots, \theta_k}(x, y; \theta_j) = \frac{\exp(-\beta F(x, y; \theta_j))}{\sum_{l=1}^k \exp(-\beta F(x, y; \theta_l))}$

and $F$ is any base loss that is smooth and strongly convex with respect to the parameter $\theta$. This formulation generalizes various mixture modeling scenarios:

Regularized mixed linear regression,
Mixtures of linear classifiers (e.g., logistic regression, SVMs),
Mixtures of generalized linear models.
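
As a rough illustration of the soft-min loss defined above, the following sketch evaluates the weights $p(x, y; \theta_j)$ and the resulting loss for a single data point. The ridge-regularized squared loss, the function names, and the values of `lam` and `beta` are assumptions chosen for the example, not taken from the paper; any smooth, strongly convex base loss could be substituted for `base_loss`.

```python
import numpy as np

def base_loss(theta, x, y, lam=0.1):
    """Illustrative base loss F(x, y; theta): ridge-regularized squared loss.

    The ridge term makes F strongly convex in theta (an assumed example choice).
    """
    return 0.5 * (y - x @ theta) ** 2 + 0.5 * lam * (theta @ theta)

def soft_min_weights(thetas, x, y, beta=1.0):
    """Return the weights p_j proportional to exp(-beta * F_j), and the losses F_j."""
    losses = np.array([base_loss(t, x, y) for t in thetas])
    # Shift by the smallest loss before exponentiating for numerical stability;
    # the shift cancels in the normalized ratio.
    unnormalized = np.exp(-beta * (losses - losses.min()))
    return unnormalized / unnormalized.sum(), losses

def soft_min_loss(thetas, x, y, beta=1.0):
    """Soft-min loss: sum over j of p_j * F_j."""
    weights, losses = soft_min_weights(thetas, x, y, beta)
    return float(weights @ losses)

# Two candidate linear models and one data point (toy values).
thetas = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x, y = np.array([1.0, 2.0]), 1.0
weights, losses = soft_min_weights(thetas, x, y, beta=2.0)
print(weights, soft_min_loss(thetas, x, y, beta=2.0))
```

The first model fits this point exactly, so it receives the larger weight. Increasing `beta` concentrates the weights further on the model with the smallest base loss, so the soft-min loss approaches the minimum of the $k$ base losses.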

Algorithm Description

The paper focuses on gradient EM (as opposed to classical EM) applied to the empirical soft-min loss. Each iteration consists of:

Expectation (Soft Assignment): For each data point, compute the soft assignments (i.e., responsibilities) based on the current parameter estimates.
Maximization (Gradient Step): Update each parameter by a weighted gradient step on the base loss, where the weights correspond to the computed soft assignments.

Crucially, to make the empirical analysis tractable and avoid statistical dependencies between iterations, the analysis employs a resampling / sample-splitting approach: at each iteration, fresh samples are used to compute the gradient.
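
The two steps and the sample-splitting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the base loss is a ridge-regularized squared loss, the names `grad_em_step` and `gradient_em` are hypothetical, and the step size, `beta`, `lam`, and batch layout are arbitrary example values. The synthetic data at the end exists only to exercise the code; the agnostic setting itself assumes no such generative process.

```python
import numpy as np

def grad_em_step(thetas, X, Y, beta=1.0, step=0.5, lam=0.01):
    """One gradient EM iteration on a batch, for a ridge-regularized squared loss.

    thetas: (k, d) current parameters; X: (n, d) features; Y: (n,) targets.
    """
    resid = X @ thetas.T - Y[:, None]                                  # (n, k)
    losses = 0.5 * resid**2 + 0.5 * lam * np.sum(thetas**2, axis=1)    # (n, k)
    # Soft assignment: weights proportional to exp(-beta * loss), per data point.
    w = np.exp(-beta * (losses - losses.min(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)
    # Weighted gradient of the base loss, with the weights held fixed:
    # grad_theta F_j(x_i, y_i) = resid_ij * x_i + lam * theta_j.
    grads = ((w * resid).T @ X + lam * w.sum(axis=0)[:, None] * thetas) / len(Y)
    return thetas - step * grads

def gradient_em(thetas0, X, Y, n_iters=20, **kwargs):
    """Run gradient EM, giving each iteration its own disjoint batch of samples."""
    batches = np.array_split(np.arange(len(Y)), n_iters)
    thetas = thetas0.copy()
    for idx in batches:
        thetas = grad_em_step(thetas, X[idx], Y[idx], **kwargs)
    return thetas

# Toy data from two linear models (used only to have something to fit).
rng = np.random.default_rng(0)
n, d = 4000, 2
targets = np.array([[2.0, -1.0], [-2.0, 1.0]])
X = rng.normal(size=(n, d))
labels = rng.integers(0, 2, size=n)
Y = np.sum(X * targets[labels], axis=1) + 0.1 * rng.normal(size=n)

init = targets + np.array([[0.5, -0.5], [-0.5, 0.5]])   # start near the targets
thetas_hat = gradient_em(init, X, Y, n_iters=40)
err_init = np.linalg.norm(init - targets)
err_final = np.linalg.norm(thetas_hat - targets)
print(err_init, err_final)
```

Splitting the index range into `n_iters` disjoint batches is one simple way to emulate the fresh-samples-per-iteration assumption; reusing the full dataset at every step is common in practice but falls outside what the sample-splitting analysis covers.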

Main Theoretical Results

The primary contribution is a proof of exponential convergence (up to an error floor) of the gradient EM algorithm under the following assumptions:

The base loss $F$ is smooth and strongly convex.
The initial parameter estimates lie within a suitable neighborhood (in norm) of the loss minimizers.
A separation condition holds in the loss landscape, ensuring that each fitted function provides locally better predictions on a non-trivial subset of the input space (one whose measure is bounded below), and the minimizers are not degenerate.
No distributional assumptions are placed on the data points $(x_i, y_i)$ (beyond independence).

The main guarantee shows that after $T$ iterations, the distance between each parameter iterate and the corresponding population loss minimizer is bounded, with high probability, by a term that decays geometrically in $T$ plus an additive error floor. The contraction factor is determined by the contraction properties that follow from strong convexity and separation, and the error floor is explicit and depends on misspecification and the algorithm's design parameters.
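
One schematic way to write a guarantee of this shape is given below. The notation is assumed here for illustration and may differ from the paper's: $\theta_j^{(t)}$ is the $j$-th iterate after $t$ steps, $\theta_j^*$ the corresponding population loss minimizer, $\rho \in (0, 1)$ the contraction factor, and $\varepsilon$ the error floor.

$\|\theta_j^{(T)} - \theta_j^*\| \le \rho^{T}\, \|\theta_j^{(0)} - \theta_j^*\| + \varepsilon \quad \text{for all } j \in \{1, \ldots, k\}.$

Under a bound of this form, the geometric term drops below $\varepsilon$ once $T \ge \log\big(\|\theta_j^{(0)} - \theta_j^*\| / \varepsilon\big) / \log(1/\rho)$, after which the total error is at most $2\varepsilon$. The exact constants and the precise statement are those of the paper, not of this sketch.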

Notable Advancements and Claims

Uniform exponential convergence: Extends exponential convergence guarantees for EM-type algorithms from the generative mixed linear regression case to mixtures of general parametric models with strongly convex, smooth losses.
Agnostic setting: All generative assumptions (including Gaussianity of features or sub-Gaussianity) are removed; only independence of samples is needed.
Generalization of base loss: Incorporates regularized losses, generalized linear models, and smooth, strongly convex classification losses.
Error floor characterization: Outlines dependence of final estimation accuracy on model misspecification, initialization, and loss characteristics.
Initialization relaxations: Demonstrates that initialization within a suitable neighborhood of the population minimizer suffices for convergence, relaxing the initialization requirements of earlier agnostic works.
No sample complexity lower bounds: The analysis does not require explicit lower bounds on batch/sample size per iteration for convergence, a departure from earlier literature which typically assumes large batches or access to the population loss.

Implications

Practical Implications

Broader Applicability: The analysis enables the use of EM and gradient EM algorithms in highly non-ideal settings where the data distribution is heavy-tailed, adversarial, or unknown, extending their role in agnostic and robust machine learning.
Model Flexibility: The results allow practitioners to consider a wider class of base losses and function classes for mixture modeling, covering regularized regression, robust regression, and classification mixtures.
Initialization Sensitivity: While the requirement for a suitable initialization remains, the conditions are considerably relaxed, making practical deployment more feasible, especially when rough-but-reasonable initializers are possible.

Theoretical Implications

Foundations of Robust Unsupervised Learning: The work substantiates that EM-type algorithms are not intrinsically tied to generative models; they are suitable for direct empirical risk minimization in the presence of heterogeneity or misspecification.
Separation and Identifiability in Agnostic Contexts: It clarifies the geometric and statistical criteria (separation in loss landscape, misspecification) necessary to ensure that the EM algorithm can contract towards meaningful minimizers even when ground truth is undefined.
Reduction of Distributional Dependencies: By removing generative assumptions on the feature-label distribution (only independence of samples is retained), the results emphasize a robust learning perspective that is increasingly relevant for real-world, heterogeneous data.

Future Directions

The paper highlights several open research avenues:

Relaxation of Convexity/Smoothness: Extending the analysis to settings where the base loss $F$ is only weakly convex or non-smooth.
Beyond Sample Splitting: Removal of the resampling assumption, possibly by developing sophisticated leave-one-out or dependency-aware analyses.
Alternative Algorithms: Exploration of other iterative procedures (e.g., non-gradient EM variants, hard EM, alternating minimization) for general agnostic mixtures.
Optimality and Rates: Precise quantification of statistical and computational optimality in these highly general, agnostic frameworks.

Conclusion

This work provides a rigorous and general theoretical foundation for the use of gradient EM algorithms in agnostic mixture modeling. By leveraging convex-analytic tools, separation conditions, and smoothness properties, it demonstrates that EM-type approaches can achieve exponential convergence to meaningful population loss minimizers over a broad class of functions and data scenarios, well beyond the generative mixture models for which they were originally devised. The results support the development of robust, agnostic, and theoretically grounded machine learning algorithms, and the technical advances set out an agenda for future research at the intersection of optimization, statistics, and unsupervised learning.
