
FMAPLS: Bayesian EM for Label Shift

Updated 30 November 2025
  • The paper introduces a Bayesian EM framework that jointly estimates target class priors and Dirichlet hyperparameters to correct label shift in supervised learning.
  • It leverages a closed-form linear surrogate function for efficient hyperparameter updates, reducing KL divergence by up to 40% in severe imbalance scenarios.
  • Empirical evaluations on datasets like CIFAR100 and ImageNet-LT demonstrate significant accuracy gains and robustness in both batch and online adaptations.

Full Maximum A Posterior Label Shift (FMAPLS) is a Bayesian framework for label-shift correction in supervised learning. Under the label shift assumption—where the class prior distribution varies between source (training) and target (test) domains, but class-conditional likelihoods remain fixed—FMAPLS enables joint and dynamic estimation of both the unknown target priors and the Dirichlet hyperparameters that govern uncertainty over these priors. The method leverages Expectation-Maximization (EM) algorithms in both batch and online variants and introduces a closed-form Linear Surrogate Function (LSF) for efficient hyperparameter updates. Empirical results demonstrate that FMAPLS and its online form outperform previous maximum a posteriori-based label-shift estimators, particularly under severe class imbalance and distributional uncertainty, in terms of Kullback–Leibler divergence and classification accuracy (Hu et al., 23 Nov 2025).

1. Problem Formulation and Generative Model

FMAPLS addresses the canonical label shift scenario with the following structure:

  • Source (training) data: (X_s, Y_s) ~ P_s, where P_s(Y) is the source class prior π_s = (ε_1, ..., ε_K), and P_s(X|Y) is the known class-conditional likelihood.
  • Target (test) data: (X_t, Y_t) ~ P_t, assuming P_t(X|Y) = P_s(X|Y) but P_t(Y) = π_t ≠ π_s.
  • Classifier: trained on P_s, provides f_j(x) = P_s(Y=j|X=x). Under label shift, P(Y=j|X=x; π) ∝ f_j(x) π_j.
  • Bayesian model: places a Dirichlet prior on π with hyperparameter α = (α_1, ..., α_K) > 0:

p(\pi|\alpha)=\mathrm{Dir}(\pi;\alpha)=\frac{1}{B(\alpha)}\prod_{j=1}^K \pi_j^{\alpha_j-1},\qquad B(\alpha)=\frac{\prod_{j=1}^K\Gamma(\alpha_j)}{\Gamma\left(\sum_{j=1}^K\alpha_j\right)}

Optionally, a weak prior p(α) may be included.

Given N test samples {x_i}_{i=1}^N, the joint posterior for parameters θ = (π, α) is (up to normalization):

p(\pi,\alpha|\{x_i\}) \propto p(\alpha)\,\mathrm{Dir}(\pi;\alpha)\prod_{i=1}^N\sum_{j=1}^K f_j(x_i)\,\pi_j

The log-posterior (incomplete data) is:

\mathcal{L}(\pi, \alpha) = \log p(\pi|\alpha) + \log p(\alpha) + \sum_{i=1}^N \log\left(\sum_{j=1}^K f_j(x_i)\,\pi_j\right)
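The objective above can be evaluated directly from classifier outputs. A minimal NumPy/SciPy sketch (the function name and array conventions are illustrative, not from the paper; the log p(α) term is omitted, i.e. a flat hyperprior is assumed):

```python
import numpy as np
from scipy.special import gammaln

def log_posterior(pi, alpha, f):
    """Incomplete-data log-posterior L(pi, alpha), up to the log p(alpha) term.

    pi:    (K,) target class-prior estimate
    alpha: (K,) Dirichlet hyperparameters
    f:     (N, K) classifier posteriors f_j(x_i) on the test set
    """
    # log Dir(pi; alpha) = -log B(alpha) + sum_j (alpha_j - 1) log pi_j,
    # where -log B(alpha) = log Gamma(sum_j alpha_j) - sum_j log Gamma(alpha_j)
    log_dir = (gammaln(alpha.sum()) - gammaln(alpha).sum()
               + ((alpha - 1.0) * np.log(pi)).sum())
    # sum_i log sum_j f_j(x_i) pi_j
    log_lik = np.log(f @ pi).sum()
    return log_dir + log_lik
```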

2. Batch EM Algorithm for Joint Estimation

FMAPLS employs a batch Expectation-Maximization (EM) procedure by treating the unknown test labels Y_i as latent variables:

  • E-step: Computes posterior responsibilities

r_{ij}^{(t)} = P(y_i=j|x_i;\pi^{(t)}) = \frac{f_j(x_i)\,\pi_j^{(t)}}{\sum_{k=1}^K f_k(x_i)\,\pi_k^{(t)}}

  • M-step: Separately maximizes with respect to π and α using the expected complete-data log-posterior.

    • Update for π (closed form):

    \pi_j^{(t+1)} = \frac{\alpha_j^{(t)}-1 + R_j^{(t)}}{\sum_{k=1}^K \left(\alpha_k^{(t)}-1 + R_k^{(t)}\right)},\qquad R_j^{(t)} = \sum_{i=1}^N r_{ij}^{(t)}

    • Update for α (MAP estimate for the Dirichlet):

    \alpha^{(t+1)} = \arg\max_{\alpha>0} \left\{ -\log B(\alpha) + \sum_{j=1}^K (\alpha_j-1)\log\pi_j^{(t+1)} + \log p(\alpha)\right\}

    In standard MAPLS, this subproblem is solved via gradient ascent involving digamma functions, which becomes computationally expensive when K is large.
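The E-step and the closed-form π update can be sketched as follows, holding α fixed within each iteration (the α update, by gradient ascent or by the LSF below, is a separate step). This is a hypothetical NumPy helper, not the authors' reference implementation:

```python
import numpy as np

def fmapls_batch_em(f, alpha0, n_iters=100, eps=1e-12):
    """Batch EM for the target prior pi with alpha held fixed (sketch).

    f:      (N, K) source-classifier posteriors f_j(x_i)
    alpha0: (K,) Dirichlet hyperparameters (kept fixed here)
    """
    N, K = f.shape
    alpha = alpha0.astype(float)
    pi = np.full(K, 1.0 / K)                 # start from a uniform prior
    for _ in range(n_iters):
        # E-step: responsibilities r_ij ∝ f_j(x_i) pi_j, row-normalized
        r = f * pi
        r /= r.sum(axis=1, keepdims=True) + eps
        R = r.sum(axis=0)                    # per-class soft counts R_j
        # M-step for pi (closed form): pi_j ∝ alpha_j - 1 + R_j
        pi = np.clip(alpha - 1.0 + R, eps, None)
        pi /= pi.sum()
    return pi
```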

3. Linear Surrogate Function (LSF) Update

To overcome the computational and tuning issues of gradient-based updates for α, FMAPLS introduces a Linear Surrogate Function (LSF):

  • Key mechanism: replace the α-subproblem by enforcing α ∝ π with a large constant c:

\alpha_j \leftarrow \hat{c}\,\pi_j,\qquad \hat{c} := c/\max_k \pi_k

so that max_j α_j = c.

  • Rationale: direct substitution α_j = ĉ π_j yields updates that are asymptotically stationary as c → ∞ (gradient terms of order O(1/ĉ) vanish), so in practice a suitably large c provides an accurate approximation without iterative gradient steps.
  • Computational benefit: the per-iteration cost drops from O(T_grad · K) (gradient ascent) to O(K) (LSF closed form).
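The LSF update itself is a one-line rescaling; a minimal sketch (function name is my own):

```python
import numpy as np

def lsf_alpha_update(pi, c=50.0):
    """Linear Surrogate Function update: alpha ∝ pi, scaled so that
    max_j alpha_j = c. Closed form, O(K) per call."""
    c_hat = c / pi.max()
    return c_hat * pi
```

Because the update only rescales π, the relative class proportions encoded in α always match the current prior estimate.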

4. Online-FMAPLS for Streaming Data

The online-FMAPLS variant enables real-time adaptation to non-stationary or streaming data by employing stochastic approximation of sufficient statistics:

  • Stochastic responsibilities: at time step τ, for incoming x^τ, compute

B_j^\tau = \frac{f_j(x^\tau)\,\pi_j^\tau}{\sum_k f_k(x^\tau)\,\pi_k^\tau}

Maintain running statistics S_j^τ (per class) and s_0^τ (total), initialized as S_j^0 = 1, s_0^0 = 1.

  • Online update (with forgetting rate γ):

s_0^{\tau+1} = (1-\gamma)\,s_0^\tau + \gamma\cdot 1
S_j^{\tau+1} = (1-\gamma)\,S_j^\tau + \gamma\,B_j^\tau

  • M-step: Update

\pi_j^{\tau+1} = \frac{(\alpha_j^\tau - 1) + (1-\gamma) + \gamma B_j^\tau}{\sum_k \left[(\alpha_k^\tau - 1) + (1-\gamma) + \gamma B_k^\tau\right]}

and set α_j^{τ+1} = ĉ π_j^{τ+1}.

  • Complexity: O(K) per data sample, enabling scalable real-time operation.
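One online step, following the recursions stated above, can be sketched as (a hypothetical helper, not the authors' code; the running statistics s_0 and S_j are folded into the direct π update shown in the M-step formula):

```python
import numpy as np

def online_fmapls_step(pi, alpha, f_x, gamma=0.2, c=50.0, eps=1e-12):
    """One online-FMAPLS update for a single incoming sample (sketch).

    pi:    (K,) current prior estimate pi^tau
    alpha: (K,) current Dirichlet hyperparameters alpha^tau
    f_x:   (K,) classifier posterior f_j(x^tau) for the new sample
    """
    # stochastic E-step: B_j ∝ f_j(x^tau) pi_j^tau
    B = f_x * pi
    B /= B.sum() + eps
    # M-step: pi_j ∝ (alpha_j - 1) + (1 - gamma) + gamma B_j
    pi_new = np.clip(alpha - 1.0 + (1.0 - gamma) + gamma * B, eps, None)
    pi_new /= pi_new.sum()
    # LSF hyperparameter update: alpha ∝ pi with max_j alpha_j = c
    alpha_new = (c / pi_new.max()) * pi_new
    return pi_new, alpha_new
```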

5. Convergence–Accuracy Trade-Off

Under the LSF regime (α_j = ĉ π_j), the step size of the online algorithm is governed by c:

  • The iterative increment satisfies |π_j^{τ+1} − π_j^τ| = O(1/ĉ).
  • Interpretation: larger c yields more accurate (less biased) stationary points, but each update becomes smaller, slowing convergence.

A practical implication is that cc must be selected to balance estimation accuracy and adaptation speed: large enough for reliability, but not so large as to impede responsiveness, especially under concept drift or shifting priors.
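This trade-off is easy to observe numerically. The self-contained sketch below (illustrative names, assuming a flat p(α)) applies one online π update with α coupled to π via the LSF and compares the increment for a small versus a large c:

```python
import numpy as np

def online_step(pi, f_x, gamma=0.2, c=50.0):
    """One online pi update under the LSF coupling alpha_j = c_hat * pi_j."""
    c_hat = c / pi.max()
    alpha = c_hat * pi
    # stochastic responsibilities B_j ∝ f_j(x) pi_j
    B = f_x * pi
    B /= B.sum()
    pi_new = np.clip(alpha - 1.0 + (1.0 - gamma) + gamma * B, 1e-12, None)
    return pi_new / pi_new.sum()

pi0 = np.array([0.5, 0.5])
f_x = np.array([0.9, 0.1])
step_small_c = np.abs(online_step(pi0, f_x, c=10.0) - pi0).max()
step_large_c = np.abs(online_step(pi0, f_x, c=1000.0) - pi0).max()
# larger c gives a strictly smaller per-sample increment, matching O(1/c_hat)
```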

6. Empirical Performance Evaluation

Extensive experiments were conducted on long-tail variants of CIFAR100 (K = 100) and ImageNet-LT (K ≈ 1000):

  • Training priors: long-tail imbalanced, controlled by ρ ∈ {0.2, 0.1, 0.05, 0.02}.
  • Test priors: either shuffled long-tail or Dirichlet-drawn (symmetric α_test ∈ {1, 1.5, 2, 2.5, 3}).
  • Metrics: KL divergence D_KL(π_true ‖ π_est) and post-shift classification accuracy.
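The KL metric used above is the standard divergence between the true and estimated prior vectors; a small sketch (clipping added for numerical safety, which is my choice, not the paper's):

```python
import numpy as np

def kl_divergence(pi_true, pi_est, eps=1e-12):
    """D_KL(pi_true || pi_est) between two class-prior vectors."""
    p = np.clip(pi_true, eps, 1.0)
    q = np.clip(pi_est, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))
```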

Results, averaged over 100 runs, confirm:

  • FMAPLS reduces KL divergence by up to 40% over MAPLS in settings of severe imbalance (ρ = 0.02) and high prior uncertainty (α_test = 1).
  • Up to 3–4% absolute accuracy gains over MAPLS in challenging cases.
  • Online-FMAPLS achieves up to 12% KL reduction over MAPLS, with only a 0.5–1.0% relative accuracy drop versus batch FMAPLS.
  • Convergence (measured by KL) stabilizes within 2000 iterations on CIFAR100 and 10,000 iterations on ImageNet-LT.
| Method | Update Complexity | KL Reduction vs MAPLS | Typical Accuracy Effect |
| --- | --- | --- | --- |
| FMAPLS + gradient | O(NK + T_grad K) | up to 40% | 3–4% absolute gain |
| FMAPLS + LSF | O(NK + K) | up to 40% | 3–4% absolute gain |
| Online-FMAPLS | O(K) | up to 12% | 0.5–1.0% drop vs batch |

7. Implementation and Practical Guidance

FMAPLS is particularly robust in scenarios with pronounced class imbalance and uncertain or dynamically shifting target priors. The dynamic α adaptation provides a significant advantage over static-hyperparameter MAPLS approaches.

  • The LSF hyperparameter c should be chosen in the range 10–100; c ≈ 50–100 achieves reliable stationary points with reasonable convergence speed.
  • The forgetting rate γ for online-FMAPLS should typically fall in [0.1, 0.3], with larger values used for more rapid adaptation in highly non-stationary streams.
  • For N ≫ K, batch FMAPLS is recommended due to its efficiency and statistical stability; online-FMAPLS is appropriate when N is small or data arrive as a stream.

FMAPLS offers a Bayesian-EM framework for label-shift correction, accommodating dynamic target priors, with both batch and online variants. Its combination of closed-form surrogate updates and scalable computation makes it suitable for large-scale, imbalanced, or temporally-evolving domains (Hu et al., 23 Nov 2025).
