Minimum Pair-wise Discriminant Gain

Updated 22 April 2026

Minimum pair-wise discriminant gain is a metric that quantifies the worst-case separability by measuring the minimum statistical divergence between any two class distributions.
It supports task-oriented designs in over-the-air computation by ensuring balanced decision boundaries for the most confusable class pairs.
Successive convex approximation techniques enable practical optimization of this non-convex metric in distributed edge AI and dimensionality reduction applications.

The minimum pair-wise discriminant gain is a classification-oriented metric that quantifies the worst-case separability between any two classes in a given feature space. It is defined as the minimum over all pairs of classes of a discriminant gain function, which typically measures the statistical distance or divergence between the class-conditional distributions. This criterion has emerged as a foundational objective in task-driven feature aggregation for over-the-air computation (AirComp), kernel discriminant analysis, and related dimensionality reduction methods. Unlike average discriminant gain metrics, which maximize mean class separation, minimum pair-wise gain explicitly targets the hardest-to-separate class pair, thereby promoting balanced inference accuracy across all classes (Jiao et al., 2024, Zhuang et al., 2023, Iosifidis, 2018).

1. Formal Definition and Mathematical Formulation

Let $\hat{\bf x} = (\hat x_1, \dots, \hat x_M)$ be the aggregated feature vector obtained at a central server, comprising $M$ features used to discriminate among $L$ target classes. Assume each coordinate $\hat x_m$ follows a Gaussian mixture with one component per class, equal class priors, and a class-independent variance $\hat\sigma_m^2$:

$\hat x_m \sim \frac{1}{L}\sum_{\ell=1}^L \mathcal N(\hat\mu_{\ell,m}, \hat\sigma_m^2)$

The discriminant gain between a pair of classes $(\ell, \ell')$ is given by

$G_{\ell, \ell'} = \sum_{m=1}^M \frac{(\hat\mu_{\ell,m} - \hat\mu_{\ell',m})^2}{\hat\sigma_m^2}$

The minimum pair-wise discriminant gain is then

$G_{\min} = \min_{1 \leq \ell < \ell' \leq L} G_{\ell, \ell'} = \min_{\ell \neq \ell'} \sum_{m=1}^M \frac{(\hat\mu_{\ell,m} - \hat\mu_{\ell',m})^2}{\hat\sigma_m^2}$

The objective in many task-oriented frameworks is to maximize $G_{\min}$ with respect to the relevant system parameters (e.g., aggregation weights, feature transformations, or transmission power levels) (Jiao et al., 2024, Zhuang et al., 2023, Iosifidis, 2018).
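
The definition above can be evaluated directly once the class means and per-feature variances are known. The following is a minimal sketch in plain Python; the function names and the toy numbers are illustrative assumptions, not taken from the cited works.

```python
from itertools import combinations

def pairwise_gains(means, variances):
    """Return {(l, l2): G_{l,l2}} for all class pairs l < l2.

    means[l][m] is the mean of feature m under class l;
    variances[m] is the class-independent variance of feature m.
    """
    gains = {}
    for l, l2 in combinations(range(len(means)), 2):
        gains[(l, l2)] = sum(
            (a - b) ** 2 / var
            for a, b, var in zip(means[l], means[l2], variances)
        )
    return gains

def min_pairwise_gain(means, variances):
    """Return (G_min, worst_pair), the smallest gain and the pair attaining it."""
    gains = pairwise_gains(means, variances)
    worst_pair = min(gains, key=gains.get)
    return gains[worst_pair], worst_pair

# Toy values (illustrative only): L = 3 classes, M = 2 features.
means = [[0.0, 0.0], [2.0, 0.0], [2.0, 1.0]]
variances = [1.0, 0.5]
g_min, worst = min_pairwise_gain(means, variances)
```

With these toy values the three pair-wise gains are 4, 6, and 2, so the worst pair is classes 1 and 2 (zero-based) and $G_{\min} = 2$.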

2. Distinction from Average Discriminant Gain and Associated Implications

The average pair-wise discriminant gain is defined as

$G_{\rm avg} = \frac{2}{L(L-1)} \sum_{1 \leq \ell < \ell' \leq L} G_{\ell, \ell'}$

While maximizing $G_{\rm avg}$ raises the mean separation across all class pairs, it does not address the scenario where some pairs remain poorly separated. In contrast, maximizing $G_{\min}$ raises a lower bound on the discriminative capability for every class pair, driving up the separation for the most confusable classes.

Theoretical and empirical analysis indicates that maximizing $G_{\min}$ ensures more uniform (and thus robust) classification accuracy across all classes, as it precludes low-margin class pairs that degrade worst-case performance. For instance, in federated/integrated AirComp settings, schemes optimized for $G_{\rm avg}$ may exhibit pronounced class imbalance, whereas $G_{\min}$-maximizing schemes yield consistently balanced decision boundaries and per-class accuracies (Jiao et al., 2024, Zhuang et al., 2023).
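
A small numerical illustration of this distinction is sketched below. The two "designs" are hypothetical one-feature class-mean layouts chosen for illustration; they do not come from the cited experiments.

```python
from itertools import combinations

def gains(means, variances):
    """Discriminant gain G_{l,l'} for every class pair l < l'."""
    return [
        sum((a - b) ** 2 / v for a, b, v in zip(mu_l, mu_k, variances))
        for mu_l, mu_k in combinations(means, 2)
    ]

def g_avg(means, variances):
    g = gains(means, variances)
    return sum(g) / len(g)  # len(g) equals L(L-1)/2

def g_min(means, variances):
    return min(gains(means, variances))

# Hypothetical layouts: L = 3 classes, M = 1 feature, unit variance.
designs = {
    "A": [[0.0], [0.5], [10.0]],  # one far-away class, two nearly overlapping
    "B": [[0.0], [4.0], [8.0]],   # evenly spaced classes
}
variances = [1.0]
summary = {
    name: (g_avg(mu, variances), g_min(mu, variances))
    for name, mu in designs.items()
}
```

Design A has the larger average gain (63.5 vs. 32) but a much smaller minimum gain (0.25 vs. 16): its average is dominated by the one well-separated class, while two classes remain almost indistinguishable.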

3. Optimization Strategies and SCA-Based Solutions

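The optimization of $G_{\min}$ is inherently non-convex due to the nested minimum and the nonlinear dependence of $G_{\ell,\ell'}$ on system parameters. In the AirComp paradigm, each device applies a transmission precoder to each feature before sending it over a wireless channel with a given channel gain. The aggregated feature statistics $\hat\mu_{\ell,m}$ and $\hat\sigma_m^2$ are then functions of these precoders and channel gains, together with the local class means and the device and channel noise variances.

This leads to a constrained max-min program, in which the precoders are chosen to maximize the worst pair-wise gain subject to the system constraints:

$\max_{\text{precoders}} \; \min_{1 \leq \ell < \ell' \leq L} G_{\ell, \ell'}$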
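
To make the dependence of $G_{\min}$ on the precoders concrete, the sketch below assumes a simple real-valued superposition model: each device scales its noisy local feature by its precoder, the channel scales it by the channel gain, and the server receives the sum plus channel noise. This model, the function names, and the numbers are assumptions for illustration and are not necessarily the signal model of the cited works.

```python
from itertools import combinations

def aggregated_stats(mu_local, h, b, dev_var, chan_var):
    """Class means and variances of the aggregated features (assumed model).

    mu_local[k][l][m]: mean of feature m under class l at device k
    h[k][m], b[k][m]:  real channel gain and precoder of device k, feature m
    dev_var[k]:        noise variance at device k
    chan_var:          channel noise variance at the server
    """
    K, L, M = len(mu_local), len(mu_local[0]), len(mu_local[0][0])
    mean = [[sum(h[k][m] * b[k][m] * mu_local[k][l][m] for k in range(K))
             for m in range(M)] for l in range(L)]
    var = [sum((h[k][m] * b[k][m]) ** 2 * dev_var[k] for k in range(K)) + chan_var
           for m in range(M)]
    return mean, var

def g_min(mean, var):
    """Minimum pair-wise discriminant gain of the aggregated features."""
    return min(
        sum((a - c) ** 2 / v for a, c, v in zip(mean[l], mean[l2], var))
        for l, l2 in combinations(range(len(mean)), 2)
    )

# Hypothetical setup: K = 2 devices, L = 2 classes, M = 1 feature.
mu_local = [[[0.0], [1.0]], [[0.0], [1.0]]]
h = [[1.0], [0.5]]
dev_var, chan_var = [0.1, 0.1], 1.0

g_equal = g_min(*aggregated_stats(mu_local, h, [[1.0], [1.0]], dev_var, chan_var))
g_inverted = g_min(*aggregated_stats(mu_local, h, [[1.0], [2.0]], dev_var, chan_var))
```

In this toy setup, compensating the weaker channel with a larger precoder raises the gain from 2.0 to about 3.33. The sketch imposes no transmit power limit, so it only shows the dependence on the precoders and not the constrained trade-off that the max-min program resolves.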

To address non-convexity, successive convex approximation (SCA) is employed:

Auxiliary variables introduce an epigraph form to decouple the nested minimum.
Non-convex terms are linearized (e.g., via first-order Taylor expansion) around the current iterates, resulting in a sequence of convex quadratically constrained quadratic programming (QCQP) subproblems.
Each subproblem is tractable (e.g., solvable with standard convex optimization tools such as CVX), and the algorithm iterates until convergence to a stationary point (Jiao et al., 2024, Zhuang et al., 2023).
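
As a sketch of the first two steps, and assuming each pair-wise gain keeps the quadratic-over-linear structure of $G_{\ell,\ell'}$ from Section 1, the max-min problem can be written in epigraph form with an auxiliary variable $t$ (a symbol introduced here for illustration), maximized jointly with the design variables:

$\max \; t \quad \text{s.t.} \quad G_{\ell,\ell'} \geq t, \quad 1 \leq \ell < \ell' \leq L$

Each term of $G_{\ell,\ell'}$ has the form $x^2/y$ with $y > 0$, where $x$ stands for a mean difference and $y$ for a variance (again illustrative symbols). This function is jointly convex, so its first-order Taylor expansion around the current iterate $(\bar x, \bar y)$ is a global lower bound:

$\frac{x^2}{y} \geq \frac{2\bar x}{\bar y}\, x - \frac{\bar x^2}{\bar y^2}\, y$

with equality at $(x, y) = (\bar x, \bar y)$. Replacing each term on the left of $G_{\ell,\ell'} \geq t$ by this bound gives a restricted constraint that is linear in $(x, y, t)$. How $x$ and $y$ depend on the actual design variables, and hence the exact form of each subproblem, follows the cited works and is not reproduced here.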

Key properties:

Each SCA iteration guarantees a non-decreasing objective.
The iterates converge to a stationary point, which need not be globally optimal.
Problem size scales with the number of devices, the feature dimension ( $M$ ), and the number of classes ( $L$ ).
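
The monotonicity property rests on the surrogate being a lower bound that is tight at the current iterate: maximizing such a surrogate cannot decrease the true objective. The sketch below checks both conditions numerically for a single quadratic-over-linear term $x^2/y$ and its first-order Taylor expansion. Treating the gain terms this way is an assumption based on the form of $G_{\ell,\ell'}$, and the function names are illustrative.

```python
def ratio(x, y):
    """Quadratic-over-linear term x^2 / y (jointly convex for y > 0)."""
    return x * x / y

def surrogate(x, y, x0, y0):
    """First-order Taylor expansion of x^2 / y around the point (x0, y0)."""
    return 2.0 * x0 / y0 * x - (x0 / y0) ** 2 * y

# Expansion point (current iterate) and a grid of test points with y > 0.
x0, y0 = 1.5, 2.0
grid = [(-3.0 + 0.5 * i, 0.5 + 0.5 * j) for i in range(13) for j in range(8)]

# The surrogate never exceeds the true term on the grid ...
is_lower_bound = all(
    surrogate(x, y, x0, y0) <= ratio(x, y) + 1e-12 for x, y in grid
)
# ... and matches it exactly at the expansion point.
is_tight = abs(surrogate(x0, y0, x0, y0) - ratio(x0, y0)) < 1e-12
```

The gap between the two functions equals $y\,(x/y - x_0/y_0)^2$, which is non-negative whenever $y > 0$, so the bound holds everywhere on the domain and not only on this grid.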

4. Role in Over-the-Air Computation and Edge AI

The minimum pair-wise discriminant gain underpins task-driven AirComp designs for edge-device co-inference. In this context:

Features from distributed edge devices are aggregated in the analog domain via synchronized wireless transmission, leveraging the superposition property of wireless channels.
Power/precoder design is adapted in a task-oriented manner, allocating more transmit energy to features that critically affect $G_{\min}$, since some features may be more informative for particularly hard-to-separate classes.
Joint (rather than independent) optimization across all feature elements enables fine-grained balancing of worst-case separability, a distinct improvement over prior element-wise or average-based approaches (Jiao et al., 2024, Zhuang et al., 2023).

This framework is applicable to integrated sensing-communication-computation (ISCC) systems and is especially effective for applications such as human motion recognition and other multi-device cooperative inference scenarios.

5. Relation to Classical and Kernel Discriminant Analysis

Classical and kernel-based discriminant analysis methods, such as Linear Discriminant Analysis (LDA), Kernel Discriminant Analysis (KDA), and component analysis methods, typically optimize (i) average inter-class distances or (ii) class-to-global-mean separation (with at most $L-1$ meaningful directions for $L$ classes).

The Class Mean Vector Component Analysis (CMVCA) (Iosifidis, 2018) preserves all pair-wise class-mean distances by selecting projections (eigenvectors) that maximize the weighted sum of squared differences between class means in the feature space.
In contrast to KPCA (which is unsupervised) and KDA (which emphasizes cluster-to-global separation), CMVCA and minimum pair-wise criteria explicitly monitor and guarantee strictly positive worst-case preserved distance for every class pair.
The per-pair discriminant gain in subspace selection reflects the fraction of separation preserved for each pair, and requiring this fraction to stay above a chosen threshold for every pair yields explicit worst-case bounds.
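
The per-pair monitoring described above can be sketched as follows. The code does not implement CMVCA itself: it assumes an ordered list of orthonormal projection directions is already available from some method, and the function names, the toy class means, and the threshold are illustrative.

```python
from itertools import combinations

def preserved_fractions(means, directions):
    """Fraction of squared class-mean distance kept by a projection, per pair.

    means[l] is the mean vector of class l (all means assumed distinct);
    directions is a list of orthonormal vectors spanning the subspace.
    """
    fractions = {}
    for i, j in combinations(range(len(means)), 2):
        diff = [a - b for a, b in zip(means[i], means[j])]
        total = sum(d * d for d in diff)
        kept = sum(sum(d * u for d, u in zip(diff, vec)) ** 2 for vec in directions)
        fractions[(i, j)] = kept / total
    return fractions

def smallest_dimension(means, ordered_directions, threshold):
    """Smallest d whose first d directions keep >= threshold for every pair.

    Returns (d, worst_fraction), or None if no prefix reaches the threshold.
    """
    for d in range(1, len(ordered_directions) + 1):
        worst = min(preserved_fractions(means, ordered_directions[:d]).values())
        if worst >= threshold:
            return d, worst
    return None

# Toy example: 3 classes in 3 dimensions, coordinate axes as directions.
means = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 3.0, 0.0]]
axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
worst_with_one = min(preserved_fractions(means, axes[:1]).values())
result = smallest_dimension(means, axes, threshold=0.9)
```

With one direction, the pair of classes that differ only along the second axis keeps none of its separation, so the worst-case fraction is 0. Adding the second direction preserves every pair fully, and the search stops at $d = 2$.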

Neural Discriminant Analysis (NDA) (Ha et al., 2021) in deep networks typically maximizes average (not minimum) pairwise class-centroid distances. In practice it can still yield larger minimum margins through regularization effects, but without the explicit guarantee that minimum pair-wise strategies provide.

6. Empirical Findings and Practical Impact

Extensive experiments on human motion recognition and related tasks confirm that:

SVM and MLP accuracy increases monotonically with $G_{\min}$.
AirComp schemes maximizing $G_{\min}$ achieve the most balanced and highest classification accuracy across all classes, outperforming average-based and MMSE baselines (e.g., in one reported multi-device configuration, SVM accuracy rises to about 92\% vs. 88\% for the baseline, and MLP accuracy to about 95\% vs. 91\%).
As the number of devices or total transmit power increases, $G_{\min}$-maximized designs retain uniform per-class performance, while alternatives exhibit class imbalance (Jiao et al., 2024, Zhuang et al., 2023).

This demonstrates the direct link between worst-case discriminant gain and robust, equitable classification in distributed inference.

7. Algorithmic and Implementation Considerations

A summary of algorithmic steps in tasks like kernel-based dimensionality reduction or task-oriented AirComp is as follows:

Compute class means and covariances from training data or aggregate signals.
Formulate the $G_{\min}$-maximization objective, specifying system variables (feature transform, transmission precoders, power levels).
Reformulate the max-min problem via auxiliary variables into an appropriate optimization framework (epigraph, d.c., or SCA).
Solve iteratively, updating the linearization point at each step until convergence.
In kernel settings, monitor the worst-case per-pair preserved distance as the embedding dimension increases, halting when a required lower bound is achieved (Iosifidis, 2018).
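
The first step can be sketched as below, assuming labeled training samples and the shared per-feature variance model of Section 1. A pooled within-class variance stands in here for full class covariances, and the function names and data are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def class_statistics(samples, labels):
    """Per-class feature means and a pooled per-feature variance."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n_feat = len(samples[0])
    means = {
        c: [sum(x[m] for x in xs) / len(xs) for m in range(n_feat)]
        for c, xs in by_class.items()
    }
    # Pooled variance: squared deviations from each sample's own class mean,
    # divided by (number of samples - number of classes).
    dof = len(samples) - len(by_class)
    variances = [
        sum((x[m] - means[y][m]) ** 2 for x, y in zip(samples, labels)) / dof
        for m in range(n_feat)
    ]
    return means, variances

def g_min(means, variances):
    """Minimum pair-wise discriminant gain from estimated statistics."""
    return min(
        sum((a - b) ** 2 / v for a, b, v in zip(means[c1], means[c2], variances))
        for c1, c2 in combinations(sorted(means), 2)
    )

# Toy data: two classes, two features, two samples per class.
samples = [[0.0, 0.0], [2.0, 2.0], [5.0, 1.0], [7.0, 3.0]]
labels = ["a", "a", "b", "b"]
means, variances = class_statistics(samples, labels)
gain = g_min(means, variances)
```

For this toy data the class means are (1, 1) and (6, 2), the pooled variance is 2 for both features, and the resulting gain is $25/2 + 1/2 = 13$. In an actual system these estimates would feed the objective formulated in the second step.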

Practical issues include the need for channel state information, class statistics, and computational tractability for large-scale edge-device networks. A plausible implication is that adoption of minimum pair-wise discriminant gain can form a foundation for fairness-driven or adversary-robust distributed learning schemes in wireless and federated settings.

References:

(Jiao et al., 2024) Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy
(Zhuang et al., 2023) Integrated Sensing-Communication-Computation for Over-the-Air Edge AI Inference
(Iosifidis, 2018) Class Mean Vector Component and Discriminant Analysis
(Ha et al., 2021) Learning a Discriminant Latent Space with Neural Discriminant Analysis