MHFormer: Multi-Hypothesis Transformers

Updated 8 May 2026

MHFormer is a framework that integrates multiple hypotheses using structured basis functions to achieve loss-consistent, geometry-aware prediction.
It employs centroidal Bregman aggregation to balance bias, variance, and diversity, ensuring theoretical robustness in ensemble modeling.
The architecture supports continuous-depth parameterization and efficient model compression, making it effective for both classification and regression tasks.

Structured Basis Function Networks (s-BFN) are a class of ensemble learning and parameterization frameworks for deep networks that leverage structured, loss-centric aggregation of multiple hypotheses via basis function expansions. They encompass both function-space parameterizations in continuous-depth neural ODEs and structured, geometry-consistent ensemble combiners designed for loss-aware multi-hypothesis prediction. The s-BFN framework systematically controls the diversity of ensemble predictors and unifies multi-hypothesis and ensemble modeling under the geometry of the task loss, providing principled foundations for aggregation, compression, and regularization across regression and classification tasks (Queiruga et al., 2021, Dominguez et al., 2023, Dominguez et al., 2 Sep 2025).

1. Formal Definition and Core Principles

Structured Basis Function Networks consist of multiple base predictors (hypotheses) $h_1(x),\ldots,h_M(x)$, whose outputs are aggregated using basis functions that reflect the structure induced by the loss geometry. For each input $x_i$, a structured vector of predictions $D_i = [h_1(x_i), \ldots, h_M(x_i)]^\top \in \mathbb{R}^{d_D}$ is constructed.

Aggregation is governed by centroidal Bregman aggregation, yielding a loss-consistent centroid:

$\hat{z}_i = \arg\min_z \sum_{j=1}^M \alpha_j D_\phi(z_{ij}, z)$

where $D_\phi$ is the Bregman divergence associated with a convex potential $\phi$ linked to the task loss (e.g., squared loss for regression, cross-entropy for classification) (Dominguez et al., 2 Sep 2025). The s-BFN instantiates this by mapping $D_i$ via a basis function map $\Phi(D_i; \vartheta) \in \mathbb{R}^K$, often Gaussian RBFs with trainable centers and scales. The final prediction is

$\hat{y}_i = \Phi(D_i; \vartheta) \alpha$

with $\alpha \in \mathbb{R}^{K \times C}$ as learned ensemble weights (Dominguez et al., 2 Sep 2025). This formalism applies to both ensemble hypothesis aggregation and continuous-parameter models as basis-function expansions over latent variables such as depth (Queiruga et al., 2021).
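The centroid objective can be checked numerically. The sketch below is illustrative only: the helper names (`bregman_divergence`, `centroid_objective`) and the toy hypothesis outputs are assumptions, not part of the cited work. It relies on a standard property of Bregman divergences: when the minimization runs over the second argument, as in the objective above, the minimizer for nonnegative weights summing to one is the weighted mean of the hypothesis outputs, whichever strictly convex potential is used.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, p, q):
    """D_phi(p, q) = phi(p) - phi(q) - <grad_phi(q), p - q>."""
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

def centroid_objective(phi, grad_phi, hypotheses, weights, z):
    """Weighted sum of D_phi(z_j, z) over the hypothesis outputs z_j."""
    return sum(w * bregman_divergence(phi, grad_phi, h, z)
               for h, w in zip(hypotheses, weights))

# Two example potentials: half squared norm (squared loss) and
# negative entropy (generalized KL divergence, needs positive inputs).
def half_sq_norm(v):
    return 0.5 * np.dot(v, v)

def identity_grad(v):
    return v

def neg_entropy(v):
    return np.sum(v * np.log(v))

def neg_entropy_grad(v):
    return np.log(v) + 1.0

# Toy outputs of M = 3 hypotheses (rows) with normalized weights.
hypotheses = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])
weights = np.array([0.5, 0.25, 0.25])

# Candidate centroid: the weighted mean of the rows.
centroid = weights @ hypotheses
```

Evaluating `centroid_objective` at `centroid` and at nearby points, for either potential, shows the weighted mean attaining the smaller value. With the arguments of $D_\phi$ swapped the minimizer generally changes, so the argument order matters when implementing the aggregation.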

2. Mathematical Parameterization and Basis Expansions

In continuous-parameter settings, notably ODE-Nets, s-BFN parameterizes weights as continuous functions of a depth variable $t$ via global basis expansions:

$\theta(t) = \sum_{k=1}^{K} \theta_k \, \psi_k(t)$

where $\psi_k$ are fixed basis functions (e.g., piecewise-constant, piecewise-linear, or finite element), and $\theta_k$ are learnable tensors of the same shape as $\theta(t)$ (Queiruga et al., 2021). The parameter function $\theta(t)$ is injected into each ODE-block, ensuring global smoothness and compression, since the number of basis functions $K$ can be smaller than the number of discretization steps.
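As a minimal sketch of such an expansion, the snippet below evaluates a depth-dependent weight tensor from a piecewise-linear ("hat") basis on uniform nodes in $[0, 1]$. The function names, the depth symbol `t`, and the uniform node placement are assumptions made for illustration; the cited work may define its bases differently.

```python
import numpy as np

def hat_basis(t, num_basis):
    """Evaluate piecewise-linear hat functions at depth t in [0, 1].

    Nodes are placed uniformly (an illustrative choice); num_basis >= 2.
    Returns an array of shape (num_basis,).
    """
    nodes = np.linspace(0.0, 1.0, num_basis)
    width = nodes[1] - nodes[0]
    return np.clip(1.0 - np.abs(t - nodes) / width, 0.0, None)

def weights_at_depth(t, coeffs):
    """Return sum_k coeffs[k] * psi_k(t).

    coeffs has shape (K, *weight_shape); the result has shape weight_shape.
    """
    psi = hat_basis(t, coeffs.shape[0])
    return np.tensordot(psi, coeffs, axes=1)

# Three coefficient tensors of shape (2, 2) with values 0, 1, 2 at the nodes.
coeffs = np.stack([k * np.ones((2, 2)) for k in range(3)])
theta_quarter = weights_at_depth(0.25, coeffs)  # halfway between nodes 0 and 0.5
```

Because only `coeffs` is learned, the solver can query `weights_at_depth` at any depth it needs, independently of how many integration steps it takes.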

These expansions are not restricted to ODE-based architectures; in ensemble aggregation, structured basis mappings are constructed as Gaussian RBFs over the predictor outputs:

$\Phi_k(D_i) = \exp\!\left(-\frac{\|D_i - \mu_k\|^2}{2\sigma_k^2}\right), \quad k = 1, \ldots, K$

with centers $\mu_k$ and scales $\sigma_k$ derived from predictor statistics, forming a design matrix $\Phi \in \mathbb{R}^{N \times K}$ for $N$ examples and $M$ hypotheses (Dominguez et al., 2023).
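A Gaussian RBF design matrix over stacked predictor outputs might be built as follows. This is a sketch under assumptions: the function name, the argument layout, and the standard $\exp(-\|\cdot\|^2 / 2\sigma^2)$ form are illustrative, and how centers and scales are derived from predictor statistics is left to the caller.

```python
import numpy as np

def rbf_design_matrix(D, centers, scales):
    """Gaussian RBF features of the structured predictions.

    D:       (N, d) array, one row of stacked hypothesis outputs per example
    centers: (K, d) array of RBF centers
    scales:  (K,)   array of RBF widths
    Returns an (N, K) design matrix.
    """
    diff = D[:, None, :] - centers[None, :, :]   # (N, K, d)
    sq_dist = (diff ** 2).sum(axis=-1)           # (N, K)
    return np.exp(-sq_dist / (2.0 * scales[None, :] ** 2))

# Two examples, one center at the origin with unit scale.
D = np.array([[0.0, 0.0], [1.0, 0.0]])
Phi = rbf_design_matrix(D, centers=np.array([[0.0, 0.0]]), scales=np.array([1.0]))
```

An example that coincides with a center gets feature value 1, and the value decays with squared distance from the center.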

3. Diversity Control and Centroidal Aggregation

Maintaining adequate predictor diversity is critical to avoid mode collapse in winner-takes-all (WTA) multi-hypothesis systems. s-BFN introduces a diversity parameter $\varepsilon$ modulating the update assignment:

$w_{ij} = \begin{cases} 1 - \varepsilon, & j = \arg\min_{j'} \ell(h_{j'}(x_i), y_i) \\ \dfrac{\varepsilon}{M-1}, & \text{otherwise} \end{cases}$

This parameter softens the boundaries between hypothesis "cells" in the centroidal Voronoi tessellation of the output space. Empirically, intermediate values of $\varepsilon$ optimize the bias-variance-diversity trade-off, reducing both mode collapse and overfitting (Dominguez et al., 2023, Dominguez et al., 2 Sep 2025).
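One common way to relax a WTA assignment is sketched below, under the assumption that the diversity parameter (called `eps` here, a hypothetical name) follows the usual relaxed-WTA form: the best hypothesis for each example receives weight `1 - eps` and the remaining hypotheses share `eps` equally. The exact rule used in the cited work may differ.

```python
import numpy as np

def relaxed_wta_weights(losses, eps):
    """Per-example update weights for M >= 2 hypotheses.

    losses: (N, M) array of per-hypothesis losses
    eps:    diversity parameter in [0, 1]; eps = 0 recovers hard WTA
    Returns an (N, M) array whose rows sum to 1.
    """
    n, m = losses.shape
    weights = np.full((n, m), eps / (m - 1))
    winners = losses.argmin(axis=1)
    weights[np.arange(n), winners] = 1.0 - eps
    return weights

losses = np.array([[0.1, 0.5, 0.9],
                   [2.0, 1.0, 3.0]])
w = relaxed_wta_weights(losses, eps=0.2)
```

Multiplying per-hypothesis losses by `w` before summing gives every hypothesis some gradient signal, which is what keeps non-winning hypotheses from collapsing.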

The ensemble combiner aggregates the predictors' outputs using Bregman geometry, providing loss-consistent centroids for both regression and classification; this results in structured, geometry-respecting hypothesis aggregation (Dominguez et al., 2 Sep 2025).

4. Training Procedures and Closed-Form Estimators

Training s-BFNs is typically a two-stage process:

Stage I (Base Predictors): Each base model $h_j$ is optimized, possibly with relaxed WTA assignment using the diversity parameter $\varepsilon$. For ODE-Nets, this entails optimization of the basis coefficients.
Stage II (Structured Aggregation): A structured dataset $\{(D_i, y_i)\}_{i=1}^{N}$ of base predictions is constructed, and RBF statistics (centers $\mu_k$, scales $\sigma_k$) are computed.

For regression with squared loss, the ensemble weights are found via a closed-form ridge-regularized least-squares solution:

$\hat{\alpha} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top Y$

where $\Phi$ is the RBF design matrix, $\lambda$ is the ridge parameter, and $Y$ collects the targets (Dominguez et al., 2023, Dominguez et al., 2 Sep 2025). For non-quadratic losses (e.g., cross-entropy), gradient-based joint optimization updates both the base predictors and the combiner weights, as in s-BFN classification experiments (Dominguez et al., 2 Sep 2025).
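The closed-form ridge solution for the combiner weights can be sketched in a few lines. The function name and the toy inputs are illustrative assumptions; `Phi` stands for the design matrix of basis features and `Y` for the stacked targets.

```python
import numpy as np

def fit_ridge_weights(Phi, Y, lam):
    """Solve (Phi^T Phi + lam * I) alpha = Phi^T Y for alpha.

    Phi: (N, K) design matrix, Y: (N, C) targets, lam: ridge parameter.
    Returns alpha with shape (K, C).
    """
    K = Phi.shape[1]
    gram = Phi.T @ Phi + lam * np.eye(K)
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(gram, Phi.T @ Y)

# Toy check: with Phi = I and lam = 1 the targets are shrunk by one half.
Phi = np.eye(2)
Y = np.array([[1.0], [2.0]])
alpha = fit_ridge_weights(Phi, Y, lam=1.0)
y_hat = Phi @ alpha
```

Larger `lam` shrinks the weights toward zero, while `lam = 0` reduces to ordinary least squares whenever `Phi.T @ Phi` is invertible.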

In ODE-parametric s-BFN (Queiruga et al., 2021), after training with a large basis, compression is achieved via projection or interpolation onto a smaller basis, requiring only basis-level transformations and no data revisiting or retraining.
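One way such a projection could look is sketched below: sample the trained parameter function on a grid of depths and fit the coefficients of a smaller basis by least squares. The uniform piecewise-linear basis, the helper names, and the sampling grid are all assumptions for illustration; the source does not specify the projection procedure at this level of detail.

```python
import numpy as np

def hat_basis_matrix(ts, num_basis):
    """Piecewise-linear hat functions on uniform nodes in [0, 1].

    ts: (T,) depths. Returns a (T, num_basis) matrix of basis values.
    """
    nodes = np.linspace(0.0, 1.0, num_basis)
    width = nodes[1] - nodes[0]
    return np.clip(1.0 - np.abs(ts[:, None] - nodes[None, :]) / width, 0.0, None)

def project_coeffs(coeffs_large, num_small, num_samples=257):
    """Least-squares projection of basis coefficients onto a smaller basis.

    coeffs_large: (K_large, *weight_shape). Uses only basis evaluations,
    no training data. Returns an array of shape (num_small, *weight_shape).
    """
    ts = np.linspace(0.0, 1.0, num_samples)
    k_large = coeffs_large.shape[0]
    flat = coeffs_large.reshape(k_large, -1)
    target = hat_basis_matrix(ts, k_large) @ flat        # sampled theta(t)
    small, *_ = np.linalg.lstsq(hat_basis_matrix(ts, num_small), target, rcond=None)
    return small.reshape((num_small,) + coeffs_large.shape[1:])

# Weights that vary linearly with depth are representable in both bases,
# so projecting from 9 to 5 basis functions is exact here.
coeffs_large = np.linspace(0.0, 1.0, 9)[:, None, None] * np.ones((2, 2))
coeffs_small = project_coeffs(coeffs_large, num_small=5)
```

For parameter functions that the smaller basis cannot represent exactly, the same call returns the closest fit on the sampling grid, and the resulting accuracy change has to be measured empirically.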

5. Theoretical Properties

The Bregman centroidal aggregation gives s-BFN a principled theoretical basis. For strictly convex $\phi$, the ensemble centroid is the unique minimizer of the weighted Bregman objective,

$\hat{z}_i = \arg\min_z \sum_{j=1}^M \alpha_j D_\phi(z_{ij}, z)$

ensuring that aggregation aligns with the loss geometry (Dominguez et al., 2 Sep 2025). Analysis reveals additive decompositions of ensemble error:

$\text{ensemble error} = \overline{\text{bias}} + \overline{\text{variance}} - \text{diversity}$

quantifying diversity's negative contribution to ensemble error. PAC-Bayesian C-bounds relate disagreement and Gibbs risk to the risk of the majority-vote predictor, formalizing how increasing diversity can reduce total error in s-BFN aggregation (Dominguez et al., 2 Sep 2025).
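The negative contribution of diversity can be made concrete in the squared-loss case, where the classical ambiguity decomposition holds exactly for a single target: the error of a weighted-average ensemble equals the weighted average member error minus the weighted spread of members around the ensemble. The sketch below checks this identity numerically. It illustrates that special case only, with hypothetical names, and is not the general decomposition of the cited work.

```python
import numpy as np

def ambiguity_decomposition(preds, weights, y):
    """Squared-loss ambiguity decomposition for one scalar target.

    preds:   (M,) member predictions
    weights: (M,) nonnegative weights summing to 1
    Returns (ensemble_error, average_member_error, diversity), which satisfy
    ensemble_error = average_member_error - diversity.
    """
    ensemble = weights @ preds
    ensemble_error = (ensemble - y) ** 2
    average_member_error = weights @ (preds - y) ** 2
    diversity = weights @ (preds - ensemble) ** 2
    return ensemble_error, average_member_error, diversity

preds = np.array([1.0, 2.0, 4.0])
weights = np.array([0.5, 0.25, 0.25])
ens_err, avg_err, div = ambiguity_decomposition(preds, weights, y=3.0)
```

Here the members are individually worse on average than the ensemble, and the gap is exactly the diversity term.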

Basis expansions in continuous-depth s-BFN enforce parameter smoothness, stabilizing high-order integrators, increasing compressibility, and reducing memory footprint (Queiruga et al., 2021).

6. Empirical Benchmarks and Application Domains

Empirical studies report competitive or state-of-the-art performance of s-BFN frameworks across domains:

Image Classification: s-BFN with continuous batch normalization achieves 94.4% (CIFAR-10) and 79.9% (CIFAR-100), matching or exceeding deep ResNet baselines at lower parameter counts. Projecting onto a smaller basis reduces parameters by roughly 45% with only a 0.4–0.6% accuracy drop and roughly 30% faster inference (Queiruga et al., 2021).
Sequence Tagging: Continuous-in-depth transformers yield POS tagging accuracy of around 98%, with basis compression halving parameters at an accuracy loss on the order of 0.3% (Queiruga et al., 2021).
Tabular Regression: s-BFN attains lowest RMSE on “Air Quality” (22.46 vs SVM-RBF 29.83) and “Appliances Energy” (101.12 vs SVM-RBF 104.68) (Dominguez et al., 2023, Dominguez et al., 2 Sep 2025).
Ensemble Diversity: Intermediate values of the diversity parameter $\varepsilon$ yield optimal bias-diversity trade-offs; heterogeneous ensembles using s-BFN exhibit better performance and stability than logit averaging or Mixture-of-Experts on image datasets (Dominguez et al., 2 Sep 2025).

7. Practical Advantages and Extensions

s-BFN frameworks provide several practical benefits:

Controllable A Posteriori Compression: Basis function projections enable significant parameter and runtime reduction, with negligible accuracy degradation, without retraining (Queiruga et al., 2021).
Loss-Consistent Aggregation: Bregman-centric geometry ensures that ensemble predictions are aligned with the task loss, improving calibration and reliability (Dominguez et al., 2 Sep 2025).
Stateful Batch Normalization: Continuous-in-depth batch statistics parameterized via basis expansions are trained end-to-end in ODE architectures (Queiruga et al., 2021).
Diversity Regularization: The diversity parameter $\varepsilon$ provides explicit control over the specialization/generalization balance in multi-hypothesis and ensemble scenarios (Dominguez et al., 2023, Dominguez et al., 2 Sep 2025).

A plausible implication is that structured feature-space aggregation and basis expansion will underpin future scalable, robust, and easily compressible deep models capable of nontrivial predictive uncertainty handling and efficient deployment (Dominguez et al., 2 Sep 2025).