
MHFormer: Multi-Hypothesis Transformers

Updated 8 May 2026
  • MHFormer is a framework that integrates multiple hypotheses using structured basis functions to achieve loss-consistent, geometry-aware prediction.
  • It employs centroidal Bregman aggregation to balance bias, variance, and diversity, ensuring theoretical robustness in ensemble modeling.
  • The architecture supports continuous-depth parameterization and efficient model compression, making it effective for both classification and regression tasks.

Structured Basis Function Networks (s-BFN) are a class of ensemble learning and parameterization frameworks for deep networks that leverage structured, loss-centric aggregation of multiple hypotheses via basis function expansions. They encompass both function-space parameterizations in continuous-depth neural ODEs and structured, geometry-consistent ensemble combiners designed for loss-aware multi-hypothesis prediction. The s-BFN framework systematically controls the diversity of ensemble predictors and unifies multi-hypothesis and ensemble modeling under the geometry of the task loss, providing principled foundations for aggregation, compression, and regularization across regression and classification tasks (Queiruga et al., 2021, Dominguez et al., 2023, Dominguez et al., 2 Sep 2025).

1. Formal Definition and Core Principles

Structured Basis Function Networks consist of multiple base predictors (hypotheses) $h_1(x),\ldots,h_M(x)$, whose outputs are aggregated using basis functions that reflect the structure induced by the loss geometry. For each input $x_i$, a structured vector of predictions $D_i = [h_1(x_i), \ldots, h_M(x_i)]^\top \in \mathbb{R}^{d_D}$ is constructed.

Aggregation is governed by centroidal Bregman aggregation, yielding a loss-consistent centroid:

$$\hat{z}_i = \arg\min_z \sum_{j=1}^M \alpha_j D_\phi(z_{ij}, z)$$

where $z_{ij}$ is the output of hypothesis $j$ for input $x_i$ and $D_\phi$ is the Bregman divergence associated with a convex potential $\phi$ linked to the task loss (e.g., squared loss for regression, cross-entropy for classification) (Dominguez et al., 2 Sep 2025). The s-BFN instantiates this by mapping $D_i$ via a basis function map $\Phi(D_i; \vartheta) \in \mathbb{R}^K$, often Gaussian RBFs with trainable centers/scales. The final prediction is

$$\hat{y}_i = \Phi(D_i; \vartheta)\, \alpha$$

with $\alpha \in \mathbb{R}^{K \times C}$ as learned ensemble weights (Dominguez et al., 2 Sep 2025). This formalism applies to both ensemble hypothesis aggregation and continuous-parameter models as basis-function expansions over latent variables such as depth (Queiruga et al., 2021).
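As a hedged illustration of the centroid objective above, the following sketch assumes the squared-loss potential, phi(z) = 0.5 * ||z||^2, for which the Bregman divergence reduces to half the squared Euclidean distance and the minimizer over the second argument is the weighted mean of the hypothesis outputs. The function names (`bregman`, `objective`, `weighted_mean`) and the toy numbers are illustrative assumptions, not taken from the cited papers.

```python
def bregman(phi, grad_phi, a, b):
    """D_phi(a, b) = phi(a) - phi(b) - <grad phi(b), a - b> for list-valued vectors."""
    g = grad_phi(b)
    return phi(a) - phi(b) - sum(gi * (ai - bi) for gi, ai, bi in zip(g, a, b))


# Assumed potential for squared loss: phi(z) = 0.5 * ||z||^2.
def phi_sq(z):
    return 0.5 * sum(v * v for v in z)


def grad_phi_sq(z):
    return list(z)


def objective(z, preds, weights):
    """Weighted sum of divergences from each hypothesis output to a candidate z."""
    return sum(w * bregman(phi_sq, grad_phi_sq, p, z) for w, p in zip(weights, preds))


def weighted_mean(preds, weights):
    """Weighted arithmetic mean, the minimizer of `objective` for this potential."""
    total = sum(weights)
    dim = len(preds[0])
    return [sum(w * p[d] for w, p in zip(weights, preds)) / total for d in range(dim)]


preds = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]  # M = 3 hypothesis outputs for one input
weights = [0.5, 0.3, 0.2]                      # aggregation weights (hypothetical values)
centroid = weighted_mean(preds, weights)
```

Evaluating `objective` at `centroid` and at nearby perturbed points is a quick way to confirm that the weighted mean is the minimizer in this special case; other potentials would require swapping in a different `phi` and `grad_phi`.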

2. Mathematical Parameterization and Basis Expansions

In continuous-parameter settings, notably ODE-Nets, s-BFN parameterizes the weights as continuous functions of a depth variable $t$ via global basis expansions:

$$\theta(t) = \sum_{k=1}^{K} \theta_k\, \psi_k(t)$$

where $\psi_k$ are fixed basis functions (e.g., piecewise-constant, piecewise-linear, or finite element), and $\theta_k$ are learnable tensors of the same shape as $\theta(t)$ (Queiruga et al., 2021). The parameter function $\theta(t)$ is injected into each ODE-block, ensuring global smoothness and compression, since the number of basis functions $K$ is decoupled from the number of discretization steps.
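The depth-wise basis expansion described above can be sketched as follows. This is a minimal illustration that assumes piecewise-linear "hat" functions on a uniform grid, a depth interval normalized to [0, 1], and weight tensors flattened to plain lists; the names `hat_basis` and `weights_at_depth` are hypothetical and do not come from the cited work.

```python
def hat_basis(k, t, num_basis):
    """Piecewise-linear hat function k on a uniform grid of num_basis nodes over [0, 1]."""
    if num_basis == 1:
        return 1.0
    spacing = 1.0 / (num_basis - 1)
    return max(0.0, 1.0 - abs(t - k * spacing) / spacing)


def weights_at_depth(coeffs, t):
    """Evaluate the weight function at depth t as a basis-weighted sum of coefficients.

    coeffs[k] is the flattened coefficient tensor attached to basis function k.
    """
    num_basis = len(coeffs)
    size = len(coeffs[0])
    theta = [0.0] * size
    for k, coeff in enumerate(coeffs):
        b = hat_basis(k, t, num_basis)
        for i in range(size):
            theta[i] += b * coeff[i]
    return theta


coeffs = [[0.0, 1.0], [2.0, 1.0], [4.0, -1.0]]  # three coefficient tensors (toy values)
theta_quarter = weights_at_depth(coeffs, 0.25)   # weights used by a block at depth 0.25
theta_end = weights_at_depth(coeffs, 1.0)        # weights at the final depth
```

Because the weights are a function of depth rather than a per-layer table, an ODE solver can query `weights_at_depth` at whatever depths its step scheme requires.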

These expansions are not restricted to ODE-based architectures; in ensemble aggregation, structured basis mappings are constructed as Gaussian RBFs over the predictor outputs:

$$\Phi_k(D_i) = \exp\!\left(-\frac{\lVert D_i - c_k \rVert^2}{2\sigma_k^2}\right)$$

with centers $c_k$ and scales $\sigma_k$ derived from predictor statistics, forming a design matrix $\Phi$ with one row for each of the $N$ examples, built from the outputs of the $M$ hypotheses (Dominguez et al., 2023).
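To make the design-matrix construction concrete, the sketch below builds Gaussian RBF features over stacked predictor outputs. It assumes isotropic Gaussians with one center and one scale per basis function; how the cited papers derive centers and scales from predictor statistics is not reproduced here, so the values used are placeholders, and `rbf_features` and `design_matrix` are hypothetical names.

```python
import math


def rbf_features(d, centers, scales):
    """Gaussian RBF features exp(-||d - c||^2 / (2 s^2)) for one prediction vector d."""
    feats = []
    for c, s in zip(centers, scales):
        sq_dist = sum((di - ci) ** 2 for di, ci in zip(d, c))
        feats.append(math.exp(-sq_dist / (2.0 * s * s)))
    return feats


def design_matrix(structured_preds, centers, scales):
    """One row of RBF features per example."""
    return [rbf_features(d, centers, scales) for d in structured_preds]


# Each row stacks the outputs of two hypotheses for one example (toy values).
structured_preds = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
centers = [[0.0, 0.0], [2.0, 0.0]]  # placeholder centers
scales = [1.0, 1.0]                 # placeholder scales
Phi = design_matrix(structured_preds, centers, scales)
```

Each entry of `Phi` lies in (0, 1] and equals 1 exactly when a prediction vector coincides with a center, which is why the choice of centers determines which regions of prediction space each basis function responds to.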

3. Diversity Control and Centroidal Aggregation

Maintaining adequate predictor diversity is critical to avoid mode collapse in winner-takes-all (WTA) multi-hypothesis systems. s-BFN introduces a diversity parameter that modulates the update assignment, relaxing the hard WTA rule so that hypotheses other than the winner also receive part of each update.

This parameter softens the boundaries between hypothesis "cells" in the centroidal Voronoi tessellation of the output space. Empirically, intermediate values optimize the bias-variance-diversity trade-off, reducing both mode collapse and overfitting (Dominguez et al., 2023, Dominguez et al., 2 Sep 2025).
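One common way to relax a winner-takes-all assignment, used in the multiple-hypothesis-prediction literature, gives the best hypothesis a weight of 1 - eps and shares eps equally among the rest. Whether s-BFN uses exactly this rule is an assumption of this sketch; `eps`, `relaxed_wta_weights`, and `relaxed_wta_loss` are hypothetical names.

```python
def relaxed_wta_weights(losses, eps):
    """Per-hypothesis update weights for one training example.

    The lowest-loss hypothesis gets 1 - eps; the others share eps equally.
    eps = 0 recovers hard winner-takes-all.
    """
    m = len(losses)
    if m < 2:
        return [1.0] * m
    winner = min(range(m), key=lambda j: losses[j])
    return [1.0 - eps if j == winner else eps / (m - 1) for j in range(m)]


def relaxed_wta_loss(losses, eps):
    """Training loss for one example: per-hypothesis losses weighted by the assignment."""
    return sum(w * l for w, l in zip(relaxed_wta_weights(losses, eps), losses))


losses = [0.9, 0.2, 0.5]                   # per-hypothesis losses (toy values)
hard = relaxed_wta_weights(losses, 0.0)    # only the winner is updated
soft = relaxed_wta_weights(losses, 0.3)    # non-winners also receive gradient
soft_loss = relaxed_wta_loss(losses, 0.3)
```

With `eps = 0` only the winning hypothesis receives gradient, which is the regime where unused hypotheses can collapse; larger values pull all hypotheses toward every target and reduce specialization.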

The ensemble combiner aggregates the predictors' outputs using Bregman geometry, providing loss-consistent centroids for both regression and classification; this results in structured, geometry-respecting hypothesis aggregation (Dominguez et al., 2 Sep 2025).

4. Training Procedures and Closed-Form Estimators

Training s-BFNs is typically a two-stage process:

  • Stage I (Base Predictors): Each base model $h_j$ is optimized, possibly with relaxed WTA assignment controlled by the diversity parameter. For ODE-Nets, this entails optimization of the basis coefficients.
  • Stage II (Structured Aggregation): A structured dataset of base predictions $D_i$ is constructed, and RBF statistics (centers and scales) are computed.

For regression with squared loss, the ensemble weights are found via a closed-form ridge-regularized least-squares solution:

$$\hat{\alpha} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top Y$$

where $\Phi$ is the design matrix of basis-function features, $\lambda$ is the ridge parameter, and $Y$ collects the targets (Dominguez et al., 2023, Dominguez et al., 2 Sep 2025). For non-quadratic losses (e.g., cross-entropy), gradient-based joint optimization updates both the base predictors and the combiner weights, as in s-BFN classification experiments (Dominguez et al., 2 Sep 2025).
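The closed-form ridge step can be sketched in a few lines. This illustration assumes a single output column and uses a small pure-Python Gaussian elimination in place of a linear-algebra library; `ridge_weights` and the toy data are hypothetical and only demonstrate the standard ridge normal equations.

```python
def ridge_weights(Phi, y, lam):
    """Solve (Phi^T Phi + lam * I) alpha = Phi^T y for a single target column."""
    k = len(Phi[0])
    # Normal-equation matrix and right-hand side.
    A = [[sum(row[i] * row[j] for row in Phi) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * t for row, t in zip(Phi, y)) for i in range(k)]
    # Forward elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution.
    alpha = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(A[r][c] * alpha[c] for c in range(r + 1, k))
        alpha[r] = (rhs[r] - tail) / A[r][r]
    return alpha


Phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy design matrix: 3 examples, 2 features
y = [1.0, 2.0, 3.0]                          # toy targets
alpha_ls = ridge_weights(Phi, y, 0.0)        # plain least squares
alpha_ridge = ridge_weights(Phi, y, 1.0)     # ridge-regularized weights
```

Comparing `alpha_ls` with `alpha_ridge` shows the shrinkage effect of the ridge parameter: the regularized weights have a smaller norm, which trades some fit for stability when basis features are strongly correlated.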

In ODE-parametric s-BFN (Queiruga et al., 2021), after training with a large basis, compression is achieved via projection or interpolation onto a smaller basis, requiring only basis-level transformations and no data revisiting or retraining.
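A minimal sketch of compression by interpolation, assuming a piecewise-linear basis on a uniform grid over [0, 1] and a single scalar coefficient per node (standing in for one entry of a weight tensor). The helper names `interp_piecewise_linear` and `compress` are hypothetical; the cited work may use a different projection.

```python
def interp_piecewise_linear(coeffs, t):
    """Evaluate a piecewise-linear function on [0, 1] given uniformly spaced nodal values."""
    n = len(coeffs)
    if n == 1:
        return coeffs[0]
    pos = min(max(t, 0.0), 1.0) * (n - 1)
    left = min(int(pos), n - 2)
    frac = pos - left
    return (1.0 - frac) * coeffs[left] + frac * coeffs[left + 1]


def compress(coeffs, new_size):
    """Re-sample the weight function at the nodes of a coarser uniform grid."""
    if new_size == 1:
        return [interp_piecewise_linear(coeffs, 0.5)]
    return [interp_piecewise_linear(coeffs, k / (new_size - 1)) for k in range(new_size)]


fine = [0.0, 1.0, 2.0, 3.0, 4.0]  # five nodal coefficients from the trained model (toy)
coarse = compress(fine, 3)         # three coefficients, no data or retraining involved
```

The operation touches only the coefficients, which matches the point that no data revisiting is needed. In this toy case the weight function is linear in depth, so the coarse basis reproduces it exactly; in general, interpolation introduces an approximation error that depends on how smooth the trained weights are in depth.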

5. Theoretical Properties

The Bregman centroidal aggregation confers principled theoretical underpinnings. For strictly convex $\phi$, the ensemble centroid is the minimizer of the weighted Bregman objective,

$$\hat{z}_i = \arg\min_z \sum_{j=1}^M \alpha_j D_\phi(z_{ij}, z)$$

ensuring that aggregation aligns with the loss geometry (Dominguez et al., 2 Sep 2025). Analysis reveals additive decompositions of ensemble error in which a diversity term is subtracted, quantifying diversity's negative contribution to ensemble error. PAC-Bayesian C-bounds relate disagreement and Gibbs risk to the risk of the majority-vote predictor, formalizing how increasing diversity can reduce total error in s-BFN aggregation (Dominguez et al., 2 Sep 2025).
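For squared loss, the negative contribution of diversity can be checked numerically with the classical ambiguity decomposition: the error of the weighted-mean ensemble equals the weighted average member error minus the weighted spread of members around the ensemble. The sketch below assumes scalar predictions and weights that sum to one; it illustrates this special case only, not the general Bregman decomposition of the cited paper, and the function name is hypothetical.

```python
def ambiguity_decomposition(preds, weights, target):
    """Return (ensemble error, average member error, diversity) under squared loss."""
    ensemble = sum(w * p for w, p in zip(weights, preds))
    ensemble_error = (ensemble - target) ** 2
    avg_member_error = sum(w * (p - target) ** 2 for w, p in zip(weights, preds))
    diversity = sum(w * (p - ensemble) ** 2 for w, p in zip(weights, preds))
    return ensemble_error, avg_member_error, diversity


# Three member predictions, convex weights, and one target (toy values).
err, avg_err, div = ambiguity_decomposition([1.0, 2.0, 4.0], [0.25, 0.5, 0.25], 2.0)
gap = err - (avg_err - div)  # zero up to rounding when the weights sum to one
```

Since `div` is never negative, the ensemble error cannot exceed the average member error, and the gain grows with the spread of the members.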

Basis expansions in continuous-depth s-BFN enforce parameter smoothness, stabilizing high-order integrators, increasing compressibility, and reducing memory footprint (Queiruga et al., 2021).

6. Empirical Benchmarks and Application Domains

Empirical studies demonstrate state-of-the-art or superior performance of s-BFN frameworks across domains:

  • Image Classification: s-BFN with continuous batch normalization achieves 94.4% (CIFAR-10) and 79.9% (CIFAR-100), matching or exceeding deep ResNet baselines at lower parameter counts. Projecting onto a smaller basis reduces parameters by roughly 45% with only a 0.4–0.6% accuracy drop and roughly 30% faster inference (Queiruga et al., 2021).
  • Sequence Tagging: Continuous-in-depth transformers yield POS tagging accuracy of around 98%, with compression onto a smaller basis halving parameters at about 0.3% accuracy loss (Queiruga et al., 2021).
  • Tabular Regression: s-BFN attains the lowest RMSE on “Air Quality” (22.46 vs SVM-RBF 29.83) and “Appliances Energy” (101.12 vs SVM-RBF 104.68) (Dominguez et al., 2023, Dominguez et al., 2 Sep 2025).
  • Ensemble Diversity: Intermediate values of the diversity parameter yield optimal bias-diversity trade-offs; heterogeneous ensembles using s-BFN exhibit better performance and stability than logit averaging or Mixture-of-Experts on image datasets (Dominguez et al., 2 Sep 2025).

7. Practical Advantages and Extensions

s-BFN frameworks provide several practical benefits:

  • Controllable A Posteriori Compression: Basis function projections enable significant parameter and runtime reduction, with negligible accuracy degradation, without retraining (Queiruga et al., 2021).
  • Loss-Consistent Aggregation: Bregman-centric geometry ensures that ensemble predictions are aligned with the task loss, improving calibration and reliability (Dominguez et al., 2 Sep 2025).
  • Stateful Batch Normalization: Continuous-in-depth batch statistics parameterized via basis expansions are trained end-to-end in ODE architectures (Queiruga et al., 2021).
  • Diversity Regularization: The diversity parameter provides explicit control over the specialization/generalization balance in multi-hypothesis and ensemble scenarios (Dominguez et al., 2023, Dominguez et al., 2 Sep 2025).

A plausible implication is that structured feature-space aggregation and basis expansion will underpin future scalable, robust, and easily compressible deep models capable of nontrivial predictive uncertainty handling and efficient deployment (Dominguez et al., 2 Sep 2025).
