Latent-Space Expert Aggregation

Updated 14 April 2026

Latent-space expert aggregation is a technique for integrating multiple expert outputs within a shared latent space that preserves angular and radial properties.
It employs methods like Spherical Barycentric Aggregation and concatenation fusion to maintain anisotropic signals and avoid collapse of expert embeddings.
The approach supports diverse applications including LLM ensembles, recommender systems, and crowdsourced label fusion through structured, geometry-aware integration.

Latent-space expert aggregation refers to the paradigm of learning, combining, or aligning the outputs of multiple expert models within a shared or coordinated latent vector space. This arises in architectures such as mixture-of-experts (MoE), multi-model fusion, crowdsourcing label aggregation, LLM ensembles, multi-behavior recommender systems, and tensor-based expertise mining. Instead of simple output voting or sequential ensembling, latent-space aggregation relies on collaborative encoding and information transfer in deep or structured representation spaces. This enables complementary, geometry-aware, specialization- or domain-sensitive integration of expert signals, often improving performance on heterogeneous, multi-domain, or otherwise stratified tasks.

1. Geometric Foundations and Aggregation Operators

Recent work has shown that standard linear aggregation in MoE embedding models, where expert outputs are summed with learned weights, often violates the geometric structure of expert representations. Empirically, expert output vectors $e_i(x)$ exhibit two properties: (a) radial consistency (norms are tightly concentrated, $\|e_1\|/\|e_2\|\approx 1$) and (b) significant angular separation between the normalized outputs $\{\hat e_i\}$, with typical pairwise angles exceeding $40^\circ$. The expert outputs therefore lie approximately on a shared hypersphere and differ by direction rather than scale.

Linear summation of these outputs,

$z_{\text{lin}} = \sum_{i=1}^K w_i e_i$

induces “inward collapse” toward the interior of the hypersphere: assuming convex weights ($w_i \ge 0$, $\sum_i w_i = 1$) and experts of common radius $r$, the norm $\|z_{\text{lin}}\|$ is strictly smaller than $r$ whenever the experts carrying nonzero weight are not perfectly aligned. This corrupts both the magnitude and the direction of the embeddings and destabilizes tasks such as similarity search or clustering.
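A small numeric check makes the collapse concrete. The sketch below assumes two experts of equal norm $r$ separated by $40^\circ$ and combined with uniform weights; the vectors and helper names are illustrative and are not taken from the cited work.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def linear_aggregate(experts, weights):
    """Weighted sum z_lin = sum_i w_i * e_i, computed coordinate-wise."""
    dim = len(experts[0])
    return [sum(w * e[d] for w, e in zip(weights, experts)) for d in range(dim)]

# Two hypothetical expert outputs with the same radius r, 40 degrees apart.
r = 1.0
theta = math.radians(40.0)
e1 = [r, 0.0]
e2 = [r * math.cos(theta), r * math.sin(theta)]

z_lin = linear_aggregate([e1, e2], [0.5, 0.5])

# For two equal-norm vectors with uniform weights, ||z_lin|| = r * cos(theta / 2),
# which is below r for any nonzero angle.
shrink = norm(z_lin) / r
print(f"norm ratio after linear averaging: {shrink:.4f}")
```

At $40^\circ$ the averaged vector loses roughly 6% of its norm, and the loss grows with the angle between the experts.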

To address these issues, Spherical Barycentric Aggregation (SBA) decouples radial and angular components. Each expert output is decomposed into a radius and a unit direction, $(r_i, \hat u_i)$ with $r_i = \|e_i\|$ and $\hat u_i = e_i / r_i$ (the normalized outputs $\hat e_i$ above). The radii are aggregated as scalars, while the directions are averaged on the hypersphere by weighted geodesic minimization, which yields a unit vector $\hat z$. This prevents collapse and distortion: the aggregated vector $z_{\text{SBA}} = (\sum_i w_i r_i)\,\hat z$ preserves both the hyperspherical structure and the angular separation between semantic directions (Kachuee et al., 15 Feb 2026).
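The radial/angular decoupling can be sketched as follows. This is a minimal illustration, not the cited implementation: it assumes convex weights, nonzero expert outputs, and directions that are not antipodal, and it approximates the weighted geodesic barycenter with a standard iterative Karcher-mean update (log map, weighted tangent average, exp map). All function names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def scale(v, s):
    return [s * x for x in v]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_map(p, q):
    """Tangent vector at unit point p toward unit point q; its length is the geodesic angle."""
    c = max(-1.0, min(1.0, dot(p, q)))
    theta = math.acos(c)
    perp = [qi - c * pi for pi, qi in zip(p, q)]  # part of q orthogonal to p
    n = norm(perp)
    if theta < 1e-12 or n < 1e-12:
        return [0.0] * len(p)
    return scale(perp, theta / n)

def exp_map(p, v):
    """Move from unit point p along tangent vector v, staying on the unit sphere."""
    t = norm(v)
    if t < 1e-12:
        return list(p)
    return add(scale(p, math.cos(t)), scale(v, math.sin(t) / t))

def sba(experts, weights, iters=50):
    """Aggregate radii as scalars and directions as a weighted spherical barycenter."""
    radii = [norm(e) for e in experts]
    dirs = [scale(e, 1.0 / r) for e, r in zip(experts, radii)]

    # Initialize at the normalized weighted sum of directions (assumed nonzero).
    z = [0.0] * len(experts[0])
    for w, u in zip(weights, dirs):
        z = add(z, scale(u, w))
    z = scale(z, 1.0 / norm(z))

    # Karcher-mean iterations: step along the weighted mean of the log maps.
    for _ in range(iters):
        step = [0.0] * len(z)
        for w, u in zip(weights, dirs):
            step = add(step, scale(log_map(z, u), w))
        z = exp_map(z, step)
        z = scale(z, 1.0 / norm(z))  # guard against numerical drift off the sphere

    r = sum(w * ri for w, ri in zip(weights, radii))
    return scale(z, r)

# Two orthogonal experts of radius 2: the SBA output keeps radius 2,
# whereas a plain average would have norm sqrt(2).
z_sba = sba([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [0.5, 0.5])
```

The fixed iteration count is a simplification; a practical version would stop once the tangent step becomes small.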

2. Routing, Fusion, and Anisotropy Preservation

Beyond geometric compatibility, the fusion method applied to expert latent outputs is crucial, especially for heterogeneous experts such as frozen LLMs or independently pre-trained modules. Information-preserving concatenation fusion projects each expert's latent embedding to a shared dimensionality but avoids any direct linear mixing: the joint representation is the concatenation $z = [h_1'; \ldots; h_k']$. Because each expert occupies its own block of coordinates, this preserves the anisotropic structure (“spiky” directions) within each expert's subspace and lets downstream MLPs select among mutually orthogonal signals.

Averaging or scalar-mixed fusion, by contrast, collapses these manifolds: it blends distinct and possibly complementary “principal directions” and incurs destructive interference, which is particularly problematic in multi-lingual or multi-domain settings. Reported results show that concatenation-fused mixtures outperform alternatives in both task-specific AUC (+0.22% over weighted averaging) and computational throughput (13.72 QPS, +9% over dense baselines) (Liu et al., 18 Nov 2025).
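The difference between the two fusion styles can be illustrated with a toy example. The sketch below assumes fixed, hand-written projection matrices and two experts whose projected signals point in opposite directions; the names `concat_fusion` and `average_fusion` are hypothetical and the numbers are chosen only to make the interference visible.

```python
def project(h, W):
    """Linear projection W @ h, with W given as a list of rows."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]

def concat_fusion(latents, projections):
    """Project each expert latent to the shared width, then concatenate: z = [h_1'; ...; h_k']."""
    z = []
    for h, W in zip(latents, projections):
        z.extend(project(h, W))
    return z

def average_fusion(latents, projections):
    """Baseline: project, then average coordinate-wise, which mixes the experts' directions."""
    projected = [project(h, W) for h, W in zip(latents, projections)]
    k = len(projected)
    return [sum(col) / k for col in zip(*projected)]

# Two hypothetical experts with native widths 3 and 2, both projected to width 2.
h1 = [4.0, 0.0, 0.0]   # strong signal along the first shared axis
h2 = [-4.0, 0.0]       # equally strong signal in the opposite direction
W1 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]
W2 = [[1.0, 0.0],
      [0.0, 1.0]]

z_cat = concat_fusion([h1, h2], [W1, W2])   # both signals survive in separate blocks
z_avg = average_fusion([h1, h2], [W1, W2])  # the two signals cancel
```

Concatenation multiplies the width of the fused representation by the number of experts, which is the cost paid for keeping each expert's block intact.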

End-to-end trainable routers dynamically allocate queries to relevant experts, employing softmax-based selection, a top-$k$ cutoff, and load-balancing losses to maintain both efficiency and usage diversity across the pool (Liu et al., 18 Nov 2025, Fein-Ashley et al., 25 Sep 2025).
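A router of this kind can be sketched in a few lines. The cited works do not specify their load-balancing term here, so the example assumes the widely used Switch-Transformer-style auxiliary loss $N \sum_i f_i P_i$, where $f_i$ is the fraction of queries whose top-1 expert is $i$ and $P_i$ is the mean router probability for expert $i$. Function names and logits are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Keep the k largest softmax probabilities and renormalize; the rest become zero."""
    probs = softmax(logits)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def load_balance_loss(batch_logits):
    """Assumed auxiliary loss N * sum_i f_i * P_i over a batch of router logits.

    f_i: fraction of queries whose top-1 expert is i.
    P_i: mean router probability assigned to expert i.
    The loss equals 1.0 under perfectly uniform routing and grows as routing skews.
    """
    n_queries = len(batch_logits)
    n_experts = len(batch_logits[0])
    f = [0.0] * n_experts
    P = [0.0] * n_experts
    for logits in batch_logits:
        probs = softmax(logits)
        f[probs.index(max(probs))] += 1.0 / n_queries
        for i, p in enumerate(probs):
            P[i] += p / n_queries
    return n_experts * sum(fi * pi for fi, pi in zip(f, P))

# One query routed to its two best experts out of four.
gate = top_k_gate([2.0, 0.5, -1.0, 0.0], k=2)

# Balanced versus skewed batches over two experts.
balanced = load_balance_loss([[1.0, 0.0], [0.0, 1.0]])
skewed = load_balance_loss([[1.0, 0.0], [1.0, 0.0]])
```

In training, a term like this would be added to the task loss with a small coefficient so that balancing does not dominate routing quality.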

3. Structured and Stratified Latent Expert Spaces

Latent-space expert aggregation is not restricted to standard MoE. Stratified manifold modeling posits that latent spaces, especially of LLMs, are composed of locally low-dimensional “strata,” each best captured by a different expert. Sparse MoE formulations combine dictionary-learning “experts” with varying sparsity levels. An attention-based soft gating network assigns each input embedding to a mixture of these experts. The support size of each expert determines the modeled local linear dimension, and empirical analysis reveals that data from different semantic domains (e.g., reviews vs. tweets) naturally occupy distinct submanifolds, matched to experts of different intrinsic dimensions (Li et al., 19 Feb 2025). Expert assignment entropy quantifies the sharpness of routing, with lower entropy and harder expert selection in larger models.
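Soft gating and its assignment entropy can be sketched as below. The gate here is a plain softmax over scaled dot products between the input and one key vector per expert, which is an assumed simplification of an attention-based gate; the keys, the temperature parameter, and the function names are hypothetical.

```python
import math

def soft_gate(x, expert_keys, temperature=1.0):
    """Softmax over scaled dot products between the input and one key per expert."""
    scores = [sum(a * b for a, b in zip(x, key)) / temperature for key in expert_keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def assignment_entropy(gate):
    """Shannon entropy (in nats) of a routing distribution; 0 means a hard assignment."""
    return -sum(p * math.log(p) for p in gate if p > 0.0)

# Three hypothetical expert keys in a 2-D latent space.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [3.0, 0.0]

sharp = soft_gate(x, keys, temperature=0.5)   # concentrates on the first expert
soft = soft_gate(x, keys, temperature=10.0)   # spreads mass across experts

h_sharp = assignment_entropy(sharp)
h_soft = assignment_entropy(soft)
```

Lower entropy corresponds to the harder expert selection described above; the maximum for $K$ experts is $\log K$.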

Visual inspection supports this interpretation: in t-SNE and PCA projections of embeddings colored by assigned expert, semantically similar inputs cluster in the same stratum, indicating that expert specialization aligns with domain variation.

4. Supervised, Self-Supervised, and Gated Factor Aggregation

Latent expert aggregation is leveraged in recommendation systems to decompose user/item representations into a collection of independent latent factors, each modeled by a specialist expert. Gating networks, commonly implemented as noisy top-$k$ softmax selectors, determine which subset of experts is “open” for each user, affording an adaptive, per-user combination of factors (Yan et al., 19 Mar 2026). To ensure interpretability and prevent factor entanglement, self-supervised contrastive objectives enforce both within-expert consistency and between-expert independence: InfoNCE-style losses penalize overlap between different experts’ subspaces.
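The contrastive objective can be illustrated with a generic InfoNCE loss. This is a sketch of the standard formulation, not the exact objective of the cited work: it assumes that the anchor and positive are two views of the same expert's factor embedding, that the negatives are embeddings from other experts, and that similarity is cosine similarity with a temperature. All vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(s(a,p)/t) / (exp(s(a,p)/t) + sum_n exp(s(a,n)/t)) ), via log-sum-exp."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

anchor = [1.0, 0.0]
positive = [0.9, 0.1]                 # same expert, slightly perturbed view

# Other experts occupy clearly different directions: low loss.
loss_disentangled = info_nce(anchor, positive, [[0.0, 1.0], [-1.0, 0.0]])

# Another expert nearly coincides with the anchor: overlap is penalized.
loss_entangled = info_nce(anchor, positive, [[1.0, 0.05]])
```

Minimizing this loss pulls views of the same factor together while pushing different experts' embeddings apart, which is the consistency and independence behavior described above.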

Multi-behavior data is incorporated via LightGCN layers aggregated across behavioral graphs; the resultant embeddings are fed to the expert network. This leads to more accurate and interpretable representations, where each user typically selects only a few latent factors (experts), reflecting true preference subspaces.
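The graph stage can be sketched as follows. The propagation rule is the standard LightGCN one (symmetrically normalized neighbor sums with no feature transform, followed by a uniform average over layers). How the cited work combines the behavior graphs is not stated here, so the unweighted mean in `aggregate_behaviors` is an assumption, and the function names and toy graphs are hypothetical.

```python
import math

def lightgcn_user_embeddings(edges, user_emb, item_emb, num_layers=2):
    """LightGCN propagation on one user-item graph; returns layer-averaged user embeddings.

    edges: list of (user_index, item_index) interactions for a single behavior.
    """
    n_users, n_items = len(user_emb), len(item_emb)
    dim = len(user_emb[0])
    deg_u = [0] * n_users
    deg_i = [0] * n_items
    for u, i in edges:
        deg_u[u] += 1
        deg_i[i] += 1

    users = [row[:] for row in user_emb]
    items = [row[:] for row in item_emb]
    total = [row[:] for row in user_emb]  # running sum over layers 0..num_layers
    for _ in range(num_layers):
        new_users = [[0.0] * dim for _ in range(n_users)]
        new_items = [[0.0] * dim for _ in range(n_items)]
        for u, i in edges:
            c = 1.0 / math.sqrt(deg_u[u] * deg_i[i])  # symmetric normalization
            for d in range(dim):
                new_users[u][d] += c * items[i][d]
                new_items[i][d] += c * users[u][d]
        users, items = new_users, new_items
        for u in range(n_users):
            for d in range(dim):
                total[u][d] += users[u][d]
    return [[x / (num_layers + 1) for x in row] for row in total]

def aggregate_behaviors(per_behavior_user_embs):
    """Assumed aggregation: unweighted mean over behaviors (e.g. view, cart, buy)."""
    n_b = len(per_behavior_user_embs)
    n_users = len(per_behavior_user_embs[0])
    dim = len(per_behavior_user_embs[0][0])
    return [[sum(b[u][d] for b in per_behavior_user_embs) / n_b for d in range(dim)]
            for u in range(n_users)]

# Two users, two items, and two hypothetical behavior graphs.
user_emb = [[1.0, 0.0], [0.0, 1.0]]
item_emb = [[0.5, 0.5], [1.0, -1.0]]
view_edges = [(0, 0), (0, 1), (1, 1)]
buy_edges = [(0, 0), (1, 1)]

fused = aggregate_behaviors([
    lightgcn_user_embeddings(view_edges, user_emb, item_emb),
    lightgcn_user_embeddings(buy_edges, user_emb, item_emb),
])
```

The fused user embeddings would then be the input to the gated expert network.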

5. Crowdsourced Label Fusion and Graph Aggregation

In crowdsourcing scenarios, the goal is to recover ground-truth labels from noisy worker data. Latent-space aggregation approaches model both workers and tasks as nodes in a heterogeneous bipartite graph, where embeddings are updated via cross-type and homogeneous attention-based message passing (Wu et al., 2020). Each worker’s output is incorporated in the latent space rather than via naive majority voting; the aggregation architecture incorporates both local (worker-task) and global (worker-worker/task-task) correlations via message-passing and attention.

The architecture consists of alternating blocks: cross-type MP2 layers attending between workers and tasks, and homogeneous COR layers capturing latent correlations among nodes of the same type. Classification is then performed via a softmax over the latent task vectors. This approach jointly models multi-dimensional worker reliability, inter-task similarity, and output uncertainty in the latent geometric structure.

6. Expert-Guided Latent Spaces and Metric Learning

Expert knowledge can also be injected directly into the learning of the latent space via specially designed constraints. In traffic scenario embedding, objective functions penalize violation of expert-derived structural relationships, such as intersection topology or route similarity. Hierarchical margin losses enforce that scenarios matching on graph and route are closer in latent space than those differing in route, which are in turn closer than those differing in topology (Wurst et al., 2022).

Automatic mining of scenario tuples (anchor, positive, negative, and hard negative) relies on expert-defined attributes, and the training objective combines metric losses with a sparsity-driven reconstruction term. The resulting latent space is quantitatively superior in AUC and cluster accuracy, shows clear visual separation of scenario classes and underlying physical properties, and is robust to ablations of the expert constraints.
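The hierarchical ordering described in this section can be written as two stacked hinge terms. The sketch below is a generic formulation under assumed Euclidean distances and margins `m1` and `m2`; it is not the cited loss, and the argument names `route_neg` (same topology, different route) and `topo_neg` (different topology) are hypothetical labels for the two kinds of negatives.

```python
import math

def dist(a, b):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_margin_loss(anchor, positive, route_neg, topo_neg, m1=0.5, m2=0.5):
    """Hinge terms encouraging d(a, positive) < d(a, route_neg) < d(a, topo_neg).

    positive:  matches the anchor on both graph and route.
    route_neg: same topology as the anchor but a different route.
    topo_neg:  different intersection topology.
    """
    d_pos = dist(anchor, positive)
    d_route = dist(anchor, route_neg)
    d_topo = dist(anchor, topo_neg)
    return max(0.0, d_pos - d_route + m1) + max(0.0, d_route - d_topo + m2)

anchor = [0.0, 0.0]

# Embeddings that already respect the hierarchy incur no loss.
ordered = hierarchical_margin_loss(anchor, [0.1, 0.0], [1.0, 0.0], [2.0, 0.0])

# Swapping the two negatives violates the second constraint.
violated = hierarchical_margin_loss(anchor, [0.1, 0.0], [2.0, 0.0], [1.0, 0.0])
```

In a full objective this term would be summed over mined tuples and combined with the reconstruction term mentioned above.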

7. Tensor Factorization and Hierarchical Expertise Inference

Latent-space aggregation extends to expertise estimation and expert search in knowledge networks. Tensor factorization approaches jointly decompose question $\times$ topic $\times$ vote $\times$ expert tensors and auxiliary matrices derived from site-user and topic-user incidence. Hierarchical group-lasso regularizers (tree-guided learning) exploit topical structures, encouraging group sparsity so that experts aggregate in latent subspaces aligned to topic hierarchies (Huang et al., 2018).

Alternating least squares solves for all latent factors, and expertise scores are derived either from the resulting topic-expert factor matrices or by contracting the tensor over its remaining modes. This yields robust cross-domain expertise estimation that outperforms both reputation-based baselines and baseline matrix/tensor methods, as measured by Precision@10 and MRR.

Latent-space expert aggregation encompasses a cross-cutting set of theoretical and algorithmic innovations. They include geometry-preserving fusion under hyperspherical constraints, anisotropy-aware concatenation, stratified MoE for manifold-union modeling, self-supervised disentanglement of factor experts, attention-based multi-way graph aggregation, and structured tensor decomposition with hierarchical regularization. Each leverages the inductive biases of geometry, topology, domain stratification, and data-driven specialization to enhance the integrative power of expert ensembles in high-dimensional representation spaces.