Coordinate-Guided Mixture of Experts (MoE)

Updated 17 November 2025
  • Coordinate-Guided MoE is an architectural paradigm that leverages spatial, temporal, or feature coordinates to direct data to specialized expert networks for localized, piecewise continuous approximations.
  • It integrates sophisticated gating mechanisms with EM or gradient-based optimization to efficiently perform sparse, high-dimensional regression and reconstruction tasks.
  • Recent innovations like Cartesian product routing and manager conditioning improve expert sharing, reduce expert starvation, and enhance model interpretability and reconstruction quality.

Coordinate-Guided Mixture of Experts (MoE) is an architectural paradigm within expert models wherein the routing or gating mechanism leverages input coordinates—such as spatial location, time sample, or feature vector—to direct each sample to specialized sub-networks (experts). This approach enables the model to learn localized, piecewise continuous functions and to perform sparse, high-dimensional regression or reconstruction tasks efficiently. Key developments include the application in implicit neural representations, coordinated gating for feature selection, and the recent innovation of Cartesian product routing for knowledge sharing.

1. Model Architectures and Mathematical Formulations

The coordinate-guided MoE framework generalizes the conventional MoE by explicitly using the input coordinate for gating and routing. In one canonical instantiation for regression with $K$ experts (Chamroukhi et al., 2018), given a dataset $D = \{(x_i, y_i)\}_{i=1}^n$ with inputs $x_i \in \mathbb{R}^p$:

  • Gating function: parameters $w = (w_{k0}, w_k)_{k=1}^{K-1}$; softmax gating probabilities

$$\pi_k(x_i; w) = \frac{\exp\{w_{k0} + x_i^T w_k\}}{1 + \sum_{l=1}^{K-1} \exp\{w_{l0} + x_i^T w_l\}}$$

with $\pi_K(x_i; w) = 1 - \sum_{k=1}^{K-1} \pi_k(x_i; w)$.

  • Expert function: for each $k$, a Gaussian regression

$$f_k(y_i \mid x_i; \beta_k, \sigma_k^2) = \mathcal{N}(y_i;\, \beta_{k0} + x_i^T \beta_k,\, \sigma_k^2)$$

  • Marginal Model:

$$f(y_i \mid x_i; \theta) = \sum_{k=1}^K \pi_k(x_i; w)\, f_k(y_i \mid x_i; \beta_k, \sigma_k^2)$$

where $\theta = (w, \{\beta_k, \sigma_k^2\}_{k=1}^K)$.
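
A minimal NumPy sketch of this softmax-gated mixture density is given below. The function name and argument layout (moe_density, gate parameters w0/W, expert parameters beta0/B/sigma2) are illustrative choices for this article, not notation from the cited paper.

import numpy as np
from scipy.stats import norm

def moe_density(x, y, w0, W, beta0, B, sigma2):
    """Evaluate f(y | x; theta) for a softmax-gated MoE regression model.
    x: (p,) input coordinate; y: scalar target.
    w0: (K-1,), W: (K-1, p)  gating parameters (class K is the reference class).
    beta0: (K,), B: (K, p)   expert regression parameters; sigma2: (K,) noise variances."""
    # Gating probabilities pi_k(x; w): softmax over K-1 linear scores plus a zero reference score
    logits = np.concatenate([w0 + W @ x, [0.0]])
    logits -= logits.max()                              # numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()
    # Expert densities N(y; beta_k0 + x^T beta_k, sigma_k^2)
    f_k = norm.pdf(y, loc=beta0 + B @ x, scale=np.sqrt(sigma2))
    # Marginal mixture density f(y | x; theta)
    return float(pi @ f_k)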

For neural implicit representation tasks (Ben-Shabat et al., 29 Oct 2024), the input coordinate $x \in \mathbb{R}^d$ passes through:

  • A shared expert encoder $\Phi_e^E(x; \theta_e^E) \rightarrow h_e \in \mathbb{R}^H$,
  • $N$ expert networks, each $\Phi_e^{(i)}(h_e; \theta_e^{(i)}) \rightarrow f_i(x) \in \mathbb{R}^c$,
  • A manager (gating) network with manager encoder $\Phi_m^E(x; \theta_m^E) \rightarrow h_m \in \mathbb{R}^H$ and routing MLP $\Phi_m([h_m; h_e]; \theta_m) \rightarrow u \in \mathbb{R}^N$.

The gating softmax yields expert weights:

$$\alpha_i(x) = \frac{\exp(u_i(x))}{\sum_{j=1}^N \exp(u_j(x))}, \quad i = 1, \ldots, N$$

Final prediction is a soft mixture during training:

$$\hat{y}(x) = \sum_{i=1}^N \alpha_i(x) \cdot f_i(x)$$

and hard-routing during inference:

$$j^* = \arg\max_i \alpha_i(x), \quad \hat{y}(x) = f_{j^*}(x)$$
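
The forward pass of such a coordinate-guided MoE can be sketched in PyTorch as follows; the layer widths, activations, and module names are placeholders rather than the exact Neural Experts architecture.

import torch
import torch.nn as nn

class CoordMoE(nn.Module):
    """Shared expert encoder, N expert heads, and a manager conditioned on [h_m; h_e]."""
    def __init__(self, d_in=2, hidden=128, n_experts=4, d_out=3):
        super().__init__()
        self.expert_encoder = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                            nn.Linear(hidden, hidden), nn.ReLU())
        self.experts = nn.ModuleList(nn.Linear(hidden, d_out) for _ in range(n_experts))
        self.manager_encoder = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU())
        self.manager = nn.Linear(2 * hidden, n_experts)   # routing MLP on [h_m; h_e]

    def forward(self, x, hard=False):
        h_e = self.expert_encoder(x)                      # shared expert encoding
        h_m = self.manager_encoder(x)                     # manager encoding of x
        alpha = torch.softmax(self.manager(torch.cat([h_m, h_e], dim=-1)), dim=-1)
        f = torch.stack([expert(h_e) for expert in self.experts], dim=1)  # (B, N, d_out)
        if hard:                                          # inference: route to argmax expert
            j = alpha.argmax(dim=-1)
            return f[torch.arange(f.shape[0]), j]
        return (alpha.unsqueeze(-1) * f).sum(dim=1)       # training: soft mixture

Calling forward(x) reproduces the soft mixture used during training, while forward(x, hard=True) applies the argmax routing used at inference.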

2. Training Regimes and Algorithmic Details

Coordinate-guided MoE models frequently rely on expectation-maximization (EM) strategies for classical regression (Chamroukhi et al., 2018) and on gradient-based optimizers for neural architectures (Ben-Shabat et al., 29 Oct 2024). Regularized variants introduce sparsity into the gating and expert parameters via $\ell_1$ and elastic-net penalties:

Regularized Penalized Log-Likelihood (for regression MoE)

$$PL(\theta) = L(\theta) - \sum_{k=1}^K \lambda_k \|\beta_k\|_1 - \sum_{k=1}^{K-1} \gamma_k \|w_k\|_1 - \frac{\rho}{2}\sum_{k=1}^{K-1} \|w_k\|_2^2$$

Coordinate-ascent EM (abbreviated algorithm):

# E-step: responsibilities τ_ik
for i in range(n):
    for k in range(K):
        τ_ik = π_k(x_i; w) * N(y_i; β_k0 + x_i.T @ β_k, σ_k**2)
    τ_i ← τ_i / Σ_l τ_il          # normalize over experts
# M-step: gating parameters (coordinate-wise Newton-Raphson)
for k in range(K-1):
    for j in range(p + 1):        # intercept and covariates
        w_kj ← one-dimensional Newton-Raphson maximization of Q(w_kj)
# M-step: expert parameters (coordinate-wise soft-thresholding)
for k in range(K):
    for j in range(p):
        β_kj ← soft-thresholded update for each coordinate
    β_k0 ← weighted mean update
    σ_k^2 ← weighted variance update

For neural experts (Ben-Shabat et al., 29 Oct 2024), the Adam optimizer is employed with task-specific scheduling:

  • Images/audio: lr = 1e-5 with exponential decay.
  • Surfaces: lr = 5e-3, decayed by a factor of 0.9999 per iteration.

A two-stage schedule trains all parameters jointly for the first 80% of iterations, then freezes the gate and encoders and fine-tunes the experts alone for the remaining 20%.
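
A compact PyTorch sketch of this schedule follows. The exponential-decay scheduler and the prefix-based identification of expert parameters are implementation assumptions made here for illustration, not details fixed by the paper.

import torch

def make_optimizer(model, lr=1e-5, gamma=0.9999):
    # Adam with per-iteration exponential decay, mirroring the settings quoted above
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=gamma)
    return opt, sched

def maybe_freeze_gate(model, it, total_iters, expert_prefix="experts"):
    # After 80% of the iterations, freeze manager and encoders; fine-tune expert heads only
    if it == int(0.8 * total_iters):
        for name, p in model.named_parameters():
            p.requires_grad_(name.startswith(expert_prefix))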

3. Conditioning, Pretraining, and Expert Utilization

An essential innovation in coordinate-guided MoE is manager conditioning (Ben-Shabat et al., 29 Oct 2024). The gating network receives both $h_m$ (the manager encoding of $x$) and $h_e$ (the shared expert encoding), concatenated to form the input to the routing MLP. Empirical ablations indicate that this concatenation outperforms the alternatives (no conditioning, or pooling).

To address expert starvation (i.e., some experts remaining unused), the gating network is pretrained on a random, balanced segmentation:

$$L_\mathrm{seg} = \mathbb{E}_x \Big[ -\sum_{i=1}^N 1_{y_\mathrm{seg}(x)=i} \log \alpha_i(x) \Big]$$

where $y_\mathrm{seg}(x)$ is a random balanced assignment. This pretraining ensures uniform expert utilization at initialization.

Once the gating network is pretrained, the model transitions to the standard reconstruction loss:

$$L_\mathrm{Recon\text{-}MoE} = \frac{1}{N}\, \mathbb{E}_x \Big[ \sum_{i=1}^N \alpha_i(x)\, \|f_i(x) - y_\mathrm{gt}(x)\|^2 \Big]$$
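
Both objectives are straightforward to express in code. The sketch below assumes batched tensors alpha (B, N), per-expert outputs f (B, N, C), targets y_gt (B, C), and integer assignments y_seg (B,); these names and shapes are chosen here for illustration.

import torch
import torch.nn.functional as F

def gate_pretrain_loss(alpha, y_seg):
    # Cross-entropy of gate weights against the random, balanced expert assignment (L_seg)
    return F.nll_loss(torch.log(alpha + 1e-12), y_seg)

def recon_moe_loss(alpha, f, y_gt):
    # Gate-weighted squared reconstruction error, averaged over the N experts (L_Recon-MoE)
    err = ((f - y_gt.unsqueeze(1)) ** 2).sum(dim=-1)   # ||f_i(x) - y_gt(x)||^2, shape (B, N)
    return (alpha * err).sum(dim=-1).mean() / alpha.shape[-1]

A balanced random assignment for the pretraining stage can be generated, for example, as y_seg = torch.randperm(B) % N.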

4. Quantitative Performance and Application Domains

Coordinate-guided MoE models achieve notable improvements across several tasks.

| Task | Baseline | MoE Variant | Metrics |
|---|---|---|---|
| Image (Kodak 24) | Base MLP (99K) | Neural Experts (366K) | PSNR: 57.23 dB → 89.35 dB |
| Audio | SIREN | Neural Experts | Bach MSE: 0.71 → 0.12; Two Speakers: 2.06 → 0.16 |
| Surface SDF | SIREN Large (1.5M) | Neural Experts Large (1.3M) | Trimap-IoU: 0.6662 → 0.8180; Chamfer: 5.40 → 5.09 |

Qualitative results reveal sharper image edges, interpretable expert segmentations, and superior detail in surface and audio reconstructions. Convergence curves show Neural Experts outperform MLP-based INRs, achieving high PSNR more than $10\times$ faster.

5. Specialized Routing via Cartesian Product and Knowledge Sharing

Recent work on CartesianMoE (Su et al., 21 Oct 2024) advances the routing paradigm by introducing “multiplicative” knowledge sharing. Here, the expert space is factored into two sets (“A” and “B”), each with its own router. The final gating over the $e^2$ composite experts is formed as the product of the two routers' softmax outputs:

$$G_{ij}(t) = r^1_i(t) \cdot r^2_j(t)$$

for input $h^t$. This supports distributed knowledge among composite experts, scales efficiently ($2e$ subnetworks for $e^2$ combinations), and yields empirical improvements in perplexity and downstream accuracy versus top-K and addition-manner MoE. The approach is robust to routing noise and supports extensions to higher-order products (e.g., three routers $\rightarrow e^3$ composite experts).
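
The product gating itself reduces to an outer product of the two routers' softmax outputs. A minimal sketch, assuming the routers are linear maps from the hidden state to e logits each:

import torch

def cartesian_gate(h, router_a, router_b):
    # Routers map the token hidden state h (B, d) to e logits each
    r1 = torch.softmax(router_a(h), dim=-1)        # (B, e)
    r2 = torch.softmax(router_b(h), dim=-1)        # (B, e)
    # G[b, i, j] = r1[b, i] * r2[b, j]; entries over all (i, j) sum to 1
    return r1.unsqueeze(-1) * r2.unsqueeze(-2)     # (B, e, e) gate over e^2 composite experts

Composite expert $(i, j)$ is then realized by combining sub-network $i$ from set A with sub-network $j$ from set B, so only $2e$ sub-networks are stored for $e^2$ gated combinations.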

6. Limitations, Scalability, and Interpretability

Key limitations of coordinate-guided MoE models include increased training cost (all experts evaluated per sample, though inference is efficient), spectral bias with SoftPlus activations, and incompatibility with models lacking per-sample loss (e.g., standard NeRF). Semantic segmentation as additional supervision improves interpretability but does not enhance reconstruction accuracy or convergence.

A plausible implication is that, for large $N$ or extremely high-dimensional data, computational and memory overheads may demand specialized parallel strategies or low-rank expert representations. Empirical findings suggest that the locality and sparsity induced by coordinate-guided routing benefit both generalization and resource usage, particularly when compared to global MLPs or fully dense expert models.

7. Model Selection and Hyperparameter Tuning

Optimal regularization parameters are selected via a modified Bayesian Information Criterion (BIC):

$$\mathrm{BIC}(K, \lambda, \gamma) = L(\hat{\theta}) - \frac{\log n}{2}\, \mathrm{DF}(\lambda, \gamma)$$

where $\mathrm{DF}(\lambda, \gamma)$ counts the nonzero coefficients. Typical grid choices set $\lambda_k, \gamma_k \sim O(\sqrt{n})$ and $\rho \approx 0.1 \log n$.
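
The criterion is simple to compute once a fit is available. In the sketch below, theta_hat is assumed to be an iterable of fitted coefficient arrays, and the nonzero-count tolerance is an illustrative choice.

import numpy as np

def modified_bic(log_lik, n, theta_hat, tol=1e-8):
    # DF = number of effectively nonzero coefficients across all parameter blocks
    df = sum(int(np.count_nonzero(np.abs(np.asarray(block)) > tol)) for block in theta_hat)
    return log_lik - 0.5 * np.log(n) * df

The $(K, \lambda, \gamma)$ configuration maximizing this criterion over the grid is retained.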

Coordinate-ascent EM scales robustly to moderate dimensions. For high-dimensional problems, proximal-Newton updates for gating yield further acceleration.


Coordinate-guided MoE frameworks establish a rigorous methodology for partitioning function approximation, enhancing both accuracy and efficiency in heterogeneous data modeling. By explicitly leveraging spatial, temporal, or feature-space coordinates for expert assignment, these architectures extend the expressive power of expert models and introduce principled mechanisms for gate conditioning, segmentation, and multiplicative knowledge sharing. Their application in INRs, high-dimensional regression, and scalable transformers demonstrates versatility—while continuing research addresses optimization scalability and model interpretability.
