Coordinate-Guided Mixture of Experts (MoE)

Updated 17 November 2025
  • Coordinate-Guided MoE is an architectural paradigm that leverages spatial, temporal, or feature coordinates to direct data to specialized expert networks for localized, piecewise continuous approximations.
  • It integrates sophisticated gating mechanisms with EM or gradient-based optimization to efficiently perform sparse, high-dimensional regression and reconstruction tasks.
  • Recent innovations like Cartesian product routing and manager conditioning improve expert sharing, reduce expert starvation, and enhance model interpretability and reconstruction quality.

Coordinate-Guided Mixture of Experts (MoE) is an architectural paradigm within expert models wherein the routing or gating mechanism leverages input coordinates—such as spatial location, time sample, or feature vector—to direct each sample to specialized sub-networks (experts). This approach enables the model to learn localized, piecewise continuous functions and to perform sparse, high-dimensional regression or reconstruction tasks efficiently. Key developments include the application in implicit neural representations, coordinated gating for feature selection, and the recent innovation of Cartesian product routing for knowledge sharing.

1. Model Architectures and Mathematical Formulations

The coordinate-guided MoE framework generalizes the conventional MoE by explicitly using the input coordinate for gating and routing. In one canonical instantiation for regression with $K$ experts (Chamroukhi et al., 2018), given a dataset $D = \{(x_i, y_i)\}_{i=1}^n$ with inputs $x_i \in \mathbb{R}^p$:

  • Gating function: parameters $w = (w_{k0}, w_k)_{k=1}^{K-1}$; softmax gating probabilities

$$\pi_k(x_i; w) = \frac{\exp\{w_{k0} + x_i^T w_k\}}{1 + \sum_{l=1}^{K-1} \exp\{w_{l0} + x_i^T w_l\}}$$

with $\pi_K(x_i; w) = 1 - \sum_{k=1}^{K-1} \pi_k(x_i; w)$.

  • Expert function: for each $k$, a Gaussian regression

$$f_k(y_i \mid x_i; \beta_k, \sigma_k^2) = \mathcal{N}(y_i;\, \beta_{k0} + x_i^T \beta_k,\, \sigma_k^2)$$

  • Marginal Model:

$$f(y_i \mid x_i; \theta) = \sum_{k=1}^K \pi_k(x_i; w)\, f_k(y_i \mid x_i; \beta_k, \sigma_k^2)$$

where $\theta = (w, \{\beta_k, \sigma_k^2\}_{k=1}^K)$.
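
A minimal NumPy sketch of this softmax-gated mixture density is given below. The function name and argument layout (moe_density, gate parameters w0/W, expert parameters beta0/B/sigma2) are illustrative choices for this article, not notation from the cited paper.

import numpy as np
from scipy.stats import norm

def moe_density(x, y, w0, W, beta0, B, sigma2):
    """Evaluate f(y | x; theta) for a softmax-gated MoE regression model.
    x: (p,) input coordinate; y: scalar target.
    w0: (K-1,), W: (K-1, p)  gating parameters (class K is the reference class).
    beta0: (K,), B: (K, p)   expert regression parameters; sigma2: (K,) noise variances."""
    # Gating probabilities pi_k(x; w): softmax over K-1 linear scores plus a zero reference score
    logits = np.concatenate([w0 + W @ x, [0.0]])
    logits -= logits.max()                              # numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()
    # Expert densities N(y; beta_k0 + x^T beta_k, sigma_k^2)
    f_k = norm.pdf(y, loc=beta0 + B @ x, scale=np.sqrt(sigma2))
    # Marginal mixture density f(y | x; theta)
    return float(pi @ f_k)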

For neural implicit representation tasks (Ben-Shabat et al., 29 Oct 2024), the input coordinate $x \in \mathbb{R}^d$ passes through:

  • A shared expert encoder $\Phi_e^E(x; \theta_e^E) \rightarrow h_e \in \mathbb{R}^H$,
  • $N$ expert networks, each $\Phi_e^{(i)}(h_e; \theta_e^{(i)}) \rightarrow f_i(x) \in \mathbb{R}^c$,
  • A manager (gating) network with manager encoder $\Phi_m^E(x; \theta_m^E) \rightarrow h_m \in \mathbb{R}^H$ and routing MLP $\Phi_m([h_m; h_e]; \theta_m) \rightarrow u \in \mathbb{R}^N$.

The gating softmax yields expert weights:

$$\alpha_i(x) = \frac{\exp(u_i(x))}{\sum_{j=1}^N \exp(u_j(x))}, \quad i = 1, \ldots, N$$

Final prediction is a soft mixture during training:

$$\hat{y}(x) = \sum_{i=1}^N \alpha_i(x) \cdot f_i(x)$$

and hard-routing during inference:

$$j^* = \arg\max_i \alpha_i(x), \quad \hat{y}(x) = f_{j^*}(x)$$
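
The forward pass of such a coordinate-guided MoE can be sketched in PyTorch as follows; the layer widths, activations, and module names are placeholders rather than the exact Neural Experts architecture.

import torch
import torch.nn as nn

class CoordMoE(nn.Module):
    """Shared expert encoder, N expert heads, and a manager conditioned on [h_m; h_e]."""
    def __init__(self, d_in=2, hidden=128, n_experts=4, d_out=3):
        super().__init__()
        self.expert_encoder = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                            nn.Linear(hidden, hidden), nn.ReLU())
        self.experts = nn.ModuleList(nn.Linear(hidden, d_out) for _ in range(n_experts))
        self.manager_encoder = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU())
        self.manager = nn.Linear(2 * hidden, n_experts)   # routing MLP on [h_m; h_e]

    def forward(self, x, hard=False):
        h_e = self.expert_encoder(x)                      # shared expert encoding
        h_m = self.manager_encoder(x)                     # manager encoding of x
        alpha = torch.softmax(self.manager(torch.cat([h_m, h_e], dim=-1)), dim=-1)
        f = torch.stack([expert(h_e) for expert in self.experts], dim=1)  # (B, N, d_out)
        if hard:                                          # inference: route to argmax expert
            j = alpha.argmax(dim=-1)
            return f[torch.arange(f.shape[0]), j]
        return (alpha.unsqueeze(-1) * f).sum(dim=1)       # training: soft mixture

Calling forward(x) reproduces the soft mixture used during training, while forward(x, hard=True) applies the argmax routing used at inference.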

2. Training Regimes and Algorithmic Details

Coordinate-guided MoE models frequently rely on expectation-maximization (EM) strategies for classical regression (Chamroukhi et al., 2018) and on gradient-based optimizers for neural architectures (Ben-Shabat et al., 29 Oct 2024). Regularized variants introduce sparsity into the gating and expert parameters via $\ell_1$ and elastic-net penalties:

Regularized Penalized Log-Likelihood (for regression MoE)

$$PL(\theta) = L(\theta) - \sum_{k=1}^K \lambda_k \|\beta_k\|_1 - \sum_{k=1}^{K-1} \gamma_k \|w_k\|_1 - \frac{\rho}{2}\sum_{k=1}^{K-1} \|w_k\|_2^2$$

Coordinate-ascent EM (abbreviated algorithm):

# E-step: responsibilities τ_ik
for i in range(n):
    for k in range(K):
        τ_ik = π_k(x_i; w) * N(y_i; β_k0 + x_i.T @ β_k, σ_k**2)
    τ_i ← τ_i / Σ_l τ_il          # normalize over experts
# M-step: gating parameters (coordinate-wise Newton-Raphson)
for k in range(K-1):
    for j in range(p + 1):        # intercept and covariates
        w_kj ← one-dimensional Newton-Raphson maximization of Q(w_kj)
# M-step: expert parameters (coordinate-wise soft-thresholding)
for k in range(K):
    for j in range(p):
        β_kj ← soft-thresholded update for each coordinate
    β_k0 ← weighted mean update
    σ_k^2 ← weighted variance update

For neural experts (Ben-Shabat et al., 29 Oct 2024), the Adam optimizer is employed with task-specific scheduling:

  • Images/audio: lr = 1e-5 with exponential decay.
  • Surfaces: lr = 5e-3, decayed by a factor of 0.9999 per iteration.

A two-stage schedule trains all parameters jointly for the first 80% of iterations, then freezes the gate and encoders and fine-tunes the experts alone for the remaining 20%.
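
A compact PyTorch sketch of this schedule follows. The exponential-decay scheduler and the prefix-based identification of expert parameters are implementation assumptions made here for illustration, not details fixed by the paper.

import torch

def make_optimizer(model, lr=1e-5, gamma=0.9999):
    # Adam with per-iteration exponential decay, mirroring the settings quoted above
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=gamma)
    return opt, sched

def maybe_freeze_gate(model, it, total_iters, expert_prefix="experts"):
    # After 80% of the iterations, freeze manager and encoders; fine-tune expert heads only
    if it == int(0.8 * total_iters):
        for name, p in model.named_parameters():
            p.requires_grad_(name.startswith(expert_prefix))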

3. Conditioning, Pretraining, and Expert Utilization

An essential innovation in coordinate-guided MoE is manager conditioning (Ben-Shabat et al., 29 Oct 2024). The gating network receives both $h_m$ (the manager encoding of $x$) and $h_e$ (the shared expert encoding), concatenated to form the input to the routing MLP. Empirical ablations indicate that this concatenation outperforms the alternatives (no conditioning, or pooling).

To address expert starvation (i.e., some experts remaining unused), the gating network is pretrained on a random, balanced segmentation:

$$L_\mathrm{seg} = \mathbb{E}_x \Big[ -\sum_{i=1}^N 1_{y_\mathrm{seg}(x)=i} \log \alpha_i(x) \Big]$$

where $y_\mathrm{seg}(x)$ is a random balanced assignment. This pretraining ensures uniform expert utilization at initialization.

Once the gating network is pretrained, the model transitions to the standard reconstruction loss:

$$L_\mathrm{Recon\text{-}MoE} = \frac{1}{N}\, \mathbb{E}_x \Big[ \sum_{i=1}^N \alpha_i(x)\, \|f_i(x) - y_\mathrm{gt}(x)\|^2 \Big]$$
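
Both objectives are straightforward to express in code. The sketch below assumes batched tensors alpha (B, N), per-expert outputs f (B, N, C), targets y_gt (B, C), and integer assignments y_seg (B,); these names and shapes are chosen here for illustration.

import torch
import torch.nn.functional as F

def gate_pretrain_loss(alpha, y_seg):
    # Cross-entropy of gate weights against the random, balanced expert assignment (L_seg)
    return F.nll_loss(torch.log(alpha + 1e-12), y_seg)

def recon_moe_loss(alpha, f, y_gt):
    # Gate-weighted squared reconstruction error, averaged over the N experts (L_Recon-MoE)
    err = ((f - y_gt.unsqueeze(1)) ** 2).sum(dim=-1)   # ||f_i(x) - y_gt(x)||^2, shape (B, N)
    return (alpha * err).sum(dim=-1).mean() / alpha.shape[-1]

A balanced random assignment for the pretraining stage can be generated, for example, as y_seg = torch.randperm(B) % N.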

4. Quantitative Performance and Application Domains

Coordinate-guided MoE models achieve notable improvements across several tasks.

| Task | Baseline | MoE Variant | Metrics |
|---|---|---|---|
| Image (Kodak 24) | Base MLP (99K) | Neural Experts (366K) | PSNR: 57.23 dB → 89.35 dB |
| Audio | SIREN | Neural Experts | Bach MSE: 0.71 → 0.12; Two Speakers: 2.06 → 0.16 |
| Surface SDF | SIREN Large (1.5M) | Neural Experts Large (1.3M) | Trimap-IoU: 0.6662 → 0.8180; Chamfer: 5.40 → 5.09 |

Qualitative results reveal sharper image edges, interpretable expert segmentations, and superior detail in surface and audio reconstructions. Convergence curves show Neural Experts outperform MLP-based INRs, achieving high PSNR more than $10\times$ faster.

5. Specialized Routing via Cartesian Product and Knowledge Sharing

Recent work on CartesianMoE (Su et al., 21 Oct 2024) advances the routing paradigm by introducing “multiplicative” knowledge sharing. Here, the expert space is factored into two sets (“A” and “B”), each with its own router. The final gating over the $e^2$ composite experts is formed as the product of the two routers' softmax outputs:

$$G_{ij}(t) = r^1_i(t) \cdot r^2_j(t)$$

for input $h^t$. This supports distributed knowledge among composite experts, scales efficiently ($2e$ subnetworks for $e^2$ combinations), and yields empirical improvements in perplexity and downstream accuracy versus top-K and addition-manner MoE. The approach is robust to routing noise and supports extensions to higher-order products (e.g., three routers $\rightarrow e^3$ composite experts).
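
The product gating itself reduces to an outer product of the two routers' softmax outputs. A minimal sketch, assuming the routers are linear maps from the hidden state to e logits each:

import torch

def cartesian_gate(h, router_a, router_b):
    # Routers map the token hidden state h (B, d) to e logits each
    r1 = torch.softmax(router_a(h), dim=-1)        # (B, e)
    r2 = torch.softmax(router_b(h), dim=-1)        # (B, e)
    # G[b, i, j] = r1[b, i] * r2[b, j]; entries over all (i, j) sum to 1
    return r1.unsqueeze(-1) * r2.unsqueeze(-2)     # (B, e, e) gate over e^2 composite experts

Composite expert $(i, j)$ is then realized by combining sub-network $i$ from set A with sub-network $j$ from set B, so only $2e$ sub-networks are stored for $e^2$ gated combinations.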

6. Limitations, Scalability, and Interpretability

Key limitations of coordinate-guided MoE models include increased training cost (all experts evaluated per sample, though inference is efficient), spectral bias with SoftPlus activations, and incompatibility with models lacking per-sample loss (e.g., standard NeRF). Semantic segmentation as additional supervision improves interpretability but does not enhance reconstruction accuracy or convergence.

A plausible implication is that, for large $N$ or extremely high-dimensional data, computational and memory overheads may demand specialized parallel strategies or low-rank expert representations. Empirical findings suggest that the locality and sparsity induced by coordinate-guided routing benefit both generalization and resource usage, particularly when compared to global MLPs or fully dense expert models.

7. Model Selection and Hyperparameter Tuning

Optimal regularization parameters are selected via a modified Bayesian Information Criterion (BIC):

$$\mathrm{BIC}(K, \lambda, \gamma) = L(\hat{\theta}) - \frac{\log n}{2}\, \mathrm{DF}(\lambda, \gamma)$$

where $\mathrm{DF}(\lambda, \gamma)$ counts the nonzero coefficients. Typical grid choices set $\lambda_k, \gamma_k \sim O(\sqrt{n})$ and $\rho \approx 0.1 \log n$.
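
The criterion is simple to compute once a fit is available. In the sketch below, theta_hat is assumed to be an iterable of fitted coefficient arrays, and the nonzero-count tolerance is an illustrative choice.

import numpy as np

def modified_bic(log_lik, n, theta_hat, tol=1e-8):
    # DF = number of effectively nonzero coefficients across all parameter blocks
    df = sum(int(np.count_nonzero(np.abs(np.asarray(block)) > tol)) for block in theta_hat)
    return log_lik - 0.5 * np.log(n) * df

The $(K, \lambda, \gamma)$ configuration maximizing this criterion over the grid is retained.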

Coordinate-ascent EM scales robustly to moderate dimensions. For high-dimensional problems, proximal-Newton updates for gating yield further acceleration.


Coordinate-guided MoE frameworks establish a rigorous methodology for partitioning function approximation, enhancing both accuracy and efficiency in heterogeneous data modeling. By explicitly leveraging spatial, temporal, or feature-space coordinates for expert assignment, these architectures extend the expressive power of expert models and introduce principled mechanisms for gate conditioning, segmentation, and multiplicative knowledge sharing. Their application in INRs, high-dimensional regression, and scalable transformers demonstrates versatility—while continuing research addresses optimization scalability and model interpretability.
