Condition Modulated Expert (CoMoE) Systems

Updated 4 July 2026

Condition Modulated Expert (CoMoE) is a framework where external conditions, such as SNR, task embeddings, or latent variables, inform expert routing and computation.
It dynamically assigns specialized subnetworks to different input regimes, enabling tailored processing through methods like soft gating and top-K routing.
CoMoE applications span domains like ECG analysis, image generation, and robot co-design, demonstrating improved efficiency and performance via modular specialization.

Condition Modulated Expert (CoMoE) denotes a class of mixture-of-experts architectures in which an auxiliary condition modulates expert selection, expert computation, or both. In the literature represented here, the condition may be an estimated Signal-to-Noise Ratio (SNR), a task embedding, a global signal summary, cardiac periodic structure, textual context from an LLM, a robot’s latent genotype, or a condition-type embedding in controllable image generation. The common design principle is to avoid a one-size-fits-all model by allocating distinct subnetworks to distinct regimes of the input-condition space. The acronym is not fully standardized: in controllable image generation, CoMoE is explicitly used for “Condition Modulated Expert,” whereas in parameter-efficient fine-tuning it also names a different method, “Contrastive Representation for Mixture-of-Experts,” whose main contribution is contrastive expert specialization rather than external-condition modulation (Zhang et al., 24 Aug 2025, Feng et al., 23 May 2025).

1. Definition, scope, and nomenclature

A condition-modulated expert system differs from a conventional MoE in where the routing signal comes from and how the condition enters the model. Standard MoE formulations typically gate experts from the current hidden representation or input token. By contrast, the systems described here use a condition variable that is externally meaningful for the task: SNR in automatic modulation classification, task identity and periodic structure in ECG analysis, embodiment coordinates in robot co-design, or text context in multi-modal forecasting (Gao et al., 2023, Xu et al., 4 Mar 2026, Wang et al., 22 May 2026, Zhang et al., 29 Jan 2026).

This family is not architecturally uniform. Some instances use soft routing over experts, some use sparse top-$K$ activation, and some modulate not only routing but also the internal computation of each expert. In MoME for multi-modal time series prediction, textual context changes both router scores and expert outputs; in ECG-MoE, the routing is softmax-based and task-conditioned; in ECo-MoE, the gate is driven by latent embodiment coordinates rather than token states; and in UniGen, CoMoE routes semantically similar patch tokens to condition-aware expert modules (Zhang et al., 29 Jan 2026, Xu et al., 4 Mar 2026, Wang et al., 22 May 2026, Zhang et al., 24 Aug 2025).

A recurrent source of ambiguity is the acronym itself. “CoMoE” in the PEFT paper on contrastive modularization refers to “Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning,” not to condition modulation. That work is nonetheless relevant because it addresses expert specialization, redundancy, and modularity, which are central concerns for condition-aware expert systems as well (Feng et al., 23 May 2025).

2. Core computational pattern

The simplest condition-modulated pattern is a gated weighted combination of specialized experts. In MoE-AMC, a gating MLP outputs $y_{high}$, the probability that an input belongs to the high-SNR category, and the final prediction is

$y_{final} = y_{high} \cdot y_{hsnr} + (1 - y_{high}) \cdot y_{lsnr},$

where $y_{hsnr}$ and $y_{lsnr}$ are outputs from the high-SNR and low-SNR experts, respectively. This avoids a non-differentiable threshold rule while still expressing a condition-dependent division of labor (Gao et al., 2023).
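The soft combination above can be sketched in a few lines of plain Python. This is a minimal illustration, not the MoE-AMC implementation: the function names are hypothetical, and the sigmoid that turns a gate logit into $y_{high}$ is an assumption about how the gating MLP produces a probability.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; squashes a gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def soft_snr_mixture(gate_logit: float, y_hsnr: list[float], y_lsnr: list[float]) -> list[float]:
    """Compute y_final = y_high * y_hsnr + (1 - y_high) * y_lsnr per class.

    gate_logit stands in for the gating MLP's pre-activation output
    (assumed; the source only states that the MLP outputs y_high).
    """
    y_high = sigmoid(gate_logit)
    return [y_high * hi + (1.0 - y_high) * lo for hi, lo in zip(y_hsnr, y_lsnr)]


# Toy class-probability vectors from the two experts (made-up numbers).
y_hsnr = [0.7, 0.2, 0.1]
y_lsnr = [0.1, 0.3, 0.6]

# A gate logit of 0 gives y_high = 0.5, i.e. an even blend of both experts.
y_final = soft_snr_mixture(0.0, y_hsnr, y_lsnr)
print(y_final)
```

Because the blend is a convex combination, the result remains a valid probability vector whenever both expert outputs are, and gradients flow to both experts and to the gate.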

A more general form appears in ECG-MoE. Its periodic expert gate is

$\mathbf{g}_p = \mathrm{softmax}\left(\mathbf{U}_p [\bar{\mathbf{X}} \oplus \mathbf{e}_t]\right),$

where $\bar{\mathbf{X}}$ is a global summary of the ECG signal and $\mathbf{e}_t$ is a task embedding. Expert outputs are then combined as

$\mathbf{h}_p = \sum_{i=1}^{N} g_{p,i} \, E_i(\cdot).$

Here the condition is jointly task-conditioned and signal-conditioned, and the model distinguishes between experts for intra-beat morphology and inter-beat rhythm (Xu et al., 4 Mar 2026).
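As a rough sketch of the two ECG-MoE equations above, the gate can be written as a softmax over a linear map of the concatenated condition, followed by a weighted sum of expert outputs. The shapes, the toy numbers, and the helper names are assumptions for illustration; in the paper the experts $E_i$ are CNNs, whereas here their outputs are simply given as vectors.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def periodic_gate(U_p: list[list[float]], x_bar: list[float], e_t: list[float]) -> list[float]:
    """g_p = softmax(U_p [x_bar (+) e_t]); list concatenation plays the role of (+)."""
    cond = x_bar + e_t
    scores = [sum(w * c for w, c in zip(row, cond)) for row in U_p]
    return softmax(scores)


def mix_experts(gate: list[float], expert_outputs: list[list[float]]) -> list[float]:
    """h_p = sum_i g_{p,i} * E_i(.), with expert outputs precomputed."""
    dim = len(expert_outputs[0])
    return [sum(g * out[d] for g, out in zip(gate, expert_outputs)) for d in range(dim)]


# Toy setup: 3 experts, a 2-d signal summary, and a 2-d task embedding (all made up).
U_p = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
x_bar = [0.5, -1.0]
e_t = [1.0, 0.0]
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

g_p = periodic_gate(U_p, x_bar, e_t)
h_p = mix_experts(g_p, expert_outputs)
print(g_p, h_p)
```

Changing only `e_t` while holding `x_bar` fixed shifts the gate, which is the sense in which the same signal can be processed differently for different tasks.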

MoME makes the conditioning mechanism more explicit by separating router modulation from expert modulation. For a time-series token $\mathbf{x}_p$, the router scores are conditioned on the textual context representation, and each expert’s output is in turn modulated by that same context.

The final output is a Top-$K$ weighted sum of these modulated experts. This is a direct example of the condition affecting both expert choice and expert behavior, rather than being fused only at the token level (Zhang et al., 29 Jan 2026).
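The two-stage pattern described here (context shifts the router, then context modulates each selected expert) can be sketched as follows. The functional forms are assumptions chosen for illustration only: an additive context bias on router scores and a FiLM-style scale-and-shift on expert outputs. They are not taken from the MoME paper, and all names are hypothetical.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(scores: list[float], k: int) -> list[int]:
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def modulated_moe(token_scores, context_bias, expert_outputs, scale, shift, k):
    """Top-K mixture with context-modulated routing and expert outputs.

    Router modulation: scores = token_scores + context_bias   (assumed additive).
    Expert modulation: scale * E_i(x) + shift                  (assumed FiLM-style).
    """
    scores = [s + b for s, b in zip(token_scores, context_bias)]
    chosen = top_k_indices(scores, k)
    weights = softmax([scores[i] for i in chosen])
    dim = len(expert_outputs[0])
    out = [0.0] * dim
    for w, i in zip(weights, chosen):
        for d in range(dim):
            out[d] += w * (scale[d] * expert_outputs[i][d] + shift[d])
    return chosen, out


# Toy values: 4 experts with 2-d outputs (made-up numbers).
token_scores = [2.0, 1.0, 0.0, -1.0]
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
identity_scale, zero_shift = [1.0, 1.0], [0.0, 0.0]

# Without context, experts 0 and 1 are selected.
chosen_plain, out_plain = modulated_moe(
    token_scores, [0.0, 0.0, 0.0, 0.0], expert_outputs, identity_scale, zero_shift, k=2)

# A context bias toward expert 2 changes which experts are active.
chosen_ctx, out_ctx = modulated_moe(
    token_scores, [0.0, 0.0, 3.0, 0.0], expert_outputs, identity_scale, zero_shift, k=2)
print(chosen_plain, chosen_ctx)
```

The second call shows why router modulation is the more disruptive of the two mechanisms: a context-driven shift in scores can replace members of the active expert set outright, whereas expert modulation leaves the selection untouched.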

In embodiment-conditioned control, the same pattern appears with a different conditioning source. ECo-MoE computes its gate weights from the latent genotype rather than from token states, and the policy mixes the expert outputs under those weights. The conditioning variable is therefore the robot’s embodiment itself, not the current sensory input alone (Wang et al., 22 May 2026).

In image generation, UniGen’s CoMoE first predicts expert scores from fused global and conditional features, then restores token order after expert processing using the routing index. This formulation emphasizes semantic token grouping and condition-aware expert assignment at the patch level (Zhang et al., 24 Aug 2025).
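The group-then-restore step can be illustrated with a small permutation sketch. It assumes each token already has an expert id (in UniGen these would come from the predicted expert scores); the toy experts and all names are hypothetical and only show how a routing index lets outputs be scattered back to their original positions.

```python
from typing import Callable


def route_and_restore(
    tokens: list[float],
    expert_ids: list[int],
    experts: list[Callable[[float], float]],
) -> tuple[list[int], list[float]]:
    """Group tokens by expert, process each group, then undo the grouping.

    The routing index `order` lists original token positions so that tokens
    assigned to the same expert are contiguous.
    """
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    processed = [experts[expert_ids[i]](tokens[i]) for i in order]

    # Scatter results back: slot j of the grouped batch belongs at position order[j].
    restored = [0.0] * len(tokens)
    for slot, original_pos in enumerate(order):
        restored[original_pos] = processed[slot]
    return order, restored


# Two toy "experts" acting on scalar tokens (stand-ins for expert modules).
experts = [lambda x: x * 10.0, lambda x: -x]
tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [1, 0, 1, 0]  # assumed given; e.g. argmax of predicted expert scores

order, restored = route_and_restore(tokens, expert_ids, experts)
print(order)     # [1, 3, 0, 2]
print(restored)  # [-1.0, 20.0, -3.0, 40.0]
```

Keeping the routing index around is what allows the expert layer to batch tokens by expert without disturbing the spatial layout that later layers rely on.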

3. Conditioning signals and domain-specific specialization

The condition signal varies substantially by application, and the specialization structure follows the domain prior.

System	Domain	Conditioning signal
MoE-AMC	Automatic modulation classification	Estimated high-SNR vs low-SNR regime
ECG-MoE	ECG foundation modeling	Task embedding, global ECG summary, cardiac period structure
ECo-MoE	Robot co-design	Latent genotype (embodiment coordinates)
MoME	Multi-modal time series prediction	Textual context from an LLM
UniGen CoMoE	Controllable image generation	Condition-type embedding, prompt embedding, condition-image features

In MoE-AMC, the conditioning variable is SNR. The model uses a ResNet-based HSRM for high-SNR signals and a Transformer-based LSRM for low-SNR signals. The rationale given is that high-SNR inputs expose cleaner local structures, which favors CNN/ResNet feature extraction, whereas low-SNR inputs benefit from self-attention’s ability to model global dependencies and suppress irrelevant noise-induced variation (Gao et al., 2023).

ECG-MoE uses both explicit and implicit physiological conditioning. Explicitly, the gate depends on a task embedding $\mathbf{e}_t$ and a global ECG summary $\bar{\mathbf{X}}$. Implicitly, R-peak detection segments the ECG into beats, creating a cardiac period prior that separates within-cycle morphology from inter-beat rhythm. The periodic expert network therefore contains three CNN experts with different kernel sizes for beat morphology and two dilated CNNs for rhythm patterns (Xu et al., 4 Mar 2026).

ECo-MoE conditions expert usage on embodiment coordinates in latent design space. The genotype is sampled from that latent space, decoded into a body plan, and used directly by the gate. Different body plans therefore activate different subsets of learned sensorimotor circuits. The authors frame this as a middle ground between a monolithic universal controller and training a separate controller for every robot (Wang et al., 22 May 2026).

MoME uses textual context as the conditioning variable for multi-modal time series prediction. A pretrained LLM encodes the text, learnable queries distill it into a compact context representation, and that representation modulates routing and expert computation. This shifts the role of text from token-level fusion to direct functional control over the MoE backbone (Zhang et al., 29 Jan 2026).

UniGen’s CoMoE uses multiple condition-linked signals: condition-image features, a prompt embedding, a pooled condition embedding, and a pooled prompt embedding. Foreground patches with high feature similarity are grouped and routed to dedicated experts, while background regions are handled by a separate expert module. This suggests a condition-modulated expert design can also serve as a mechanism for disentangling condition-specific foreground processing from shared global structure (Zhang et al., 24 Aug 2025).

4. Modularity, specialization, and the PEFT connection

The PEFT paper titled CoMoE addresses a related but distinct problem: MoE-style adapters may add expert capacity without using it effectively on heterogeneous or multi-task data because experts drift toward similar representations. The paper identifies two failures, knowledge redundancy and load imbalance, and proposes a contrastive objective that distinguishes activated from inactivated experts under top-$K$ routing (Feng et al., 23 May 2025).

Its MoE output is a router-weighted combination of LoRA-style experts. Under top-$K$ routing, the selected experts are treated as activated experts and the others as inactivated experts. The paper defines a mutual-information (MI) gap between these two groups with respect to the input, and estimates it with an InfoNCE-style contrastive objective. Activated experts are pulled closer to the input representation, while inactivated experts are pushed away. The intended effect is modularization: activated experts become more predictive of the current input, while inactive experts avoid redundant overlap (Feng et al., 23 May 2025).
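To make the pull/push behavior concrete, here is a generic InfoNCE-style loss in plain Python, with the input representation as the query, activated-expert representations as positives, and inactivated-expert representations as negatives. This is a textbook form under assumed dot-product similarity and an assumed temperature; it is not the paper's exact objective, and the function names are hypothetical.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(
    query: list[float],
    positives: list[list[float]],
    negatives: list[list[float]],
    temperature: float = 0.1,
) -> float:
    """Average InfoNCE loss over positives.

    For each positive p: -log( exp(q.p / t) / sum over all candidates of exp(q.c / t) ).
    Minimizing it raises similarity to positives and lowers it to negatives.
    """
    pos = [math.exp(dot(query, p) / temperature) for p in positives]
    neg = [math.exp(dot(query, n) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Toy 2-d representations (made-up values).
x = [1.0, 0.0]               # input representation
aligned_expert = [1.0, 0.0]  # representation that matches the input
other_expert = [0.0, 1.0]    # representation orthogonal to the input

# Low loss when the activated expert matches the input...
loss_good = info_nce(x, positives=[aligned_expert], negatives=[other_expert])
# ...and high loss when the roles are swapped.
loss_bad = info_nce(x, positives=[other_expert], negatives=[aligned_expert])
print(loss_good, loss_bad)
```

The denominator sums over every negative, which is why the cost of such an objective depends on how many negative samples are used.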

This PEFT formulation is not condition modulation in the narrow sense of routing by an external variable such as SNR or task metadata. However, it addresses a foundational issue for all expert systems: whether the expert pool actually becomes diverse, specialized, and well utilized. A plausible implication is that contrastive modularization and condition-aware routing are complementary. One manages the semantics of expert separation; the other determines which expert should dominate under which condition.

Experimentally, the PEFT CoMoE is instantiated as CoMoE-LoRA and CoMoE-DoRA on LLaMA-2 7B and Gemma 2B, usually with 4 experts and top-2 routing at a fixed adapter rank. The paper also tunes a further hyperparameter over a small candidate set, and reports that the strongest results come from applying CoMoE in lower layers or in both low and high layers (Feng et al., 23 May 2025).

5. Reported empirical behavior across domains

In automatic modulation classification, MoE-AMC is evaluated on RML2018.01a, which contains 24 modulation categories, 2,555,904 samples, and 26 SNR levels from $-20$ dB to 30 dB in 2 dB steps. With a 70/10/20 train-validation-test split, batch size 1024, 500 epochs, Adam, and a single NVIDIA Tesla A100 80GB, the model achieves 71.76% average classification accuracy across SNR levels. The reported baselines include MCFormer at 61.77%, LSTM at 62.51%, PET-CGDNN at 61.53%, MCLDNN at 61.86%, FEA-T at 61.81%, LSRM at 52.30%, and HSRM at 66.05%. The paper attributes the main gain to low-SNR improvement and notes that no extra SNR labels or auxiliary losses are required (Gao et al., 2023).
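The dataset figures quoted above are mutually consistent, which a short arithmetic check makes visible. The sketch uses only the numbers stated in the paragraph; the variable names are illustrative.

```python
# Figures quoted for RML2018.01a in the text above.
num_levels, top_db, step_db = 26, 30, 2
total_samples, num_classes = 2_555_904, 24

# 26 levels in 2 dB steps ending at 30 dB imply the lowest level.
lowest_db = top_db - step_db * (num_levels - 1)
snr_grid = list(range(lowest_db, top_db + 1, step_db))

# Samples per (modulation, SNR) cell, if the dataset is evenly balanced.
per_cell, remainder = divmod(total_samples, num_classes * num_levels)

print(lowest_db, len(snr_grid))  # -20 26
print(per_cell, remainder)       # 4096 0
```

The zero remainder indicates that the total divides evenly into 4096 samples for each of the 624 modulation and SNR combinations.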

In ECG analysis, ECG-MoE is evaluated on MIMIC-IV-ECG with 800,035 ECGs from 161,352 patients and five downstream tasks: RR interval estimation, age estimation, sex classification, potassium abnormality prediction, and arrhythmia detection. Reported results include RR interval MAE 76.37, age MAE 12.83, sex F1 0.69, potassium abnormality F1 0.57, and arrhythmia accuracy 0.73. The abstract states 40% faster inference than multi-task baselines, and the experiments additionally report 8.2 GB GPU memory, 14.7 samples/sec, roughly 3× faster than real-time, and 35% reduction in resource consumption (Xu et al., 4 Mar 2026).

In robot co-design, ECo-MoE is tested on Flat Ground, Upright Locomotion, and Potholes. It performs about the same as the universal-controller baseline on Flat Ground, but improves evolvability on Upright Locomotion and Potholes. The paper emphasizes that the effect is not primarily broader exploration of latent space; PCA traces show comparable path length and exploration breadth. Instead, the modular controller appears to improve exploitation, buffer deleterious mutations, and preserve ancestral knowledge. Parameter counts are 2.563M for the baseline policy and 2.388M for ECo-MoE with 4 experts, while the critic is identical at 2.038M (Wang et al., 22 May 2026).

In multi-modal time series prediction, MoME is reported as best or near-best across most tasks on MTBench and TimeMMD, and it also improves multiple uni-modal backbones. The paper further reports that expert modulation outperforms token-level fusion on most tasks, uses less memory than cross-attention fusion, and trains faster. Component ablations indicate that EiLM consistently improves performance, while Router Modulation can help but may also disrupt specialization by changing the Top-$K$ set too aggressively (Zhang et al., 29 Jan 2026).

In controllable image generation, UniGen reports strong results on Subjects-200K and MultiGen-20M across 12 condition types. On Subjects-200K, the reported means are SSIM 0.48, FID 12.15, CLIP-I 87.82, CLIP-T 20.09, and DINO 92.54. On MultiGen-20M, the reported means are SSIM 0.53, FID 10.57, CLIP-I 84.33, CLIP-T 19.51, and DINO 91.84. The complexity comparison reports, for 3 conditions, 6.03B parameters and 58.12 inference time for ControlNet, 11.93B and 13.06 for OminiControl, and 4.1B and 6.82 for UniGen; for 12 conditions, 17.38B and 59.16 for ControlNet, 12.07B and 15.74 for OminiControl, and 4.69B and 13.96 for UniGen (Zhang et al., 24 Aug 2025).

For the PEFT CoMoE, the main multi-task result is a CoMoE-LoRA average accuracy of 76.2, described as about +1.3 over the strongest baseline shown there. In single-task settings, the paper reports competitiveness with LoRA, DoRA, and MixLoRA, while using roughly 50% fewer tunable parameters than some stronger LoRA baselines. Visualizations indicate that, without contrastive loss, expert usage concentrates on just a couple of experts, whereas with the contrastive term the workload becomes more distributed and task-expert combinations become clearer, even without an explicit routing balance loss (Feng et al., 23 May 2025).

6. Misconceptions, limitations, and open distinctions

A common misconception is to treat all CoMoE-like systems as sparse token-level MoE layers. The surveyed systems are more heterogeneous. ECG-MoE is explicitly described as not a classic token-level sparse MoE, but rather a task-conditioned and cardiac-period-conditioned routing system with softmax weighting. ECo-MoE is conditioned on embodiment coordinates rather than hidden activations. UniGen’s CoMoE groups semantically similar patches and restores them to the original token order after expert processing. The condition-modulated expert idea is therefore broader than standard top-$K$ token routing (Xu et al., 4 Mar 2026, Wang et al., 22 May 2026, Zhang et al., 24 Aug 2025).

Another misconception is that condition-modulated systems necessarily require explicit supervision for the condition variable. MoE-AMC states that no extra SNR labels or special auxiliary losses are required; the gating behavior emerges from standard cross-entropy training on modulation labels alone. By contrast, other systems use explicit condition representations such as task embeddings, condition-type embeddings, or latent genotype vectors (Gao et al., 2023, Xu et al., 4 Mar 2026, Wang et al., 22 May 2026, Zhang et al., 24 Aug 2025).

The limitations are similarly domain-specific. MoE-AMC assumes that SNR-relevant structure can be inferred from the input signal and does not provide a detailed ablation on soft versus hard routing behavior. ECG-MoE is demonstrated on one benchmark dataset and its broader generalization is suggested rather than exhaustively proven. ECo-MoE depends on a VAE latent manifold that only partially reconstructs some demo robots, is evaluated only on simulated terrestrial locomotion, and can suffer expert collapse without routing-diversity regularization. MoME reports that Router Modulation can cause expert collapse or overly concentrated routing. The PEFT CoMoE notes that contrastive training cost grows with the number of negative samples. UniGen reports that a vanilla MoE gives only a small improvement over the baseline, whereas its Modulated Expert plus RoPE and Shared Expert materially improve FID, indicating that ordinary expertization alone may be insufficient for sparse visual control (Zhang et al., 29 Jan 2026, Feng et al., 23 May 2025, Zhang et al., 24 Aug 2025, Wang et al., 22 May 2026).

Taken together, these works indicate that Condition Modulated Expert is best understood not as a single fixed architecture but as a design principle for structured conditional computation: the model identifies a meaningful condition, uses it to modulate expert allocation or transformation, and thereby attempts to improve specialization, balance, robustness, or controllability. The precise mechanism (soft gating, sparse top-$K$, hierarchical fusion, embodiment-conditioned mixing, or contrastive expert separation) depends on the domain and on what form of conditional heterogeneity the model is designed to capture (Gao et al., 2023, Zhang et al., 29 Jan 2026, Xu et al., 4 Mar 2026, Wang et al., 22 May 2026, Zhang et al., 24 Aug 2025, Feng et al., 23 May 2025).