SoCE: Composing Systems with Category Experts
- SoCE is a framework that leverages distinct category experts to optimize systems through specialized, complementary contributions.
- It formalizes optimization using mathematical models across crowdsourced knowledge, neural model mixtures, and non-uniform LLM souping.
- Empirical results demonstrate consistent performance gains, efficient parameter averaging, and robust adaptability in diverse applications.
Soup of Category Experts (SoCE) denotes a family of principled methods for composing systems from specialist entities corresponding to distinct categories, whether users in collaborative knowledge settings, neural modules for category-specific prediction, or parameter sets representing per-category expertise. SoCE aims to optimize system-level performance by judiciously combining outputs, resources, or model weights from these category experts, leveraging specialization, inter-category complementarity, and explicit category decomposition. SoCE methodologies appear across theoretical, algorithmic, and empirical research in collaborative crowdsourcing, machine learning (parameter averaging and Mixture-of-Experts), and LLM systems (non-uniform model souping).
1. Category Decomposition and the SoCE Principle
A core premise of SoCE is that performance on complex, heterogeneous tasks or knowledge-building endeavors is maximized by mixing together multiple "category experts," each specialized for a subset, stratum, or category of the global problem. Categories are mutually exclusive and collectively exhaustive partitions (e.g., $K$ categories of data, $k$ user roles, or $C$ benchmark categories).
- In knowledge-building, users are partitioned into $k$ categories, each specializing in a type of contribution.
- In machine learning model composition, a model's parameters are decomposed into a shared base $\theta_0$ plus expert parameter blocks $\theta_1, \dots, \theta_K$, each associated with a category or cluster (Ablin et al., 3 Feb 2025).
- In LLM post-training, SoCE identifies model checkpoints that are "specialists" on weakly-correlated benchmark categories and combines them using optimized non-uniform averaging (Maiti et al., 17 Nov 2025).
The SoCE principle exploits the diversity and complementarity inherent in such decompositions to yield total system performance that exceeds naïve uniform mixing.
2. Formalization: Mathematical and Algorithmic Foundations
Several formalisms of SoCE are represented across domains:
a) Crowdsourced Knowledge Building
Let $n_i$ be the count of users in category $i$ ($i = 1, \dots, k$). User contributions are measured as knowledge units (KUs). A triggering matrix $T = [t_{ij}]$ encodes cross-category stimulation rates, where $t_{ij}$ is the average number of KUs in category $j$ triggered by one KU from category $i$. Each user in category $i$ independently yields $\mu_i$ internal KUs.
Under a linear-triggering, steady-state assumption (Chhabra et al., 2015), the per-category totals satisfy $x = \mu \odot n + T^{\top} x$, which yields the vector of total KUs per category as $x = (I - T^{\top})^{-1}(\mu \odot n)$, with $n = (n_1, \dots, n_k)$ the user allocation and $\mu \odot n$ the elementwise product of internal rates and user counts. The grand total $F(n) = \sum_{j} x_j$ is then maximized over $n$ subject to $\sum_i n_i = N$.
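The steady-state computation reduces to a small linear solve. Below is a minimal Python sketch under the notation above; the function name, the toy triggering matrix, and the internal rates are illustrative assumptions, not values from (Chhabra et al., 2015).

```python
import numpy as np

# n[i]  : number of users assigned to category i
# mu[i] : internal KUs produced per user in category i
# T[i,j]: average KUs in category j triggered by one KU from category i

def total_knowledge(n, mu, T):
    """Solve x = mu*n + T.T @ x for the per-category KU totals x,
    valid when the spectral radius of T is below 1."""
    n, mu, T = np.asarray(n, float), np.asarray(mu, float), np.asarray(T, float)
    internal = mu * n                                   # internally generated KUs per category
    x = np.linalg.solve(np.eye(len(n)) - T.T, internal)
    return x, x.sum()                                   # per-category totals and grand total F(n)

# Toy example: 3 categories, 100 users, assumed triggering rates
T = np.array([[0.0, 0.2, 0.1],
              [0.1, 0.0, 0.3],
              [0.2, 0.1, 0.0]])
x, F = total_knowledge(n=[39, 30, 31], mu=[10.0, 12.0, 8.0], T=T)
print(x, F)
```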
b) Neural Model Mixtures and Parameter Soups
For language modeling, SoCE parameterizes the model instantiated for a given category mixture $h$ as $\theta(h) = \theta_0 + \sum_{k=1}^{K} \alpha_k(h)\,\theta_k$,
where $\alpha(\cdot)$ is a learned function (e.g., a small MLP) of the specialization vector $h$ (a histogram or mixture over the $K$ categories) (Ablin et al., 3 Feb 2025).
The SoCE training objective seeks to minimize the expected loss over category mixtures drawn from a meta-distribution $\Pi$: $\min_{\theta_0, \{\theta_k\}, \alpha}\; \mathbb{E}_{h \sim \Pi}\big[\mathcal{L}(\theta(h); h)\big]$, with $h_k \ge 0$ and $\sum_k h_k = 1$.
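A compact Python sketch of this instantiation follows; the parameter shapes, the tiny two-layer MLP, and the random values are assumptions for illustration rather than the exact architecture of (Ablin et al., 3 Feb 2025).

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, H = 4, 1000, 32                       # categories, flattened parameter dim, MLP width

theta0 = rng.normal(size=D)                 # shared base parameters
experts = rng.normal(size=(K, D)) * 0.01    # per-category expert parameter blocks

# Tiny MLP mapping a specialization histogram h (K,) to coefficients alpha (K,)
W1, b1 = rng.normal(size=(H, K)), np.zeros(H)
W2, b2 = rng.normal(size=(K, H)), np.zeros(K)

def alpha_of(h):
    z = np.tanh(W1 @ h + b1)
    return W2 @ z + b2                      # coefficients are learned end-to-end in practice

def instantiate(h):
    """theta(h) = theta0 + sum_k alpha_k(h) * theta_k (linear parameter aggregation)."""
    a = alpha_of(np.asarray(h, float))
    return theta0 + a @ experts

# "Flash-specialize" for a mixture that is 70% category 0 and 30% category 2
theta_h = instantiate([0.7, 0.0, 0.3, 0.0])
```

At inference this is exactly the flash-specialization path described in Section 3(b): one MLP forward pass followed by a weighted parameter sum.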
c) Weighted Model Soup for LLMs
Given a set of candidate models with parameters $\{\theta_1, \dots, \theta_n\}$ and a per-category performance matrix $P \in \mathbb{R}^{n \times C}$, SoCE solves $w^{\star} = \arg\max_{w \in \Delta} \mathrm{Perf}\big(\sum_{i \in S} w_i \theta_i\big)$ and then forms the final soup as $\theta_{\mathrm{soup}} = \sum_{i \in S} w_i^{\star} \theta_i$, with $S$ selected as the top expert per weakly correlated category cluster (Maiti et al., 17 Nov 2025).
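The weighted-averaging step itself is a parameter-wise sum. A minimal sketch, assuming each checkpoint is represented as a dict of numpy arrays (real checkpoints would be framework state dicts):

```python
import numpy as np

def weighted_soup(checkpoints, weights):
    """theta_soup = sum_i w_i * theta_i, applied parameter-by-parameter."""
    w = np.asarray(weights, float)
    assert np.isclose(w.sum(), 1.0) and (w >= 0).all(), "weights must lie on the simplex"
    return {name: sum(wi * ckpt[name] for wi, ckpt in zip(w, checkpoints))
            for name in checkpoints[0]}

# Example: three cluster experts combined with non-uniform weights
ckpts = [{"layer.weight": np.random.randn(4, 4)} for _ in range(3)]
theta_soup = weighted_soup(ckpts, [0.5, 0.3, 0.2])
```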
3. Algorithms and Implementation Methods
a) Optimization in Knowledge Ecosystems
A hill-climbing algorithm solves for the optimal user-category assignment maximizing $F(n)$. The search iteratively moves individual users to neighboring categories and accepts moves that increase total knowledge, with convergence assured under empirically observed convexity (Chhabra et al., 2015). The cost per full run is modest, scaling with the number of users and the number of single-user moves evaluated per iteration.
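A hedged sketch of this search is below; it reuses total_knowledge() and the toy setup from the Section 2(a) sketch, and the first-improvement acceptance rule is an assumption rather than the exact procedure of (Chhabra et al., 2015).

```python
import numpy as np

def hill_climb(n0, score, max_iters=1000):
    """Greedy local search: move one user between categories whenever the move
    improves score(n); stop at a local optimum."""
    n = np.asarray(n0, int).copy()
    best = score(n)
    for _ in range(max_iters):
        improved = False
        for i in range(len(n)):          # source category
            if n[i] == 0:
                continue
            for j in range(len(n)):      # destination category
                if i == j:
                    continue
                cand = n.copy()
                cand[i] -= 1
                cand[j] += 1
                f = score(cand)
                if f > best:
                    n, best, improved = cand, f, True
        if not improved:
            break
    return n, best

# Objective: grand total of KUs under the linear-triggering model above
def objective(n):
    return total_knowledge(n, mu=[10.0, 12.0, 8.0], T=T)[1]

n_opt, F_opt = hill_climb([34, 33, 33], objective)
```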
b) Specialist Model Instantiation via Parameter Averaging
At inference, given a desired category mix $h$, a single forward pass through the coefficient-prediction MLP yields $\alpha(h)$, followed by linear parameter aggregation. This allows near-instant, per-category specialization ("flash-specialization") within a fixed model budget (Ablin et al., 3 Feb 2025). Pretraining overhead is minor, dominated by updating the experts' Adam states.
c) SoCE for Model Souping
SoCE clusters categories by analyzing the Pearson correlation matrix of per-model, per-category results, $R_{cc'} = \mathrm{corr}(P_{:,c}, P_{:,c'})$. Hierarchical clustering segments the categories; in each cluster, the best model is selected as the cluster expert. Weights over the cluster experts are optimized either via grid search on the simplex (coarse grids suffice because the number of experts is small) or with any convex solver (Maiti et al., 17 Nov 2025).
Algorithmic pseudocode for SoCE in LLM souping is given directly in (Maiti et al., 17 Nov 2025), covering correlation computation, clustering, expert selection, and soup-weight optimization; a sketch of the pipeline follows.
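The sketch below strings these steps together. The distance threshold, grid resolution, and the evaluate_soup callable are placeholders (assumptions), not the settings of (Maiti et al., 17 Nov 2025); weighted_soup from the Section 2(c) sketch can serve as the averaging step inside evaluate_soup.

```python
import numpy as np
from itertools import product
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def soce(P, checkpoints, evaluate_soup, grid=5, dist_threshold=0.7):
    """P: (n_models, n_categories) per-category scores; checkpoints: list of
    parameter dicts; evaluate_soup(ckpts, w): aggregate score of the soup."""
    # 1) Pearson correlation between category columns, turned into a distance.
    R = np.corrcoef(P.T)                                   # (C, C)
    dist = squareform(1.0 - R, checks=False)
    clusters = fcluster(linkage(dist, method="average"),
                        t=dist_threshold, criterion="distance")

    # 2) One expert per cluster: the model with the best mean score on it.
    experts = sorted({int(np.argmax(P[:, clusters == c].mean(axis=1)))
                      for c in np.unique(clusters)})

    # 3) Coarse grid search over the simplex for non-uniform soup weights.
    best_w, best_score = None, -np.inf
    for ticks in product(range(grid + 1), repeat=len(experts)):
        if sum(ticks) != grid:
            continue
        w = np.array(ticks, float) / grid
        s = evaluate_soup([checkpoints[i] for i in experts], w)
        if s > best_score:
            best_w, best_score = w, s
    return experts, best_w, best_score
```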
4. Empirical Results and Benchmarks
a) LLM Model Souping and Zero-Finetune Specialization
On the Berkeley Function Calling Leaderboard (BFCL), SoCE delivers improved performance over all baselines (best individual model, uniform soup, pruned soup) at both 70B and 8B scales. For example, SoCE achieves $80.68$ BFCL accuracy (+2.7\% over the best single model) using the candidate set (xLAM-2-70b-fc-r, CoALM-70B, watt-tool-70B, Functionary-Medium-70B), and $76.50$ for smaller 8B models (Maiti et al., 17 Nov 2025). Similar gains are seen on the MGSM benchmark and ∞-Bench (long context).
Ablation studies show that non-uniform weights are essential, yielding an additional $2$–$3$\% improvement over uniform weights. Pruning to anti-correlated category experts further prevents the regressions seen in "all-candidate" souping.
b) Parameter-Averaged Category Experts
On 16 Pile domains, SoCE instantiation (direct parameter averaging guided by the learned coefficients $\alpha(h)$) yields the lowest loss versus a generic pretrained model and outperforms per-domain fine-tuned/CRISP baselines on language modeling, without requiring fine-tuning for each new domain. Flash-specialist instantiation adds only milliseconds at inference (Ablin et al., 3 Feb 2025).
Ablations confirm robustness to model scale, minimal data needs for good specialization, and the superior effectiveness of dense over low-rank experts at fixed parameter budget.
c) Crowdsourced Environment Optimization
Synthetic experiments with $100$ users and $3$ categories show that the SoCE-optimized distribution ($39/30/31$) yields an increase of roughly $3$\% in total KUs over the $3000$ achieved by a uniform split (Chhabra et al., 2015). This substantiates the thesis that careful "mixing" of user categories, leveraging cross-category triggering, produces a measurable gain.
d) MoE Architectures with Category Hierarchy
Adversarial MoE architectures, when combined with category-driven gating, adversarial regularization (enforcing expert output diversity), and hierarchy-based soft constraints (smoothing expert assignment across closely related categories), yield statistically significant improvements in AUC and ranking metrics for recommendation and search tasks. Gains are particularly pronounced for data-sparse sub-categories (Xiao et al., 2020).
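A schematic PyTorch sketch of these ingredients is given below. The module layout, the cosine-similarity diversity penalty (a stand-in for the adversarial regularizer), and the variance-based hierarchy penalty are illustrative assumptions, not the exact architecture or losses of (Xiao et al., 2020).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoryMoE(nn.Module):
    """MoE layer whose gate is driven by the item's category id."""
    def __init__(self, d_in, d_out, n_experts, n_categories):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_in, d_out) for _ in range(n_experts)])
        self.gate = nn.Embedding(n_categories, n_experts)

    def forward(self, x, cat_ids):
        gates = F.softmax(self.gate(cat_ids), dim=-1)             # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, d_out)
        y = (gates.unsqueeze(-1) * outs).sum(dim=1)               # gated mixture
        # Diversity penalty: discourage experts from producing near-identical outputs.
        unit = F.normalize(outs, dim=-1)
        sim = torch.einsum("bed,bfd->bef", unit, unit)
        div_loss = (sim - torch.eye(sim.size(1), device=sim.device)).clamp(min=0).mean()
        return y, gates, div_loss

def hierarchy_loss(gates, cat_ids, parent_of):
    """Soft constraint: sibling categories (same parent) get similar gate usage."""
    parents = parent_of[cat_ids]
    loss, count = 0.0, 0
    for p in parents.unique():
        g = gates[parents == p]
        if g.size(0) > 1:
            loss = loss + ((g - g.mean(dim=0, keepdim=True)) ** 2).mean()
            count += 1
    return loss / max(count, 1)

# Usage: auxiliary losses are added to the main ranking objective.
moe = CategoryMoE(d_in=16, d_out=8, n_experts=4, n_categories=6)
x, cats = torch.randn(32, 16), torch.randint(0, 6, (32,))
y, gates, div = moe(x, cats)
aux = div + hierarchy_loss(gates, cats, parent_of=torch.tensor([0, 0, 1, 1, 2, 2]))
```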
5. Analysis, Robustness, and Limitations
SoCE methods uniformly exhibit the following characteristics:
- Specialization Retention: Non-uniform weighting allows near-complete retention of per-category expert performance, while recovering a significant fraction of tasks not solved by any individual model through cross-category synergy (Maiti et al., 17 Nov 2025).
- Data Scarcity Resilience: Specialist estimation and mixture weights require minimal data (often a small number of samples suffices for competitive specialist performance in parameter-averaged models) (Ablin et al., 3 Feb 2025), and hierarchy constraints allow experts to borrow strength in low-data regimes (Xiao et al., 2020).
- Optimization Complexity: All SoCE formulations offer tractable optimization (hill-climbing, convex programming, or MLP-based coefficient regression), with minimal additional inference or training overhead compared to baselines.
- Cluster Granularity Sensitivity: Over-clustering (setting category-correlation thresholds too high) can lead to singleton experts and overfitting; clusters that are too coarse dilute specialization (Maiti et al., 17 Nov 2025). Empirically, a moderate correlation threshold balances specialization and generalization.
- Limitations: Applicability depends on available per-category benchmarks and candidate experts (or models), reliance on sufficient data per sub-category to estimate correlations/trigger matrices, and current implementations often assume fixed architecture or initialization across experts.
6. Applications and Practical Guidelines
a) LLMs and Tool-Calling
SoCE is state-of-the-art for model selection and composition in LLMs evaluated on tool-calling, multilingual, and math benchmarks, providing modular flash specialization and robust gains without retraining (Maiti et al., 17 Nov 2025). Deployment is practical, requiring only final model checkpoints and per-category evaluation metrics.
b) Knowledge Ecosystems
In online collaborative systems, SoCE methodology prescribes (i) identification of contribution categories, (ii) empirical estimation of the cross-category triggering rates, (iii) measurement of internal productivity, (iv) hill-climbing optimization, and (v) live steering of the population towards the optimal expert distribution, resulting in maximized collective knowledge contribution (Chhabra et al., 2015).
c) MoE-based Recommender and Ranking Systems
Explicit category-aware gating, adversarial diversity regularization, and hierarchy-based soft constraints comprise an SoCE instantiation for recommender architectures, yielding improved category granularity, performance in long-tail scenarios, and interpretable expert assignments (Xiao et al., 2020).
References:
- (Maiti et al., 17 Nov 2025) Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance.
- (Ablin et al., 3 Feb 2025) Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging.
- (Xiao et al., 2020) Adversarial Mixture Of Experts with Category Hierarchy Soft Constraint.
- (Chhabra et al., 2015) Ideal Composition of a Group for Maximal Knowledge Building in Crowdsourced Environments.