- The paper proves that increasing the granularity of experts exponentially improves expressivity while keeping the number of active parameters constant.
- It develops rigorous proofs for constant, linear, and ReLU activations, linking combinatorial expert configurations to enhanced function approximation.
- Experimental results confirm that a student MoE must match or exceed the teacher's granularity to achieve low test error in practice, even when the student has far more total parameters.
Mixture-of-Experts (MoE) layers are a critical component in LLMs, allowing for massive parameter counts while keeping computational costs manageable by activating only a fraction of parameters per input. A key design parameter for MoE layers is "granularity," defined as the number of experts activated per token. Despite its importance, there is no widespread consensus on the optimal granularity, with frontier models like DeepSeek-V3 using higher granularity (8 active experts) compared to others like Llama-4 (1 active expert). This paper investigates how granularity impacts model expressivity.
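To fix notation, the sketch below (a minimal NumPy illustration with hypothetical shapes, not the paper's exact parameterization) shows a top-k-routed MoE layer: the granularity k controls how many experts, each of width w, are summed per token, so only k·w hidden neurons are active per input.

```python
import numpy as np

def moe_forward(x, router_W, experts_U, experts_V, k):
    """Minimal top-k MoE layer sketch (hypothetical shapes).

    x:         (d,)        input token
    router_W:  (m, d)      one routing vector per expert
    experts_U: (m, w, d)   first-layer weights of each expert MLP
    experts_V: (m, d, w)   second-layer weights of each expert MLP
    k:         granularity = number of experts activated per token
    """
    scores = router_W @ x                      # (m,) routing scores
    active = np.argsort(scores)[-k:]           # indices of the top-k experts
    out = np.zeros_like(x)
    for j in active:
        h = np.maximum(experts_U[j] @ x, 0.0)  # ReLU hidden layer of width w
        out += experts_V[j] @ h                # sum the k active experts
    return out                                 # only k*w hidden neurons were used
```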
The core contribution of this research is a theoretical proof demonstrating that increasing the granularity of an MoE layer exponentially boosts its expressivity, even when the total number of active parameters remains constant. This finding provides theoretical justification for empirical observations, such as those from DeepSeek-V3, where higher granularity has been shown to improve performance.
The paper establishes this expressivity separation by showing that for MoE layers with constant, linear, or ReLU activation functions, an MoE with higher granularity k out of m total experts can compute functions that cannot be well-approximated by an MoE with lower granularity k′ out of m′ total experts, assuming both architectures have the same number of active parameters (kw ≈ k′w′, where w and w′ denote the expert widths) but (m choose k) ≫ (m′ choose k′). The number of possible active-expert configurations, (m choose k), is identified as a key combinatorial quantity governing expressivity; for fixed k, this quantity scales as m^k, highlighting the exponential benefit of granularity.
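A quick back-of-the-envelope comparison (illustrative numbers, not taken from the paper) shows how large the configuration-count gap can be at equal active parameters:

```python
from math import comb

# Two MoE designs with the same number of active parameters per token (k * w held fixed):
k_fine,   w_fine,   m_fine   = 8, 128, 64   # high granularity: 8 narrow experts out of 64
k_coarse, w_coarse, m_coarse = 1, 1024, 8   # low granularity: 1 wide expert out of 8

assert k_fine * w_fine == k_coarse * w_coarse   # equal active neurons per token

print(comb(m_fine, k_fine))      # 4426165368 possible active-expert configurations
print(comb(m_coarse, k_coarse))  # 8 possible active-expert configurations
```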
For constant activation functions (σ(t)=1), the separation is intuitive: the MoE computes piecewise constant functions, and the number of distinct constant values it can output is at most (m choose k). A lower-granularity MoE with the same active parameters but fewer possible configurations simply cannot match the variety of constant outputs.
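The counting argument can be seen in a toy simulation (a hedged sketch with made-up dimensions; for simplicity each expert contributes a scalar constant rather than a constant vector): with σ(t)=1 the output is determined entirely by which set of k experts fires, so the number of distinct outputs is bounded by (m choose k).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
d, m, k = 16, 12, 3
router_W = rng.normal(size=(m, d))
expert_const = rng.normal(size=m)        # with sigma(t)=1 each expert contributes a constant

def constant_moe(x):
    active = np.argsort(router_W @ x)[-k:]
    return expert_const[active].sum()    # depends only on WHICH k experts are active

samples = rng.normal(size=(20000, d))
outputs = {round(constant_moe(x), 9) for x in samples}
print(len(outputs), "distinct outputs; upper bound C(m, k) =", comb(m, k))  # at most 220
```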
For linear activation functions (σ(t)=t), the separation requires experts to have sufficient width (w) and relies on properties of the input distribution (standard Gaussian or uniform over the unit ball). The proof involves constructing routing vectors that partition the input space into regions corresponding to expert configurations, and constructing linear experts (expert j computing M_j x) such that the combined linear map ∑_{j∈S} M_j x is distinct for different active sets S. A key technical insight is that subsets of the input space with large probability mass must have high-rank covariance matrices (Lemma A.6), making it difficult for a single lower-rank linear function to approximate functions defined by higher-rank matrices across multiple such regions.
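This construction can be mimicked numerically (a sketch with random experts, not the paper's explicit construction): on the region routed to a set S, the layer computes the single linear map (∑_{j∈S} M_j) x, whose rank can reach k·w, while a lower-granularity MoE must serve each of its far fewer regions with one fixed linear map.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k, w = 64, 10, 4, 8             # each expert is a rank-w linear map (sigma(t)=t)
routers = rng.normal(size=(m, d))
U = rng.normal(size=(m, w, d))        # expert j computes M_j x with M_j = V_j U_j
V = rng.normal(size=(m, d, w))
M = np.einsum('jdw,jwe->jde', V, U)   # (m, d, d), each M_j has rank at most w

def linear_moe(x):
    S = tuple(sorted(np.argsort(routers @ x)[-k:]))   # active-expert configuration
    return S, sum(M[j] for j in S) @ x                # on this region the layer IS one linear map

x = rng.normal(size=d)
S, _ = linear_moe(x)
print(S, np.linalg.matrix_rank(sum(M[j] for j in S)))  # generically rank k*w = 32 (< d = 64)
```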
For the practically relevant ReLU activation function (σ(t)=max(0,t)), the separation holds under conditions similar to the linear case, plus the additional constraint that the number of active neurons (kw) is less than the input dimension (d). The proof is more involved because ReLU networks are more expressive than linear ones. It analyzes the error incurred when a linear function is approximated by a non-linear function constrained to a lower-dimensional subspace (Lemma A.11). By constructing experts whose contributions sum to linear functions that are "incoherent" across different expert combinations (Lemma A.14), the paper shows that a single ReLU network constrained to a fixed low-dimensional subspace (representing one region of a low-granularity MoE) cannot approximate the function computed by a higher-granularity MoE across multiple regions. The argument combines tools from functional analysis with probabilistic constructions, including "tensorized" arguments over hypergraphs of expert combinations.
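The subspace constraint at the heart of the ReLU argument can be checked numerically (a small illustration with hypothetical dimensions): when the active part of a low-granularity MoE has only kw < d hidden neurons, its output on a region depends on x only through U x, so it is blind to d − kw input directions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, kw = 64, 16                          # active hidden neurons kw < input dimension d
U = rng.normal(size=(kw, d))            # stacked first-layer weights of the active experts
V = rng.normal(size=(d, kw))            # stacked second-layer weights

def active_relu_net(x):
    return V @ np.maximum(U @ x, 0.0)   # output depends on x only through U x

# Any perturbation in the null space of U leaves the output unchanged, so the
# network sees only a kw-dimensional view of the input (cf. Lemma A.11).
x = rng.normal(size=d)
null_dir = np.linalg.svd(U)[2][-1]      # a direction orthogonal to every row of U
print(np.allclose(active_relu_net(x), active_relu_net(x + 5.0 * null_dir)))  # True
```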
The theoretical findings are supported by simple experimental results (Figures 1 and 2). By training student MoE models to learn a teacher MoE model, the experiments show that for the student model to achieve low test error, its granularity must match or exceed that of the teacher model, even when the student model has significantly more total parameters. This demonstrates that granularity is a crucial factor for learnability and expressivity in practice, not just in theory.
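The teacher–student setup can be sketched as follows (a hedged outline with made-up sizes; the paper's actual training loop and hyperparameters differ): a fine-grained teacher generates regression targets, and students of varying granularity but equal active parameters are fit to them.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32

def make_moe(m, k, w):
    """Random ReLU MoE with m experts, granularity k, expert width w (hypothetical init)."""
    return dict(R=rng.normal(size=(m, d)),
                U=rng.normal(size=(m, w, d)) / np.sqrt(d),
                V=rng.normal(size=(m, d, w)) / np.sqrt(w), k=k)

def moe(p, X):
    Y = np.zeros_like(X)
    top = np.argsort(X @ p["R"].T, axis=1)[:, -p["k"]:]   # top-k routing per input
    for i, idx in enumerate(top):
        for j in idx:
            Y[i] += p["V"][j] @ np.maximum(p["U"][j] @ X[i], 0.0)
    return Y

teacher = make_moe(m=32, k=4, w=8)      # fine-grained teacher (4 active experts)
X = rng.normal(size=(4096, d))
Y = moe(teacher, X)                     # regression targets for the students
# Students with k = 1, 2, 4, 8 and equal active parameters (k * w fixed) would then be
# trained with SGD on ||student(X) - Y||^2; per the paper's experiments, only students
# with granularity >= 4 reach low test error, even with many more total parameters.
```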
Practical Implications and Limitations:
The research suggests that increasing granularity is a promising direction for designing more expressive and potentially higher-performing MoE models without increasing computational cost as measured by active parameters. However, the paper acknowledges that higher granularity can introduce system-level challenges, such as inter-GPU communication overhead and routing latency, which highlights the need for hardware and routing algorithms that efficiently support highly granular MoEs. The current theoretical results are specific to certain routing schemes, activation functions, and parameter regimes, leaving room for future theoretical work to generalize them. The interplay between expressivity and optimization also remains open: the paper focuses primarily on expressivity, although the experiments suggest that the added expressive power is learnable by standard algorithms such as SGD.
In summary, this paper provides strong theoretical evidence that granularity is a powerful knob for increasing the expressivity of MoE models, offering a fundamental understanding of why fine-grained experts are beneficial and pointing towards the importance of developing infrastructure that supports such architectures.