
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Published 9 May 2026 in cs.LG, cs.AI, and cs.CL | (2605.08738v1)

Abstract: Structured pruning and knowledge distillation (KD) are typical techniques for compressing LLMs, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.

Summary

  • The paper demonstrates that initializing from a structurally pruned pretrained MoE yields faster training and up to an 11.79-point improvement over random initialization.
  • The methodology combines expert merging with hybrid knowledge distillation and multi-token prediction to enhance generalization and decoding efficiency.
  • The study finds that progressive, multi-stage pruning outperforms one-shot compression, enabling robust model recovery and efficient deployment of large-scale MoE architectures.

SlimQwen: Compression Strategies for Mixture-of-Experts Pretraining

Introduction

"SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training" (2605.08738) provides a thorough empirical and methodological analysis of large-scale Mixture-of-Experts (MoE) model compression. The investigation focuses on structured pruning and knowledge distillation (KD) during pretraining, with special attention to practical recipes for expert pruning/merging, model width/depth reduction, and progressive pruning schedules. The central objective is to enable efficient deployment of post-pretraining MoE LLMs without significant loss in modeling capability.

Structured Pruning and Initialization

A primary finding is that structured pruning—removing components at the architectural level (layers, hidden units, or experts)—from a pretrained MoE model produces students that train faster and reach significantly higher downstream performance than target architectures trained from scratch under identical training budgets. On diverse benchmarks (MMLU, BBH, GSM-8K, coding, Chinese), pruned initializations achieve up to an 11.79-point absolute improvement over random initializations, recovering 86.5% of teacher performance despite 4× compression. These results demonstrate that pruning preserves crucial weight initialization and inductive biases, improving optimization and sample efficiency.
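The gain comes from reusing the teacher's weights as the student's starting point rather than random values. For the depth dimension, the paper states it prunes the last 25% of layers; a minimal sketch of that initialization step, using an illustrative list-of-blocks representation of the model:

```python
def depth_prune(layers, keep_ratio=0.75):
    """Keep the first keep_ratio fraction of transformer blocks.

    Mirrors the stated recipe of dropping the last 25% of layers; the
    plain-list representation is an assumption for illustration.
    """
    n_keep = int(len(layers) * keep_ratio)
    return layers[:n_keep]

# Toy usage: a 48-block teacher stack becomes a 36-block student init.
blocks = [f"block_{i}" for i in range(48)]
student_blocks = depth_prune(blocks)
```

The student then continues training from these retained weights instead of starting from scratch.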

Expert Compression: Pruning and Partial-Preservation Merging

The work evaluates multiple expert compression paradigms: pruning via several importance metrics (frequency, router logits, activations), merging of similar experts, and hybrid partial-preservation strategies. An important empirical result is that after extensive continual pretraining, one-shot expert compression strategies yield marginal differences in final performance across major benchmarks. However, the proposed partial-preservation merging strategy—preserving half of target experts and merging the remainder—consistently improves generalization. This strategy prevents over-homogenization of expertise within the remaining experts, maintaining the knowledge diversity necessary for robust downstream performance.
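A sketch of one plausible reading of partial-preservation merging: the most important half of the target experts are preserved verbatim, the next most important experts act as merge bases, and every remaining expert is folded into its most similar base by an importance-weighted average. The function name, cosine-similarity assignment, and weighting scheme here are our illustrative choices, not the paper's released code:

```python
import numpy as np

def partial_preservation_merge(experts, importance, n_target):
    """Compress a set of expert weight vectors down to n_target experts.

    Half of the targets are kept intact (most important experts); the
    other half receive the remaining experts via importance-weighted
    averaging, assigned by cosine similarity. Illustrative sketch only.
    """
    experts = np.asarray(experts, dtype=float)        # (n_experts, d)
    importance = np.asarray(importance, dtype=float)  # (n_experts,)
    order = np.argsort(importance)[::-1]              # most important first
    n_keep = n_target // 2
    kept = order[:n_keep]                  # preserved verbatim
    bases = order[n_keep:n_target]         # receive merges
    rest = order[n_target:]                # merged away

    unit = experts / (np.linalg.norm(experts, axis=1, keepdims=True) + 1e-9)
    groups = {b: [b] for b in bases}
    for j in rest:
        sims = unit[bases] @ unit[j]                   # cosine to each base
        groups[bases[int(np.argmax(sims))]].append(j)

    out = [experts[i] for i in kept]
    for b in bases:
        idx = np.array(groups[b])
        w = importance[idx] / importance[idx].sum()    # importance weights
        out.append(w @ experts[idx])
    return np.stack(out)
```

In a real MoE each expert is a set of weight matrices rather than a single vector, so the same merge would be applied per parameter tensor.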

Training Objectives: Hybrid Knowledge Distillation and Multi-Token Prediction

The analysis challenges conventional KD-only recovery recipes by showing that hybridizing KD with the standard language modeling (LM) loss yields superior results, especially on knowledge-intensive tasks. Moreover, multi-token prediction (MTP) distillation, in which multiple future tokens are predicted and matched to teacher soft targets, enhances backbone representations and speculative decoding efficiency, as measured by multi-token acceptance rates. Integrating MTP KD consistently outperforms next-token-only KD, offering practical throughput advantages for tasks requiring efficient multi-token generation.
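The hybrid objective can be sketched as a convex combination of the LM cross-entropy and a KD KL term. The weight `lam` and temperature `tau` below are illustrative hyperparameters (the paper's exact weighting is not reproduced here), and the MTP heads would add analogous terms over future-token positions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_loss(student_logits, teacher_logits, labels, lam=0.5, tau=1.0):
    """(1 - lam) * LM cross-entropy + lam * KL(teacher || student).

    lam and tau are illustrative; shapes are (tokens, vocab).
    """
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    lm = -log_p_s[np.arange(len(labels)), labels].mean()   # next-token CE
    p_t = softmax(teacher_logits / tau)
    log_q = np.log(softmax(student_logits / tau) + 1e-12)
    kd = (p_t * (np.log(p_t + 1e-12) - log_q)).sum(axis=-1).mean()
    return (1 - lam) * lm + lam * kd

# Toy usage with random logits.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 8))
teacher = student + 0.1 * rng.normal(size=(4, 8))
labels = np.array([1, 3, 0, 7])
loss = hybrid_loss(student, teacher, labels, lam=0.5)
```

Setting `lam=1.0` recovers KD-only training, which the paper finds inferior to the mixture on knowledge-intensive tasks.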

Progressive Pruning and Distillation Schedules

A compelling contribution is the demonstration that progressive, multi-stage pruning and distillation schedules consistently surpass one-shot compression under fixed token budgets and identical target sparsity levels. Depth-first, width-first, and joint pruning schedules were examined. Progressive depth-first strategies were found particularly effective, improving MMLU from 75.86 (one-shot) to 77.39, with similar gains on MMLU-Redux and other multi-domain evaluations. This suggests that gradual capacity reduction with interleaved training allows for smoother optimization trajectories and mitigates catastrophic forgetting, facilitating more effective knowledge transfer.
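The schedule itself is simple orchestration: cut part of the way toward the target architecture, recover with continued training, then cut again. A toy sketch, where the two-stage 40B + 360B token split echoes the budgets cited later in this summary and the dict-based model description is purely illustrative:

```python
def progressive_compress(model, stages, recover):
    """Apply (prune, recover) stages in sequence instead of one-shot pruning."""
    for prune, token_budget in stages:
        model = prune(model)             # architectural cut for this stage
        model = recover(model, token_budget)  # KD + LM continual pretraining
    return model

# Toy depth-first schedule: depth cut with a short recovery, then a
# width cut with the long recovery phase.
model = {"layers": 48, "width": 4096}
stages = [
    (lambda m: {**m, "layers": 36}, 40e9),
    (lambda m: {**m, "width": 3072}, 360e9),
]
final = progressive_compress(model, stages, recover=lambda m, tokens: m)
# final == {"layers": 36, "width": 3072}
```

One-shot compression corresponds to a single stage holding the entire token budget; the paper's finding is that splitting the cut across stages, at equal total tokens, yields better final models.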

Compression Efficacy and Efficiency Analysis

By applying the described methodology, Qwen3-Next-80A3B (~80B parameters) can be effectively compressed to a 23A2B architecture (~23B parameters), retaining competitive general and specialized task performance. The SlimQwen models demonstrate notable speedups in both training and inference, reduced memory footprint, and enhanced suitability for single-device deployment. The work provides a robust baseline for compute-constrained continual pretraining and deployment of large-scale MoE architectures.

Implications and Future Directions

The research clarifies critical aspects of MoE model compression at scale, including:

  • The empirical superiority of pruning-based initialization versus training compact architectures from scratch.
  • The necessity of combining LM loss and KD objectives, and the special advantages of multi-token distillation for both accuracy and generative efficiency.
  • The marginal difference among one-shot expert compression methods post-pretraining, but the value of hybrid partial-preservation strategies for further optimization.
  • The systematic advantage of progressive over one-shot pruning schedules for practical model recovery.

The results advocate for further research in adaptive, dynamic pruning schedules, more granular expert merging policies, and autonomous selection of hybrid training objectives. Integration with orthogonal efficiency advances, such as quantization and sparsity, can further amplify practical deployment benefits.

Conclusion

The SlimQwen study delivers a comprehensive and systematic evaluation of structured pruning, expert merging, and hybrid distillation for large MoE LLMs. The evidence establishes best practices for compressing and recovering pretrained MoE models, with direct implications for efficient training, inference, and deployment in resource-constrained environments. Future extensions should target broader architectures and modalities, adaptive compression trajectories, and further synthesis of distillation and pruning with architectural innovations.


Explain it Like I'm 14

Brief Overview

This paper is about making very large AI language models (LLMs) smaller, faster, and cheaper without losing too much skill. The authors focus on a special kind of model called a Mixture‑of‑Experts (MoE), which uses many “experts” inside to handle different kinds of text. They test different ways to shrink these models and teach the smaller model to behave like the big one. Their recipe, which they call SlimQwen, keeps strong performance while cutting size by about 4×.

Key Objectives and Questions

The paper explores three simple questions:

  1. Is it better to start from a big, trained model and carefully trim it down, or to build a small model and train it from scratch?
  2. When you shrink the “experts” inside an MoE model, does it matter how you pick which experts to remove or merge?
  3. What is the best way to train the smaller model after shrinking it so it regains as much skill as possible?

They also ask a bonus question: Is it better to shrink the model all at once or step by step?

How They Did It (Methods and Analogies)

To make this clear, think of the big model as a giant school:

  • The school building has many floors (layers/depth).
  • Each hallway has a certain width (width/hidden size) that controls how many students can pass at once.
  • The school employs many specialist teachers (experts in the MoE). A router decides which specialists to send a student to for each question.

The authors use three main tricks to compress the model:

  1. Pruning (cutting parts)
    • Depth pruning: Remove the last chunk of floors from the building (drop some layers).
    • Width pruning: Narrow the hallways (make each layer’s hidden size smaller).
    • Expert pruning/merging: Reduce the number of specialist teachers. You can either:
      • Prune: fire some experts.
      • Merge: combine less‑critical experts into stronger ones so their knowledge isn’t lost.

How they choose what to cut: They measure “importance” using a small sample of data (a calibration set). For example, they check:

  • How often an expert is used (frequency).
  • How strongly the router prefers an expert (soft logits).
  • How much an expert’s output matters (activation strength, like REAP).
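The frequency criterion, for instance, just counts how often each expert lands in the router's top-k over the calibration set. A sketch, with `k=2` as an assumed top-k value:

```python
import numpy as np

def expert_frequency(router_logits, k=2):
    """Fraction of top-k routing slots each expert wins over a calibration
    batch. router_logits has shape (tokens, n_experts); k is illustrative.
    """
    top = np.argsort(router_logits, axis=-1)[:, -k:]   # top-k expert ids per token
    n_experts = router_logits.shape[-1]
    counts = np.bincount(top.ravel(), minlength=n_experts)
    return counts / counts.sum()
```

Experts with the lowest frequency scores become candidates for pruning or for being merged away.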

A simple new idea: partial‑preservation expert merging:

  • Keep about half of the target experts completely intact.
  • For the rest, merge less‑important experts into the kept ones based on similarity.
  • This balances keeping expert specialties with not wasting space.

  2. Knowledge Distillation (KD) and Language Modeling Loss (LM)
    • Distillation is like having the big model (teacher) guide the small model (student), showing not just the right answer but also how likely other answers are.
    • They also keep the classic “predict the next word” training (LM loss). Mixing KD with LM works better than KD alone, especially for knowledge‑heavy tasks.

Multi‑Token Prediction (MTP) distillation:

  • Normally, models learn to guess the next single word. MTP teaches them to guess several future words at once.
  • This makes the student learn deeper patterns and also helps speed up generation methods that draft multiple tokens at a time (speculative decoding), because more drafted words get accepted.
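A toy illustration of why acceptance matters: in the simplest view, the verifier keeps drafted tokens up to the first mismatch, so the more teacher-like the drafts, the more tokens are confirmed per verification step. (Real speculative decoding accepts probabilistically; exact match is a simplification here.)

```python
def accepted_prefix(draft, verified):
    """Count drafted tokens the verifier keeps: the matching prefix length.

    Exact-match acceptance is a simplification of real speculative
    decoding, which accepts tokens probabilistically.
    """
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n

# Two of four drafted tokens survive here, so two tokens were generated
# for the price of one verification pass.
kept = accepted_prefix([1, 2, 3, 4], [1, 2, 9, 4])
```

Higher MTP-distilled acceptance rates mean longer surviving prefixes and thus faster generation.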

  3. Progressive pruning (step‑by‑step shrinking)
    • Instead of shrinking the school in one day, they do it in stages (like removing some floors first, then narrowing hallways later).
    • They test three schedules: depth‑first, width‑first, and joint (a bit of both each step).
    • After each step, they keep training with distillation and LM loss so the smaller school adapts.

Main Findings and Why They Matter

Here are the most important results in simple terms:

  • Starting from a pruned big model beats training a small model from scratch
    • If you first train a big model and then carefully cut it down, the smaller model learns faster and ends up more capable than a same‑size model trained from zero on the same budget.
  • Different “one‑shot” expert compression methods end up similarly good after long training
    • Whether you prune by frequency or merge by similarity, after enough continued training, performance differences are small.
  • The new partial‑preservation expert merging helps
    • Keeping half of the target experts intact and merging the rest into them gives consistent improvements across many tests.
  • Mixing KD with normal LM loss is better than KD alone
    • Especially for knowledge‑heavy benchmarks, combining both helps the student model recover more of the teacher’s know‑how.
  • Multi‑Token Prediction (MTP) distillation gives steady gains
    • It improves overall learning and makes multi‑word drafting more efficient because more drafted tokens are accepted during speculative decoding.
  • Progressive pruning beats one‑shot pruning
    • Shrinking the model step by step leads to better final results than doing it all at once, given the same total training tokens.

Real‑world highlight: They compress a big MoE model (Qwen3‑Next‑80A3B) down to a much smaller one (23A2B), roughly 4x smaller, while staying competitive on many benchmarks (like MMLU, reasoning, math, coding, and Chinese tests).

Implications and Impact

  • Faster, cheaper AI: These methods help companies and researchers run strong LLMs with much lower cost and energy use.
  • Practical playbook for MoE compression: The paper gives clear, tested guidance—prune first, prefer progressive schedules, keep partial experts, and train with KD + LM + MTP.
  • Better tools for future models: Multi‑token distillation and step‑by‑step shrinking can be reused to compress new models as they grow larger.
  • Broader access: Making powerful models smaller and more efficient can help more people and organizations use advanced AI for education, coding help, research, and beyond.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable future research.

  • Generality across architectures and scales
    • Validate whether findings transfer beyond the specific teacher (Qwen3-Next-80A3B) and student family to other MoE variants (e.g., different router types, top-1/top-2/top-k gating, no shared experts, Switch/GLU variants) and to dense-only teachers/students.
    • Test at different model sizes (both smaller and larger) and expert configurations (N_e, k, number of shared experts N_s) to establish scaling laws for compression efficacy.
    • Examine whether the advantages of pruning-based initialization over training-from-scratch persist with much longer training (e.g., >1T tokens) or under low-resource budgets.
  • Data, token budget, and curriculum
    • Quantify sensitivity of results to the pretraining mixture composition (domains, languages) and potential data contamination; provide or evaluate on contamination-controlled splits.
    • Explore token-budget schedules (stage lengths, more than two stages, nonuniform token allocation) and data curricula aligned to compression stages; identify optimal schedules for different target sparsities.
  • Depth and width pruning design
    • Move beyond “drop last 25% layers” to systematic layer selection (saliency/Fisher/Hessian/gradient-based) and assess whether noncontiguous pruning yields better continual-pretraining recovery.
    • Replace uniform width reduction with layer-wise/nonuniform width pruning; benchmark alternative width-importance metrics (weight magnitude, Fisher diagonals, curvature-aware) against the current activation-based method.
    • Analyze sensitivity to calibration-set size and composition for width-importance estimation (current 1,024 samples); establish minimal calibration data needed without harming recovery.
  • Expert compression specifics
    • Optimize the partial-preservation ratio (currently fixed at 50%) and selection of merge bases; study whether the optimal preserve/merge split depends on N_e, k, task distribution, or sparsity target.
    • Compare pairwise merging with clustering-based multi-way merges, iterative/greedy merges, and learned merge weights (e.g., via small KD-tuned adapters) rather than cosine similarity + importance scaling.
    • Investigate post-merge router adaptation: effects of retraining vs. pruning router weights, load-balancing losses, temperature/entropy regularization, and routing-collapse prevention during recovery.
    • Characterize how compression changes routing entropy, token–expert distributions, and expert specialization over training; detect and mitigate homogenization or dead experts after merging/pruning.
  • Distillation objectives and hyperparameters
    • Systematically sweep KD temperatures, label-smoothing, and logit post-processing; compare online vs. offline teacher logits and caching strategies at pretraining scale.
    • Benchmark additional distillation signals: intermediate representation matching, attention map distillation, router-logit distillation, and sequence-level KD; quantify their interactions with LM and MTP losses.
    • Study teacher–student quality mismatch (weaker teacher, different architectures/tokenizers) and its influence on recovery and final task performance.
  • Multi-token prediction (MTP) distillation
    • Ablate MTP depth D, MTP module depth/width, parameter sharing choices, and projection design; report compute overhead vs. quality and speculative-decoding efficiency trade-offs.
    • Provide end-to-end serving metrics (latency, throughput, cost) under speculative decoding, not just acceptance rates; specify verifier model/settings and analyze robustness across verifiers.
    • Evaluate whether MTP KD benefits extend to chain-of-thought reasoning, long-horizon generation, and alignment-tuned models; probe exposure-bias effects introduced by multi-step supervision.
  • Progressive pruning schedules
    • Expand beyond two stages (40B + 360B) to learned or adaptive multi-stage schedules; jointly optimize stage count, token allocation, and dimension-order (depth/width/experts) per stage.
    • Determine schedule robustness across target sparsities and architectures; identify when depth-first vs width-first vs joint scheduling is preferable and why.
  • Target search and auto-design
    • Develop methods to choose target architectures automatically (depth, width, N_e, k, N_s) under compute/latency constraints; integrate hardware-aware objectives and routing-cost models.
    • Explore co-design of per-layer width and per-layer expert counts (nonuniform MoE) to maximize retained capability at fixed FLOPs.
  • Evaluation breadth and robustness
    • Extend evaluation to safety, calibration, hallucination, multilingual beyond English/Chinese, long-context reasoning, tool use, retrieval-augmented tasks, and post-SFT/RLHF performance.
    • Assess robustness to distribution shift, adversarial prompts, and noise; measure catastrophic forgetting, especially for rare expert capabilities after compression.
    • Report statistical significance and variance across multiple runs/seeds, especially for close method comparisons.
  • Systems and efficiency reporting
    • Provide end-to-end wall-clock speedups, memory usage, and energy/CO2 metrics for training and serving; isolate contributions from depth/width/expert compression and MTP KD.
    • Quantify throughput vs. accuracy trade-offs when changing k (e.g., 10→8) and number of shared experts; include ablations on expert parallelism and routing kernel overheads.
  • Comparisons and baselines
    • Include head-to-head large-scale continual-pretraining comparisons with recent MoE compression baselines (e.g., M-SMOE, REAP pruning, Condense-to-dense) under matched token/compute budgets.
    • Add missing baselines such as random init + LM-only training, or pruning + LM-only vs pruning + KD-only at multiple scales, to separate effects of initialization vs. objective.
  • Theory and interpretability
    • Provide theoretical or mechanistic explanations for the observed convergence of different one-shot expert compression methods after long continual pretraining.
    • Analyze the dynamics of expert specialization re-emergence post-merge/prune; relate to optimization landscape smoothness under progressive schedules.
  • Reproducibility and specification gaps
    • Clarify ambiguous loss notation and schedules (e.g., λ and β in L = (1-λ) L_LM + λ L_KD + β((1-λ) L_MTP-LM + λ L_MTP-KD)), KD temperature, and exact router/load-balancing settings post-compression.
    • Release or detail the pretraining mixture, calibration data selection, and all hyperparameters for pruning importance estimation and MTP modules to enable faithful replication.

Practical Applications

Below are practical, real-world applications derived from SlimQwen’s findings on MoE pruning, expert merging, distillation, MTP KD, and progressive schedules. Each item notes sectors, potential tools/products/workflows, and feasibility assumptions or dependencies.

Immediate Applications

  • Cost-efficient deployment of LLMs via structured MoE compression
    • Sectors: software, finance, healthcare, education, public sector
    • Potential tools/products/workflows:
    • “SlimQwen-style” compression pipelines that prune depth/width/experts and apply joint KD+LM+MTP KD to produce 3–4× smaller production models with competitive quality
    • Cloud offerings with “compact-tier” models for lower-latency, lower-cost endpoints
    • On-prem deployables for regulated environments
    • Assumptions/dependencies:
    • Access to a strong pretrained MoE teacher and the right to distill (license clarity)
    • Sufficient tokens for continual pretraining (paper uses 120B–400B) to reach parity
    • Inference stack support for MoE or use of expert merging to reduce MoE complexity
  • Faster interactive assistants and code completion using MTP-distilled speculative decoding
    • Sectors: software development tools, customer support, search, education
    • Potential tools/products/workflows:
    • IDE plugins and chat assistants that leverage higher multi-token draft acceptance rates to cut latency and cost
    • Integrations with vLLM/TGI/TensorRT-LLM to add MTP heads and optimize speculative decoding
    • Assumptions/dependencies:
    • Availability of verifier model and framework support for speculative decoding
    • Implementation of MTP KD in training and compatible inference changes
  • Progressive pruning schedules for more reliable training and model upgrades
    • Sectors: AI platform teams, foundation model builders (industry/academia)
    • Potential tools/products/workflows:
    • Training orchestrators that implement depth-first/width-first/joint two-stage schedules (e.g., 40B + 360B tokens) to improve recovery vs. one-shot pruning
    • “Curriculum compression” plugins for PyTorch Lightning/DeepSpeed/Megatron
    • Assumptions/dependencies:
    • Scheduler integration into training pipelines; availability of continued pretraining compute
    • Monitoring to pick schedules per architecture/dataset
  • Robust expert merging and maintenance for MoE models
    • Sectors: LLM infrastructure teams, model maintenance/ops
    • Potential tools/products/workflows:
    • “Partial-preservation expert merging” library that keeps half of target experts intact and merges the rest by similarity/importance (router logits, REAP)
    • Routine expert surgery to downsize models without extensive architecture search
    • Assumptions/dependencies:
    • Small calibration set for importance metrics; router/expert feature extraction
    • Empirical tuning of the keep/merge ratio for different tasks/languages
  • Lower TCO and energy/carbon footprint for model serving
    • Sectors: cloud/energy, SaaS platforms, sustainability programs
    • Potential tools/products/workflows:
    • Compressed model SKUs with published performance-per-watt improvements
    • Autoscaling policies that favor compact models for most requests; switch to bigger teachers only when needed
    • Assumptions/dependencies:
    • Accurate telemetry of energy use across model sizes
    • SLAs that tolerate small performance deltas from teacher
  • Academic and startup-friendly pretraining recipes
    • Sectors: academia, startups, community labs
    • Potential tools/products/workflows:
    • Public release of pruned checkpoints as strong initializations (“pruned > from-scratch” under same token budget)
    • Course/lab materials demonstrating depth/width/expert pruning + KD + LM + MTP KD
    • Assumptions/dependencies:
    • Open or permitted access to base checkpoints; sufficient tokens for smaller-scale replication
  • Domain-specific compact models for privacy-sensitive settings
    • Sectors: healthcare (EHR summarization), finance (policy/compliance Q&A), legal (document analysis), contact centers
    • Potential tools/products/workflows:
    • Distill large teachers into smaller, on-prem models fine-tuned with domain corpora using hybrid KD+LM and MTP KD
    • Privacy-preserving inference on secured clusters with reduced memory/compute
    • Assumptions/dependencies:
    • Domain datasets with appropriate governance; internal evaluation frameworks
    • Validation for safety/compliance before production use
  • Edge and on-device assistants where MoE support exists
    • Sectors: robotics, telecom, consumer devices (where feasible)
    • Potential tools/products/workflows:
    • Smaller MoE or merged-expert variants for latency-sensitive tasks (voice, text, planning) on gateways or powerful edge devices
    • Offline features (summarization, translation) on enterprise endpoints
    • Assumptions/dependencies:
    • Inference frameworks with MoE or dense-converted variants on edge hardware
    • May require further quantization and expert merging to meet memory/latency budgets

Long-Term Applications

  • Auto-compression and autotuning services for MoE
    • Sectors: AI platforms, MLOps, cloud providers
    • Potential tools/products/workflows:
    • Automated systems that search pruning ratios, expert keep/merge splits, and progressive schedules for a given task/SLA
    • “Compression-as-a-Service” for customer checkpoints with guardrails and evaluation harnesses
    • Assumptions/dependencies:
    • More generalizable heuristics across architectures; standardized metric suites
  • Hardware–software co-design for compressed MoE
    • Sectors: semiconductor, systems software, hyperscalers
    • Potential tools/products/workflows:
    • Kernels and accelerators optimized for merged/pruned MoE patterns and speculative decoding with MTP
    • Memory-layout and routing primitives that exploit partial-preservation structure
    • Assumptions/dependencies:
    • Vendor support in libraries (cuDNN, oneDNN) and compiler stacks; sufficient adoption to justify investment
  • Compression-aware pretraining and continual learning
    • Sectors: foundation model labs, enterprise AI
    • Potential tools/products/workflows:
    • Train-while-compress pipelines that progressively reduce capacity during pretraining to save compute/energy
    • Continual learning recipes that interleave data updates with compression to maintain quality
    • Assumptions/dependencies:
    • New training curricula and validation to avoid catastrophic forgetting
  • Dynamic or modular expert loading for personalization
    • Sectors: consumer apps, enterprise productivity, robotics
    • Potential tools/products/workflows:
    • User- or task-specific experts that can be swapped/merged on demand, keeping a compact core model
    • Marketplace of expert modules with standardized merging interfaces
    • Assumptions/dependencies:
    • Robust routing and safety controls; compatibility layers across versions/vendors
  • Federated and privacy-preserving distillation
    • Sectors: healthcare, finance, government
    • Potential tools/products/workflows:
    • KD under differential privacy or secure aggregation to produce compact models from siloed data
    • On-site MTP KD that accelerates inference without centralizing sensitive data
    • Assumptions/dependencies:
    • Strong privacy guarantees; reliable local compute; governance for teacher access
  • Cross-modal and multilingual compression
    • Sectors: multimodal assistants, globalized products, media
    • Potential tools/products/workflows:
    • Extending partial-preservation merging and progressive schedules to vision/audio-text MoEs and low-resource languages
    • Task- and language-aware expert selection/merging policies
    • Assumptions/dependencies:
    • Evidence of transferability beyond text-only MoE; larger evaluation suites
  • Safety, robustness, and compliance frameworks for compressed models
    • Sectors: policy, regulated industries
    • Potential tools/products/workflows:
    • Auditing pipelines that measure how pruning/merging/distillation affect bias, hallucination, and calibration
    • Model cards that report compression methods, token budgets, and performance–energy trade-offs
    • Assumptions/dependencies:
    • Consensus metrics/regulations; access to teacher/student logs for auditing
  • Advanced speculative decoding research leveraging MTP
    • Sectors: inference platforms, interactive applications
    • Potential tools/products/workflows:
    • New drafting/verifier algorithms tuned for higher-depth MTP predictions to further boost acceptance rates
    • Adaptive k strategies conditioned on content/domain
    • Assumptions/dependencies:
    • Theoretical and empirical work on stability; framework support for flexible drafting policies
  • Edge robotics and real-time systems with compressed language reasoning
    • Sectors: robotics, IoT, autonomous systems
    • Potential tools/products/workflows:
    • Compact reasoning modules for planning and dialogue onboard robots or vehicles
    • Hybrid setups where small on-device models handle most queries; escalate to cloud only for complex cases
    • Assumptions/dependencies:
    • Deterministic latency, real-time safety validation; further compression and specialized kernels for embedded hardware

Notes on feasibility across applications:

  • Performance parity with large teachers depends on significant continued pretraining (120B–400B tokens in the paper); smaller budgets may yield smaller gains.
  • Legal and ethical use requires attention to teacher licenses, data rights, and domain-specific regulations (especially in healthcare/finance).
  • MoE inference support is improving but uneven across edge stacks; heavy expert merging or densification may be needed for mobile-class deployments.
  • The “keep half” partial-preservation rule is a pragmatic default; practitioners should tune the preservation fraction for task/language mix.

Glossary

  • Acceptance rate: The proportion of drafted tokens accepted by the verifier during speculative decoding; higher rates indicate more efficient multi-token generation. Example: "improving the acceptance rate in multi-token speculative decoding."
  • Activation statistics: Aggregate measures (e.g., mean absolute activations) used to estimate the importance of units or dimensions for pruning decisions. Example: "We estimate the importance of each hidden dimension using activation statistics computed on a sampled calibration dataset"
  • Calibration dataset: A small sample of data used to compute pruning-related statistics (e.g., importance scores) prior to compression. Example: "a sampled calibration dataset D from our training dataset."
  • Cosine schedule: A learning-rate schedule that decays following a cosine curve, often used to stabilize training. Example: "decaying to 3e-5 via a cosine schedule"
  • Continual pretraining: Further large-scale pretraining of a model after modifications (e.g., pruning or merging) to recover or improve performance. Example: "their efficacy following large-scale continual pretraining remains unexplored."
  • Depth pruning: Structured pruning that removes entire transformer layers to reduce model depth and compute. Example: "In our experiments, we prune the last 25% of layers."
  • Expert compression: Reducing the number or size of experts in an MoE model through pruning or merging while preserving performance. Example: "Regarding expert compression, we compare various compression strategies, including pruning and merging."
  • Expert importance: A quantitative score estimating an expert’s contribution (e.g., via frequency, logits, or REAP) to guide pruning or merging decisions. Example: "The initial step involves quantifying expert importance with various criteria."
  • Expert merging: Combining parameters of multiple experts into fewer experts to compress MoE modules while retaining knowledge. Example: "For expert merging, we need to identify both the target clusters and the interpolation weights."
  • Expert pruning: Removing less important experts to reduce model size and memory footprint. Example: "expert pruning/merging including removing or merging a number of experts in MoE module."
  • Gated Attention: An attention mechanism augmented with gating to improve non-linearity and sparsity characteristics. Example: "Gated Attention modules"
  • Gated DeltaNet: A gated module variant (DeltaNet) used within the transformer blocks to improve efficiency or representation. Example: "Gated DeltaNet"
  • Hybrid-attention MoE-based model: An MoE architecture that mixes different attention types (e.g., full and linear) across layers. Example: "Qwen3-Next is a hybrid-attention MoE-based model"
  • Importance metric: A computed score (e.g., from activations) used to rank dimensions or experts for pruning. Example: "compute the importance metric."
  • KL-divergence: A measure of divergence between two probability distributions commonly used as a distillation loss. Example: "We minimize the KL-divergence between teacher and student"
  • Language modeling (LM) loss: The standard next-token prediction objective used to train LLMs. Example: "standard language modeling (LM) loss"
  • Linear decay schedule: A training schedule where a loss weight decreases linearly over time. Example: "regulated by a linear decay schedule"
  • Mixture-of-Experts (MoE): An architecture that routes inputs to a subset of specialized expert networks to scale model capacity efficiently. Example: "Mixture-of-Experts (MoE) has become a dominant architecture for scaling LLMs"
  • Multi-token prediction (MTP) distillation: A distillation objective supervising multiple future tokens to improve backbone dynamics and speculative decoding efficiency. Example: "We further propose multi-token prediction (MTP) distillation, which yields consistent gains."
  • Next-token knowledge distillation (NTP KD): Distillation where the student matches the teacher’s next-token distribution. Example: "hybridizing next-token knowledge distillation (NTP KD) with a standard language modeling (LM) loss"
  • One-shot compression: A single-step compression of a model to its target architecture without intermediate stages. Example: "progressive pruning schedules outperform one-shot compression"
  • Partial-preservation expert merging strategy: A merging scheme that keeps a subset of top experts intact while merging the rest into them to avoid homogenization. Example: "we introduce a simple partial-preservation expert merging strategy"
  • Progressive pruning schedules: Multi-stage compression procedures that gradually reduce model capacity (depth/width/experts) for smoother optimization. Example: "progressive pruning schedules outperform one-shot compression"
  • REAP (router-weighted expert output activation): An expert-importance criterion weighting expert outputs by router scores to guide pruning. Example: "router-weighted expert output activation (REAP)"
  • RMSNorm: A normalization method that scales activations by their root mean square without centering. Example: "Qwen3-Next uses the RMSNorm normalizing function"
  • Router logits: The unnormalized scores produced by the MoE router used to select top-k experts. Example: "router logits $z(x) = R(x)$"
  • Router weights: The parameters of the MoE router that produce gating/logit scores for expert selection. Example: "router weights and output activation $E_j(x)$ among each expert."
  • Routed experts: The subset of experts selected by the router for a given token. Example: "The router produces top-$k$ gating scores over the routed experts:"
  • Shared experts: Experts that are always available (not gated by top-k) and receive a separate shared gate. Example: "we apply a separate shared gate $z_{\mathrm{s}}(x)=\sigma(xw_{\mathrm{sh}})\in\mathbb{R}^{n_{\text{shared}}}$ for shared experts."
  • Speculative decoding: A generation technique that drafts multiple tokens and verifies them to accelerate inference. Example: "for speculative decoding across both pretraining and supervised fine-tuning (SFT)"
  • Structured pruning: Removing entire architectural components (layers, heads, experts) to yield real speedups without sparse kernels. Example: "Structured pruning and knowledge distillation (KD) are typical techniques for compressing LLMs"
  • SwiGLU: An activation-gated MLP variant combining SiLU and linear paths, often used in transformer FFNs. Example: "Each expert is a SwiGLU MLP:"
  • Top-k gating: Selecting the top-k experts per token based on router scores. Example: "The router produces top-$k$ gating scores over the routed experts:"
  • Width pruning: Reducing hidden dimensions (e.g., in attention, FFN, normalization) to shrink model width. Example: "For width pruning, we reduce the hidden dimension across the entire architecture"
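Several glossary entries above (router logits, top-k gating, shared experts, SwiGLU) fit together in one forward pass. The sketch below is a minimal NumPy illustration of how they compose for a single token, assuming softmax renormalization over the top-k logits and a sigmoid shared gate; it is not Qwen3-Next's actual implementation, and all function and parameter names are hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_expert(x, w_gate, w_up, w_down):
    # SwiGLU MLP: SiLU-gated linear unit followed by a down projection.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def moe_forward(x, router_w, experts, shared_experts, w_shared_gate, k=2):
    """One token through a top-k MoE layer with always-on shared experts.

    router_w:       (d, n_experts) router weights producing logits z(x).
    experts:        list of (w_gate, w_up, w_down) tuples, routed experts.
    shared_experts: same format, but always active (not gated by top-k).
    w_shared_gate:  (d, n_shared) weights for the sigmoid shared gate.
    """
    logits = x @ router_w                       # router logits z(x) = R(x)
    topk = np.argsort(logits)[-k:]              # indices of top-k experts
    e = np.exp(logits[topk] - logits[topk].max())
    gates = e / e.sum()                         # renormalized gating scores

    y = np.zeros_like(x)
    for g, idx in zip(gates, topk):
        y += g * swiglu_expert(x, *experts[idx])

    # Shared experts use a separate gate z_s(x) = sigma(x w_sh).
    shared_gate = 1.0 / (1.0 + np.exp(-(x @ w_shared_gate)))
    for j, params in enumerate(shared_experts):
        y += shared_gate[j] * swiglu_expert(x, *params)
    return y
```

Expert pruning and merging, as defined above, operate on the `experts` list (dropping or interpolating entries), while width pruning would shrink the dimensions of every weight matrix in this sketch.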
