SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Abstract: Structured pruning and knowledge distillation (KD) are typical techniques for compressing LLMs, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
Explain it Like I'm 14
Brief Overview
This paper is about making very large language models (LLMs) smaller, faster, and cheaper without losing too much skill. The authors focus on a special kind of model called a Mixture‑of‑Experts (MoE), which uses many “experts” inside to handle different kinds of text. They test different ways to shrink these models and teach the smaller model to behave like the big one. Their recipe, which they call SlimQwen, keeps strong performance while cutting size by about 4x.
Key Objectives and Questions
The paper explores three simple questions:
- Is it better to start from a big, trained model and carefully trim it down, or to build a small model and train it from scratch?
- When you shrink the “experts” inside an MoE model, does it matter how you pick which experts to remove or merge?
- What is the best way to train the smaller model after shrinking it so it regains as much skill as possible?
They also ask a bonus question: Is it better to shrink the model all at once or step by step?
How They Did It (Methods and Analogies)
To make this clear, think of the big model as a giant school:
- The school building has many floors (layers/depth).
- Each hallway has a certain width (width/hidden size) that controls how many students can pass at once.
- The school employs many specialist teachers (experts in the MoE). A router decides which specialists to send a student to for each question.
The authors use three main tricks to compress the model:
- Pruning (cutting parts)
- Depth pruning: Remove the last chunk of floors from the building (drop some layers).
- Width pruning: Narrow the hallways (make each layer’s hidden size smaller).
- Expert pruning/merging: Reduce the number of specialist teachers. You can either:
- Prune: fire some experts.
- Merge: combine less‑critical experts into stronger ones so their knowledge isn’t lost.
How they choose what to cut: They measure “importance” using a small sample of data (a calibration set). For example, they check:
- How often an expert is used (frequency).
- How strongly the router prefers an expert (soft logits).
- How much an expert’s output matters (activation strength, like REAP).
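To make these criteria concrete, here is a minimal PyTorch-style sketch of how the three importance scores might be computed from a calibration batch; the function name, tensor shapes, and the exact REAP formula are illustrative assumptions, not the paper's code.

```python
# Minimal sketch (not the paper's exact code) of three expert-importance
# criteria computed from a calibration batch: routing frequency, mean router
# probability ("soft logits"), and a REAP-style router-weighted output norm.
import torch

def expert_importance(router_logits, expert_outputs, top_k=8):
    """
    router_logits:  [num_tokens, num_experts]    raw router scores
    expert_outputs: [num_tokens, num_experts, d] per-expert outputs (illustrative;
                    in practice only the routed experts are actually evaluated)
    """
    probs = torch.softmax(router_logits, dim=-1)        # soft routing weights
    topk = torch.topk(probs, top_k, dim=-1).indices     # experts chosen per token
    num_experts = router_logits.shape[-1]

    # 1) Frequency: how often each expert appears in the top-k selection.
    freq = torch.zeros(num_experts)
    freq.scatter_add_(0, topk.reshape(-1), torch.ones(topk.numel()))
    freq = freq / freq.sum()

    # 2) Soft logits: average routing probability assigned to each expert.
    soft = probs.mean(dim=0)

    # 3) REAP-style: router probability times the norm of the expert's output,
    #    so experts that are both selected and produce large activations rank high.
    reap = (probs * expert_outputs.norm(dim=-1)).mean(dim=0)

    return freq, soft, reap
```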
A simple new idea: partial‑preservation expert merging:
- Keep about half of the target experts completely intact.
- For the rest, merge less‑important experts into the kept ones based on similarity.
- This balances keeping expert specialties with not wasting space.
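Below is a hedged PyTorch sketch of what partial-preservation merging could look like: keep the most important experts as merge bases, preserve roughly half of them untouched, and fold each discarded expert into its most similar non-preserved base with importance-weighted interpolation. The 50/50 split, cosine-similarity assignment, and function names are illustrative assumptions rather than the paper's exact algorithm.

```python
# Minimal sketch (assumptions, not the paper's exact algorithm) of
# partial-preservation expert merging.
import torch

def partial_preservation_merge(expert_weights, importance, num_target):
    """
    expert_weights: [num_experts, p]  flattened parameters of each expert
    importance:     [num_experts]     importance scores (e.g., REAP)
    num_target:     number of experts kept in the compressed model
    """
    order = torch.argsort(importance, descending=True)
    kept = order[:num_target]                  # merge bases, the experts we keep
    mergeable = kept[num_target // 2:]         # roughly half of them may absorb donors
    donors = order[num_target:]                # experts that will be merged away

    merged = expert_weights[kept].clone()
    weight = importance[kept].clone()
    for d in donors:
        # Assign the donor to the most similar mergeable base (cosine similarity).
        sims = torch.nn.functional.cosine_similarity(
            expert_weights[d].unsqueeze(0), expert_weights[mergeable], dim=-1)
        base = mergeable[sims.argmax()]
        slot = (kept == base).nonzero(as_tuple=True)[0]
        # Importance-weighted interpolation of the donor into the base expert.
        w_base, w_donor = weight[slot], importance[d]
        merged[slot] = (w_base * merged[slot] + w_donor * expert_weights[d]) / (w_base + w_donor)
        weight[slot] = w_base + w_donor
    return merged, kept
```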
- Knowledge Distillation (KD) and Language Modeling Loss (LM)
- Distillation is like having the big model (teacher) guide the small model (student), showing not just the right answer but also how likely other answers are.
- They also keep the classic “predict the next word” training (LM loss). Mixing KD with LM works better than KD alone, especially for knowledge‑heavy tasks.
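As a rough illustration of mixing KD with the LM loss, here is a small PyTorch sketch; the weight `lam`, the temperature, and the forward-KL formulation are illustrative assumptions (the paper uses a KL-divergence KD term with a decaying weight, but exact settings may differ).

```python
# Minimal sketch of the hybrid objective: KL distillation from the teacher's
# next-token distribution plus the standard language-modeling cross-entropy.
import torch.nn.functional as F

def hybrid_kd_lm_loss(student_logits, teacher_logits, labels, lam=0.5, temperature=1.0):
    """
    student_logits, teacher_logits: [batch, seq, vocab]
    labels:                         [batch, seq] next-token targets
    lam: weight on the KD term; (1 - lam) goes to the LM term
    """
    vocab = student_logits.size(-1)
    s = student_logits.reshape(-1, vocab)
    t = teacher_logits.reshape(-1, vocab)

    # KD term: KL between teacher and student (temperature-softened) distributions,
    # scaled by T^2 as is conventional for distillation.
    kd = F.kl_div(
        F.log_softmax(s / temperature, dim=-1),
        F.softmax(t / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # LM term: standard next-token cross-entropy against the ground-truth labels.
    lm = F.cross_entropy(s, labels.reshape(-1))

    return (1.0 - lam) * lm + lam * kd
```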
Multi‑Token Prediction (MTP) distillation:
- Normally, models learn to guess the next single word. MTP teaches them to guess several future words at once.
- This makes the student learn deeper patterns and also helps speed up generation methods that draft multiple tokens at a time (speculative decoding), because more drafted words get accepted.
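A hedged sketch of MTP distillation follows: each extra prediction head is supervised by both the shifted ground-truth labels and the teacher's corresponding distribution. The list-of-heads interface and equal per-depth weighting are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of multi-token prediction (MTP) distillation: the d-th head
# predicts the token d steps beyond the usual next token, and every head gets
# both an LM (cross-entropy) and a KD (KL to the teacher) loss.
import torch.nn.functional as F

def mtp_kd_loss(student_mtp_logits, teacher_mtp_logits, labels, lam=0.5):
    """
    student_mtp_logits, teacher_mtp_logits: lists of length D of [batch, seq, vocab]
    labels: [batch, seq] next-token ids
    """
    total = 0.0
    for d in range(1, len(student_mtp_logits) + 1):
        tgt = labels[:, d:]                                   # shift targets d steps ahead
        vocab = student_mtp_logits[d - 1].size(-1)
        s = student_mtp_logits[d - 1][:, : tgt.size(1)].reshape(-1, vocab)
        t = teacher_mtp_logits[d - 1][:, : tgt.size(1)].reshape(-1, vocab)
        lm = F.cross_entropy(s, tgt.reshape(-1))              # MTP language-modeling term
        kd = F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1),
                      reduction="batchmean")                  # MTP distillation term
        total = total + (1.0 - lam) * lm + lam * kd
    return total / len(student_mtp_logits)
```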
- Progressive pruning (step‑by‑step shrinking)
- Instead of shrinking the school in one day, they do it in stages (like removing some floors first, then narrowing hallways later).
- They test three schedules: depth‑first, width‑first, and joint (a bit of both each step).
- After each step, they keep training with distillation and LM loss so the smaller school adapts.
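The sketch below lays out the three two-stage schedules as simple configurations. The stage token budgets (40B, then 360B) follow the paper; the size fractions are placeholders for illustration, not the actual architecture values.

```python
# Illustrative sketch of depth-first, width-first, and joint two-stage schedules.
# Fractions are placeholders; only the 40B + 360B stage split comes from the paper.
TARGET = {"layers": 0.75, "hidden": 0.75, "experts": 0.5}  # final size vs. teacher (placeholders)

SCHEDULES = {
    # Depth-first: remove layers in stage 1, then shrink width/experts in stage 2.
    "depth_first": [
        {"tokens": "40B", "layers": TARGET["layers"], "hidden": 1.0, "experts": 1.0},
        {"tokens": "360B", **TARGET},
    ],
    # Width-first: shrink width in stage 1, then remove layers/experts in stage 2.
    "width_first": [
        {"tokens": "40B", "layers": 1.0, "hidden": TARGET["hidden"], "experts": 1.0},
        {"tokens": "360B", **TARGET},
    ],
    # Joint: move part of the way along every dimension at each stage.
    "joint": [
        {"tokens": "40B", "layers": 0.875, "hidden": 0.875, "experts": 0.75},
        {"tokens": "360B", **TARGET},
    ],
}
# After each stage, the pruned model is trained with the KD + LM (+ MTP KD)
# objective before the next pruning step is applied.
```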
Main Findings and Why They Matter
Here are the most important results in simple terms:
- Starting from a pruned big model beats training a small model from scratch
- If you first train a big model and then carefully cut it down, the smaller model learns faster and ends up more capable than a same‑size model trained from zero on the same budget.
- Different “one‑shot” expert compression methods end up similarly good after long training
- Whether you prune by frequency or merge by similarity, after enough continued training, performance differences are small.
- The new partial‑preservation expert merging helps
- Keeping half of the target experts intact and merging the rest into them gives consistent improvements across many tests.
- Mixing KD with normal LM loss is better than KD alone
- Especially for knowledge‑heavy benchmarks, combining both helps the student model recover more of the teacher’s know‑how.
- Multi‑Token Prediction (MTP) distillation gives steady gains
- It improves overall learning and makes multi‑word drafting more efficient because more drafted tokens are accepted during speculative decoding.
- Progressive pruning beats one‑shot pruning
- Shrinking the model step by step leads to better final results than doing it all at once, given the same total training tokens.
Real‑world highlight: They compress a big MoE model (Qwen3‑Next‑80A3B) down to a much smaller one (23A2B), roughly 4x smaller, while staying competitive on many benchmarks (like MMLU, reasoning, math, coding, and Chinese tests).
Implications and Impact
- Faster, cheaper AI: These methods help companies and researchers run strong LLMs with much lower cost and energy use.
- Practical playbook for MoE compression: The paper gives clear, tested guidance: prune a pretrained model rather than training small from scratch, prefer progressive schedules, preserve a subset of experts when merging, and train with KD + LM + MTP distillation.
- Better tools for future models: Multi‑token distillation and step‑by‑step shrinking can be reused to compress new models as they grow larger.
- Broader access: Making powerful models smaller and more efficient can help more people and organizations use advanced AI for education, coding help, research, and beyond.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a focused list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable future research.
- Generality across architectures and scales
- Validate whether findings transfer beyond the specific teacher (Qwen3-Next-80A3B) and student family to other MoE variants (e.g., different router types, top-1/top-2/top-k gating, no shared experts, Switch/GLU variants) and to dense-only teachers/students.
- Test at different model sizes (both smaller and larger) and expert configurations (N_e, k, number of shared experts N_s) to establish scaling laws for compression efficacy.
- Examine whether the advantages of pruning-based initialization over training-from-scratch persist with much longer training (e.g., >1T tokens) or under low-resource budgets.
- Data, token budget, and curriculum
- Quantify sensitivity of results to the pretraining mixture composition (domains, languages) and potential data contamination; provide or evaluate on contamination-controlled splits.
- Explore token-budget schedules (stage lengths, more than two stages, nonuniform token allocation) and data curricula aligned to compression stages; identify optimal schedules for different target sparsities.
- Depth and width pruning design
- Move beyond “drop last 25% layers” to systematic layer selection (saliency/Fisher/Hessian/gradient-based) and assess whether noncontiguous pruning yields better continual-pretraining recovery.
- Replace uniform width reduction with layer-wise/nonuniform width pruning; benchmark alternative width-importance metrics (weight magnitude, Fisher diagonals, curvature-aware) against the current activation-based method.
- Analyze sensitivity to calibration-set size and composition for width-importance estimation (current 1,024 samples); establish minimal calibration data needed without harming recovery.
- Expert compression specifics
- Optimize the partial-preservation ratio (currently fixed at 50%) and selection of merge bases; study whether the optimal preserve/merge split depends on N_e, k, task distribution, or sparsity target.
- Compare pairwise merging with clustering-based multi-way merges, iterative/greedy merges, and learned merge weights (e.g., via small KD-tuned adapters) rather than cosine similarity + importance scaling.
- Investigate post-merge router adaptation: effects of retraining vs. pruning router weights, load-balancing losses, temperature/entropy regularization, and routing-collapse prevention during recovery.
- Characterize how compression changes routing entropy, token–expert distributions, and expert specialization over training; detect and mitigate homogenization or dead experts after merging/pruning.
- Distillation objectives and hyperparameters
- Systematically sweep KD temperatures, label-smoothing, and logit post-processing; compare online vs. offline teacher logits and caching strategies at pretraining scale.
- Benchmark additional distillation signals: intermediate representation matching, attention map distillation, router-logit distillation, and sequence-level KD; quantify their interactions with LM and MTP losses.
- Study teacher–student quality mismatch (weaker teacher, different architectures/tokenizers) and its influence on recovery and final task performance.
- Multi-token prediction (MTP) distillation
- Ablate MTP depth D, MTP module depth/width, parameter sharing choices, and projection design; report compute overhead vs. quality and speculative-decoding efficiency trade-offs.
- Provide end-to-end serving metrics (latency, throughput, cost) under speculative decoding, not just acceptance rates; specify verifier model/settings and analyze robustness across verifiers.
- Evaluate whether MTP KD benefits extend to chain-of-thought reasoning, long-horizon generation, and alignment-tuned models; probe exposure-bias effects introduced by multi-step supervision.
- Progressive pruning schedules
- Expand beyond two stages (40B + 360B) to learned or adaptive multi-stage schedules; jointly optimize stage count, token allocation, and dimension-order (depth/width/experts) per stage.
- Determine schedule robustness across target sparsities and architectures; identify when depth-first vs width-first vs joint scheduling is preferable and why.
- Target search and auto-design
- Develop methods to choose target architectures automatically (depth, width, N_e, k, N_s) under compute/latency constraints; integrate hardware-aware objectives and routing-cost models.
- Explore co-design of per-layer width and per-layer expert counts (nonuniform MoE) to maximize retained capability at fixed FLOPs.
- Evaluation breadth and robustness
- Extend evaluation to safety, calibration, hallucination, multilingual beyond English/Chinese, long-context reasoning, tool use, retrieval-augmented tasks, and post-SFT/RLHF performance.
- Assess robustness to distribution shift, adversarial prompts, and noise; measure catastrophic forgetting, especially for rare expert capabilities after compression.
- Report statistical significance and variance across multiple runs/seeds, especially for close method comparisons.
- Systems and efficiency reporting
- Provide end-to-end wall-clock speedups, memory usage, and energy/CO2 metrics for training and serving; isolate contributions from depth/width/expert compression and MTP KD.
- Quantify throughput vs. accuracy trade-offs when changing k (e.g., 10→8) and the number of shared experts; include ablations on expert parallelism and routing kernel overheads.
- Comparisons and baselines
- Include head-to-head large-scale continual-pretraining comparisons with recent MoE compression baselines (e.g., M-SMOE, REAP pruning, Condense-to-dense) under matched token/compute budgets.
- Add missing baselines such as random init + LM-only training, or pruning + LM-only vs pruning + KD-only at multiple scales, to separate effects of initialization vs. objective.
- Theory and interpretability
- Provide theoretical or mechanistic explanations for the observed convergence of different one-shot expert compression methods after long continual pretraining.
- Analyze the dynamics of expert specialization re-emergence post-merge/prune; relate to optimization landscape smoothness under progressive schedules.
- Reproducibility and specification gaps
- Clarify ambiguous loss notation and schedules (e.g., the role of λ in L = (1−λ) L_LM + λ L_KD + β((1−λ) L_MTP-LM + λ L_MTP-KD)), the KD temperature, and exact router/load-balancing settings post-compression.
- Release or detail the pretraining mixture, calibration data selection, and all hyperparameters for pruning importance estimation and MTP modules to enable faithful replication.
Practical Applications
Below are practical, real-world applications derived from SlimQwen’s findings on MoE pruning, expert merging, distillation, MTP KD, and progressive schedules. Each item notes sectors, potential tools/products/workflows, and feasibility assumptions or dependencies.
Immediate Applications
- Cost-efficient deployment of LLMs via structured MoE compression
- Sectors: software, finance, healthcare, education, public sector
- Potential tools/products/workflows:
- “SlimQwen-style” compression pipelines that prune depth/width/experts and apply joint KD+LM+MTP KD to produce 3–4× smaller production models with competitive quality
- Cloud offerings with “compact-tier” models for lower-latency, lower-cost endpoints
- On-prem deployables for regulated environments
- Assumptions/dependencies:
- Access to a strong pretrained MoE teacher and the right to distill (license clarity)
- Sufficient tokens for continual pretraining (paper uses 120B–400B) to reach parity
- Inference stack support for MoE or use of expert merging to reduce MoE complexity
- Faster interactive assistants and code completion using MTP-distilled speculative decoding
- Sectors: software development tools, customer support, search, education
- Potential tools/products/workflows:
- IDE plugins and chat assistants that leverage higher multi-token draft acceptance rates to cut latency and cost
- Integrations with vLLM/TGI/TensorRT-LLM to add MTP heads and optimize speculative decoding
- Assumptions/dependencies:
- Availability of verifier model and framework support for speculative decoding
- Implementation of MTP KD in training and compatible inference changes
- Progressive pruning schedules for more reliable training and model upgrades
- Sectors: AI platform teams, foundation model builders (industry/academia)
- Potential tools/products/workflows:
- Training orchestrators that implement depth-first/width-first/joint two-stage schedules (e.g., 40B + 360B tokens) to improve recovery vs. one-shot pruning
- “Curriculum compression” plugins for PyTorch Lightning/DeepSpeed/Megatron
- Assumptions/dependencies:
- Scheduler integration into training pipelines; availability of continued pretraining compute
- Monitoring to pick schedules per architecture/dataset
- Robust expert merging and maintenance for MoE models
- Sectors: LLM infrastructure teams, model maintenance/ops
- Potential tools/products/workflows:
- “Partial-preservation expert merging” library that keeps half of target experts intact and merges the rest by similarity/importance (router logits, REAP)
- Routine expert surgery to downsize models without extensive architecture search
- Assumptions/dependencies:
- Small calibration set for importance metrics; router/expert feature extraction
- Empirical tuning of the keep/merge ratio for different tasks/languages
- Lower TCO and energy/carbon footprint for model serving
- Sectors: cloud/energy, SaaS platforms, sustainability programs
- Potential tools/products/workflows:
- Compressed model SKUs with published performance-per-watt improvements
- Autoscaling policies that favor compact models for most requests; switch to bigger teachers only when needed
- Assumptions/dependencies:
- Accurate telemetry of energy use across model sizes
- SLAs that tolerate small performance deltas from teacher
- Academic and startup-friendly pretraining recipes
- Sectors: academia, startups, community labs
- Potential tools/products/workflows:
- Public release of pruned checkpoints as strong initializations (“pruned > from-scratch” under same token budget)
- Course/lab materials demonstrating depth/width/expert pruning + KD + LM + MTP KD
- Assumptions/dependencies:
- Open or permitted access to base checkpoints; sufficient tokens for smaller-scale replication
- Domain-specific compact models for privacy-sensitive settings
- Sectors: healthcare (EHR summarization), finance (policy/compliance Q&A), legal (document analysis), contact centers
- Potential tools/products/workflows:
- Distill large teachers into smaller, on-prem models fine-tuned with domain corpora using hybrid KD+LM and MTP KD
- Privacy-preserving inference on secured clusters with reduced memory/compute
- Assumptions/dependencies:
- Domain datasets with appropriate governance; internal evaluation frameworks
- Validation for safety/compliance before production use
- Edge and on-device assistants where MoE support exists
- Sectors: robotics, telecom, consumer devices (where feasible)
- Potential tools/products/workflows:
- Smaller MoE or merged-expert variants for latency-sensitive tasks (voice, text, planning) on gateways or powerful edge devices
- Offline features (summarization, translation) on enterprise endpoints
- Assumptions/dependencies:
- Inference frameworks with MoE or dense-converted variants on edge hardware
- May require further quantization and expert merging to meet memory/latency budgets
Long-Term Applications
- Auto-compression and autotuning services for MoE
- Sectors: AI platforms, MLOps, cloud providers
- Potential tools/products/workflows:
- Automated systems that search pruning ratios, expert keep/merge splits, and progressive schedules for a given task/SLA
- “Compression-as-a-Service” for customer checkpoints with guardrails and evaluation harnesses
- Assumptions/dependencies:
- More generalizable heuristics across architectures; standardized metric suites
- Hardware–software co-design for compressed MoE
- Sectors: semiconductor, systems software, hyperscalers
- Potential tools/products/workflows:
- Kernels and accelerators optimized for merged/pruned MoE patterns and speculative decoding with MTP
- Memory-layout and routing primitives that exploit partial-preservation structure
- Assumptions/dependencies:
- Vendor support in libraries (cuDNN, oneDNN) and compiler stacks; sufficient adoption to justify investment
- Compression-aware pretraining and continual learning
- Sectors: foundation model labs, enterprise AI
- Potential tools/products/workflows:
- Train-while-compress pipelines that progressively reduce capacity during pretraining to save compute/energy
- Continual learning recipes that interleave data updates with compression to maintain quality
- Assumptions/dependencies:
- New training curricula and validation to avoid catastrophic forgetting
- Dynamic or modular expert loading for personalization
- Sectors: consumer apps, enterprise productivity, robotics
- Potential tools/products/workflows:
- User- or task-specific experts that can be swapped/merged on demand, keeping a compact core model
- Marketplace of expert modules with standardized merging interfaces
- Assumptions/dependencies:
- Robust routing and safety controls; compatibility layers across versions/vendors
- Federated and privacy-preserving distillation
- Sectors: healthcare, finance, government
- Potential tools/products/workflows:
- KD under differential privacy or secure aggregation to produce compact models from siloed data
- On-site MTP KD that accelerates inference without centralizing sensitive data
- Assumptions/dependencies:
- Strong privacy guarantees; reliable local compute; governance for teacher access
- Cross-modal and multilingual compression
- Sectors: multimodal assistants, globalized products, media
- Potential tools/products/workflows:
- Extending partial-preservation merging and progressive schedules to vision/audio-text MoEs and low-resource languages
- Task- and language-aware expert selection/merging policies
- Assumptions/dependencies:
- Evidence of transferability beyond text-only MoE; larger evaluation suites
- Safety, robustness, and compliance frameworks for compressed models
- Sectors: policy, regulated industries
- Potential tools/products/workflows:
- Auditing pipelines that measure how pruning/merging/distillation affect bias, hallucination, and calibration
- Model cards that report compression methods, token budgets, and performance–energy trade-offs
- Assumptions/dependencies:
- Consensus metrics/regulations; access to teacher/student logs for auditing
- Advanced speculative decoding research leveraging MTP
- Sectors: inference platforms, interactive applications
- Potential tools/products/workflows:
- New drafting/verifier algorithms tuned for higher-depth MTP predictions to further boost acceptance rates
- Adaptive k strategies conditioned on content/domain
- Assumptions/dependencies:
- Theoretical and empirical work on stability; framework support for flexible drafting policies
- Edge robotics and real-time systems with compressed language reasoning
- Sectors: robotics, IoT, autonomous systems
- Potential tools/products/workflows:
- Compact reasoning modules for planning and dialogue onboard robots or vehicles
- Hybrid setups where small on-device models handle most queries; escalate to cloud only for complex cases
- Assumptions/dependencies:
- Deterministic latency, real-time safety validation; further compression and specialized kernels for embedded hardware
Notes on feasibility across applications:
- Performance parity with large teachers depends on significant continued pretraining (120B–400B tokens in the paper); smaller budgets may yield smaller gains.
- Legal and ethical use requires attention to teacher licenses, data rights, and domain-specific regulations (especially in healthcare/finance).
- MoE inference support is improving but uneven across edge stacks; heavy expert merging or densification may be needed for mobile-class deployments.
- The “keep half” partial-preservation rule is a pragmatic default; practitioners should tune the preservation fraction for task/language mix.
Glossary
- Acceptance rate: The proportion of drafted tokens accepted by the verifier during speculative decoding; higher rates indicate more efficient multi-token generation. Example: "improving the acceptance rate in multi-token speculative decoding."
- Activation statistics: Aggregate measures (e.g., mean absolute activations) used to estimate the importance of units or dimensions for pruning decisions. Example: "We estimate the importance of each hidden dimension using activation statistics computed on a sampled calibration dataset"
- Calibration dataset: A small sample of data used to compute pruning-related statistics (e.g., importance scores) prior to compression. Example: "a sampled calibration dataset from our training dataset."
- Cosine schedule: A learning-rate schedule that decays following a cosine curve, often used to stabilize training. Example: "decaying to 3e-5 via a cosine schedule"
- Continual pretraining: Further large-scale pretraining of a model after modifications (e.g., pruning or merging) to recover or improve performance. Example: "their efficacy following large-scale continual pretraining remains unexplored."
- Depth pruning: Structured pruning that removes entire transformer layers to reduce model depth and compute. Example: "In our experiments, we prune the last 25\% layers."
- Expert compression: Reducing the number or size of experts in an MoE model through pruning or merging while preserving performance. Example: "Regarding expert compression, we compare various compression strategies, including pruning and merging."
- Expert importance: A quantitative score estimating an expert’s contribution (e.g., via frequency, logits, or REAP) to guide pruning or merging decisions. Example: "The initial step involves quantifying expert importance with various criteria."
- Expert merging: Combining parameters of multiple experts into fewer experts to compress MoE modules while retaining knowledge. Example: "For expert merging, we need to identify both the target clusters and the interpolation weights."
- Expert pruning: Removing less important experts to reduce model size and memory footprint. Example: "expert pruning/merging including removing or merging a number of experts in MoE module."
- Gated Attention: An attention mechanism augmented with gating to improve non-linearity and sparsity characteristics. Example: "Gated Attention modules"
- Gated DeltaNet: A gated module variant (DeltaNet) used within the transformer blocks to improve efficiency or representation. Example: "Gated DeltaNet"
- Hybrid-attention MoE-based model: An MoE architecture that mixes different attention types (e.g., full and linear) across layers. Example: "Qwen3-Next is a hybrid-attention MoE-based model"
- Importance metric: A computed score (e.g., from activations) used to rank dimensions or experts for pruning. Example: "compute the importance metric."
- KL-divergence: A measure of divergence between two probability distributions commonly used as a distillation loss. Example: "We minimize the KL-divergence between teacher and student"
- Language modeling (LM) loss: The standard next-token prediction objective used to train LLMs. Example: "standard language modeling (LM) loss"
- Linear decay schedule: A training schedule where a loss weight decreases linearly over time. Example: "regulated by a linear decay schedule"
- Mixture-of-Experts (MoE): An architecture that routes inputs to a subset of specialized expert networks to scale model capacity efficiently. Example: "Mixture-of-Experts (MoE) has become a dominant architecture for scaling LLMs"
- Multi-token prediction (MTP) distillation: A distillation objective supervising multiple future tokens to improve backbone dynamics and speculative decoding efficiency. Example: "We further propose multi-token prediction (MTP) distillation, which yields consistent gains."
- Next-token knowledge distillation (NTP KD): Distillation where the student matches the teacher’s next-token distribution. Example: "hybridizing next-token knowledge distillation (NTP KD) with a standard language modeling (LM) loss"
- One-shot compression: A single-step compression of a model to its target architecture without intermediate stages. Example: "progressive pruning schedules outperform one-shot compression"
- Partial-preservation expert merging strategy: A merging scheme that keeps a subset of top experts intact while merging the rest into them to avoid homogenization. Example: "we introduce a simple partial-preservation expert merging strategy"
- Progressive pruning schedules: Multi-stage compression procedures that gradually reduce model capacity (depth/width/experts) for smoother optimization. Example: "progressive pruning schedules outperform one-shot compression"
- REAP (router-weighted expert output activation): An expert-importance criterion weighting expert outputs by router scores to guide pruning. Example: "router-weighted expert output activation (REAP)"
- RMSNorm: A normalization method that scales activations by their root mean square without centering. Example: "Qwen3-Next uses the RMSNorm normalizing function"
- Router logits: The unnormalized scores produced by the MoE router used to select top-k experts. Example: "router logits "
- Router weights: The parameters of the MoE router that produce gating/logit scores for expert selection. Example: "router weights and output activation among each expert."
- Routed experts: The subset of experts selected by the router for a given token. Example: "The router produces top-k gating scores over the routed experts."
- Shared experts: Experts that are always available (not gated by top-k) and receive a separate shared gate. Example: "we apply a separate shared gate for shared experts."
- Speculative decoding: A generation technique that drafts multiple tokens and verifies them to accelerate inference. Example: "for speculative decoding across both pretraining and supervised fine-tuning (SFT)"
- Structured pruning: Removing entire architectural components (layers, heads, experts) to yield real speedups without sparse kernels. Example: "Structured pruning and knowledge distillation (KD) are typical techniques for compressing LLMs"
- SwiGLU: An activation-gated MLP variant combining SiLU and linear paths, often used in transformer FFNs. Example: "Each expert is a SwiGLU MLP:"
- Top-k gating: Selecting the top-k experts per token based on router scores. Example: "The router produces top-k gating scores over the routed experts."
- Width pruning: Reducing hidden dimensions (e.g., in attention, FFN, normalization) to shrink model width. Example: "For width pruning, we reduce the hidden dimension across the entire architecture"
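To tie several of these terms together (router logits, top-k gating, routed vs. shared experts, SwiGLU), here is a minimal PyTorch sketch of an MoE layer. The dimensions, the dense loop over experts, and the sigmoid shared gate are illustrative simplifications, not Qwen3-Next's actual implementation.

```python
# Minimal sketch of an MoE layer illustrating glossary terms: router logits,
# top-k gating over routed experts, a separately gated shared expert, and
# SwiGLU experts. Shapes and the shared-gate form are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))   # SwiGLU MLP

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)  # produces router logits
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(num_experts))
        self.shared_expert = SwiGLUExpert(d_model, d_ff)            # always-on shared expert
        self.shared_gate = nn.Linear(d_model, 1, bias=False)        # separate shared gate
        self.top_k = top_k

    def forward(self, x):                            # x: [num_tokens, d_model]
        logits = self.router(x)                      # router logits
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):               # dispatch tokens to routed experts
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        out += torch.sigmoid(self.shared_gate(x)) * self.shared_expert(x)
        return out
```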