
Sparse Foundation Models

Updated 8 January 2026
  • Sparse foundation models are large-scale neural networks that activate only a subset of parameters per inference, enhancing efficiency and scalability.
  • They employ techniques like Mixture-of-Experts, sparse self-attention, and structured pruning to enable dynamic, token-specific computation.
  • These approaches lower FLOP and memory usage while maintaining robust performance across disciplines such as vision, language, and time series analysis.

A sparse foundation model is a large-scale neural network in which core operations or parameter updates are constrained—either structurally or dynamically—to involve only a subset of parameters, activations, or representational capacity at any given time, while preserving or enhancing the generality, scalability, and transfer capabilities characteristic of foundation models. In current practice, sparsity is achieved via architectural design (Mixture-of-Experts, sparse self-attention, or task-specific routing), pruning/adaptation at fine-tuning, or decompositional/adapter methods for compression and efficiency. These models, spanning vision, language, multi-modal, time series, and domain-specific tasks, now enable improved scalability, computational efficiency, and adaptability without forfeiting the broad coverage and flexibility of dense foundation models.

1. Architectural Principles and Sparsity Mechanisms

Sparse foundation models instantiate sparsity at either the model-architecture level or the fine-tuning/adaptation stage:

  • Sparse Mixture-of-Experts (MoE) Transformers: Core feed-forward layers in each transformer block are replaced by a set of $M$ "experts", parallel FFNs of equal dimension. A lightweight gating network dynamically selects the top-$K$ experts per token ($K \ll M$), so the model realizes a token-wise, sparse computation pattern; a minimal code sketch of this routing appears after this list. The output is

$y(z) = \sum_{i=1}^{M} g_i(z)\, E_i(z)$

where $g_i(z)$ is the gated (softmaxed, sparsified) affiliation of token $z$ to expert $i$ (Liu et al., 2024, Cai et al., 28 May 2025, Zhou et al., 1 Jan 2026).

  • Rule-Based Parameter Decomposition (Multi-Modality): In models such as Mixture-of-Transformers (MoT), parameters (e.g., QKV projections, FFNs, LayerNorm) are explicitly partitioned by modality (text, image, speech). Each token activates only the parameters specific to its own modality, effecting sparse parameter usage per sample (Liang et al., 2024).
  • Sparse Self-Attention and Tokenization: Approaches in hyperspectral (HyperSIGMA), vision (SparseFormer), and NeRF-based 3D models (DistillNeRF) employ learned sparse sampling in self-attention layers, or enforce sparse token representations by dynamically focusing on local, salient regions/patches rather than quadratic all-to-all attention (Wang et al., 2024, Gao et al., 2023, Wang et al., 2024); a minimal sketch of this token-selection pattern appears at the end of this section.
  • Row/Structured Sparse Fine-Tuning: Fine-tuning methods (SPruFT, SQFT) exploit precomputed importance metrics to select structured sparse update subspaces, e.g. pruning entire rows/neuron slices per layer, or enforcing a global sparsity mask on adapters (Li et al., 17 Feb 2025, Muñoz et al., 2024).
  • Sparse Updates and Multi-Task Merging: Model Breadcrumbs constructs sparse, masked task-delta updates between pre-trained and fine-tuned models, allowing low-overhead sparse merging for multi-task adaptation (Davari et al., 2023).
  • Sparse plus Low-Rank Decomposition: HASSLE-free decomposes dense foundation model weights per layer into a sum of a hardware-friendly sparse matrix (e.g. 2:4 pattern) and a low-rank component, optimizing layerwise local loss to achieve high compression with tolerable accuracy loss (Makni et al., 2 Feb 2025).
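
The following is a minimal, illustrative PyTorch sketch of the token-wise top-$K$ routing described in the MoE bullet above. The class and parameter names (SparseMoE, n_experts, top_k) and the naive per-expert dispatch loop are assumptions made for readability, not the implementation of any cited model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Token-wise top-K sparse mixture-of-experts feed-forward layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # lightweight router: logits = W_g z + b_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (n_tokens, d_model), i.e. tokens flattened across the batch
        logits = self.gate(z)                                  # (n_tokens, n_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)  # hard top-K selection
        weights = F.softmax(topk_vals, dim=-1)                 # renormalize over the K kept experts
        y = torch.zeros_like(z)
        for slot in range(self.top_k):                         # naive dispatch loop for clarity
            for e in range(len(self.experts)):
                mask = topk_idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    y[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](z[mask])
        return y
```

Only the $K$ selected expert FFNs execute for each token; the remaining expert parameters stay untouched, which is the source of the activated-parameter savings quoted for MoE foundation models.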

These patterns are adapted for diverse foundations, including transformers (language, vision, multi-modal), implicit fields (medical CT), and lifted 3D scenes.
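
As referenced in the sparse self-attention bullet above, the sketch below illustrates the generic "attend to a learned sparse subset of tokens" idea: a lightweight scorer keeps only the most salient tokens as keys/values, cutting attention cost from quadratic in the sequence length to roughly linear in the kept budget. This is a hedged illustration under simple assumptions (a single saliency head, a fixed token budget), not the specific sampling scheme of HyperSIGMA, SparseFormer, or DistillNeRF.

```python
import torch
import torch.nn as nn

class SparseTokenAttention(nn.Module):
    """Attend from all queries to a learned sparse subset of tokens (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int = 8, keep_tokens: int = 64):
        super().__init__()
        self.keep_tokens = keep_tokens
        self.score = nn.Linear(d_model, 1)  # learned per-token saliency score
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, d_model)
        k = min(self.keep_tokens, x.size(1))
        saliency = self.score(x).squeeze(-1)                  # (batch, n_tokens)
        idx = saliency.topk(k, dim=-1).indices                # indices of the k most salient tokens
        kv = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))  # (batch, k, d_model)
        out, _ = self.attn(x, kv, kv)                         # queries attend to k tokens, not all n
        return out                                            # cost O(n*k) instead of O(n^2)
```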

2. Theoretical Formulations and Training Algorithms

Mathematically, sparsity in foundation models is enforced through explicit top-$K$ selection over expert or parameter dimensions (hard-gating), percentile threshold masks over weight diffs (as in Breadcrumbs), or structured sparsity patterns (row, block, N:M):

  • MoE Layer: For each token embedding $z \in \mathbb{R}^D$, gating yields $g(z) = \text{Softmax}(\text{TopK}(W_g z + b_g))$. Combining only $K$ experts ensures parameter and compute sparsity per token (Liu et al., 2024, Cai et al., 28 May 2025, Zhou et al., 1 Jan 2026).
  • Sparse Adapter Merging: Adapters $L^p = (B\,A) \odot M$ are trained and merged such that the sparsity mask $M$ is preserved before and after quantization. Merging:

$W^p \leftarrow W^p + L^p$

Optionally, the merged weights are quantized as $\widehat{W}^p_m = \text{clamp}\big(\text{round}((W^p + L^p)/s) + z,\ 0,\ 2^{b_w-1}-1\big)$ (Muñoz et al., 2024); a minimal merge-and-quantize sketch appears below.

  • Model Breadcrumbs: A sparse delta mask per layer, $M_t^{\beta,\gamma}[i] = 1$ if $w_{(\beta)} \leq |\Delta\theta_t[i]| \leq w_{(\gamma)}$, is applied to each task's diff from the pre-trained weights. The aggregate merged model is

$\theta^* = \theta + \alpha \sum_{t=1}^{T} M_t^{\beta,\gamma} \circ \Delta\theta_t$

(Davari et al., 2023); a minimal sketch of this masked merge appears below.

  • Sparse plus Low-Rank Decomposition: Each layer $W$ is decomposed as $W \approx S + L$, with $S$ drawn from a hardware-friendly N:M sparse set and $L$ of rank $r$. Alternating minimization over $S$ and $L$ targets the layerwise local loss $\|XW - X(S + L)\|_F^2$ (Makni et al., 2 Feb 2025); an illustrative 2:4-plus-low-rank sketch appears in Section 5.
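
For concreteness, here is a minimal sketch of the sparse-adapter merge-and-quantize step referenced above. The per-tensor min/max scale, the zero-point computation, and the name merge_and_quantize are illustrative assumptions, not the SQFT implementation.

```python
import torch

def merge_and_quantize(W: torch.Tensor, B: torch.Tensor, A: torch.Tensor,
                       mask: torch.Tensor, b_w: int = 4):
    """Merge a sparsity-masked adapter into W, then uniformly quantize (illustrative)."""
    L = (B @ A) * mask                     # L^p = (B A) ⊙ M, delta confined to the sparse mask
    W_merged = W + L                       # W^p <- W^p + L^p; if M matches W's pattern, sparsity is preserved
    q_max = 2 ** (b_w - 1) - 1             # upper clamp value from the formula above
    w_min, w_max = W_merged.min(), W_merged.max()
    s = (w_max - w_min) / q_max            # per-tensor scale (assumed convention)
    z = torch.round(-w_min / s)            # zero-point so the minimum maps near 0
    W_q = torch.clamp(torch.round(W_merged / s) + z, 0, q_max)
    return W_q, s, z                       # dequantize with (W_q - z) * s
```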

Auxiliary objectives enforce load balancing across experts, ensure sparse utilization does not collapse, and sometimes add regularization on adapter or low-rank magnitude.
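
Complementing the formulas above, a minimal sketch of the Breadcrumbs-style masked merge: each task's weight delta is masked to the magnitude band $[w_{(\beta)}, w_{(\gamma)}]$ and added, scaled by $\alpha$, to the pre-trained weights. Treating $w_{(\beta)}$ and $w_{(\gamma)}$ as magnitude quantiles of each delta tensor, and the default values shown, are assumptions for illustration.

```python
import torch

def breadcrumbs_merge(theta_pre: torch.Tensor, theta_tasks: list, alpha: float = 0.3,
                      beta: float = 0.85, gamma: float = 0.99) -> torch.Tensor:
    """Merge multiple fine-tuned weight tensors into the pre-trained one via sparse deltas."""
    merged = theta_pre.clone()
    for theta_t in theta_tasks:
        delta = theta_t - theta_pre                  # task diff from the pre-trained weights
        mags = delta.abs()
        lo = torch.quantile(mags, beta)              # w_(beta): drops small, noisy entries
        hi = torch.quantile(mags, gamma)             # w_(gamma): drops outlier entries
        mask = (mags >= lo) & (mags <= hi)           # M_t^{beta,gamma}
        merged += alpha * (mask * delta)             # theta* = theta + alpha * sum_t M_t ∘ delta_t
    return merged
```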

3. Empirical Results and Computational Gains

Sparse foundation models consistently demonstrate improved computational efficiency, scalability, and often superior accuracy on representative tasks compared to size-matched dense baselines:

Model/Study | Domain | Main Sparsity Method | FLOP/Memory Savings | Reported Accuracy Effect | Reference
Moirai-MoE | Time series | MoE (K=2, M=32), token-wise | 28× fewer activated params (vs. large) | −8% MAE vs. largest dense | (Liu et al., 2024)
SparseFormer | Vision | Sparse tokens + RoI adapt. | 81–93% FLOP reduction | −1.2% ImageNet, <1% CLIP loss | (Gao et al., 2023)
HiDream-I1 | Generative vision | MoE (dynamic sparse DiT) | >33% faster (latency) | +10–20% prompt/human preference | (Cai et al., 28 May 2025)
Breadcrumbs | PEFT/NLP/Vision | Sparse task deltas, 15% density | >6× storage↓ | +5–10% multi-task avg. | (Davari et al., 2023)
Traffic-MoE | Network security | MoE, token-wise K=2 gating | 91.62% throughput↑ | +12.38% Macro-F1; 47% latency↓ | (Zhou et al., 1 Jan 2026)
HyperSIGMA | Hyperspectral | SSA (sparse attention, Np=8 tokens) | 80% attention FLOPs↓ | +15pp OA; <1.5pp adv. deg. | (Wang et al., 2024)
SQFT/SPruFT | LLMs, NLP | Structured mask adapters or rows | 20–30% memory↓ | ≈LoRA accuracy, <2% drop at 50% sparsity | (Muñoz et al., 2024, Li et al., 17 Feb 2025)

In time series (Moirai-MoE), in-distribution normalized MAE dropped by 17% in small models and zero-shot CRPS/MASE reached new state-of-the-art with only K=2 active experts per inference pass (Liu et al., 2024). In multi-modal generative models, MoT decouples parameters by modality, cutting cumulative training FLOPs by 45–70% while matching or exceeding dense performance (Liang et al., 2024). For hyperspectral scene interpretation, HyperSIGMA achieves >80% reduction in deep attention FLOPs and new SOTA on 16 benchmarks (Wang et al., 2024).

4. Handling Data Heterogeneity, Non-Stationarity, and Adaptation

Sparse foundation models are explicitly constructed to adapt to diverse statistical regimes and tasks:

  • Token/patch-level specialization ensures that local statistics—including non-stationarity and heterogeneous sub-patterns—trigger adaptive routing (e.g., Moirai-MoE dynamically clusters tokens exhibiting similar temporal signatures, regardless of frequency band) (Liu et al., 2024). Cluster-based expert gating grounds specialization in pretrained pattern manifolds.
  • In multi-task and PEFT regimes, row-based or mask-based sparseness enables per-task (or multi-task merged) updates, preserving a core foundation while supporting efficient community-contributed incremental adaptation (Davari et al., 2023, Li et al., 17 Feb 2025).
  • Cross-modal and domain-specific sparse backbones (HyperSIGMA, MoT) construct parallel or partitioned parameter spaces for distinct modalities, reducing interference and enabling efficient, robust transfer (Liang et al., 2024, Wang et al., 2024).
  • In mobile AR, sparsity is both at the data (sensed) and model (foundation inference, zero-shot geometry completion) levels, with foundation models compensating for information gaps induced by energy-efficient sensor throttling (Zhao et al., 4 Nov 2025).

5. Hardware Utilization, Model Compression, and Practical Efficiency

Sparse foundation models unlock substantial real-world efficiency gains:

  • Hardware Patterned Sparsity: Adopting structured sparsity patterns (e.g., 2:4 N:M) ensures compatibility with GPU and accelerator primitives. For instance, HASSLE-free's 2:4+64LR decomposition achieves up to 2× inference speedup without significant perplexity degradation (Makni et al., 2 Feb 2025); a minimal sketch follows this list.
  • Scalable Pretraining and Efficient Fine-tuning: Techniques including bootstrapped sparse token attention (SparseFormer), parameter decoupling (MoT), and hybrid un/freezing (SparseFormer, HiDream-I1) allow large models to be adapted or fine-tuned with limited hardware and energy expenditure (Gao et al., 2023, Cai et al., 28 May 2025).
  • Sparse Adapter Merging/Quantization: Methods such as SQFT ensure merged sparse adapters preserve cost and accuracy after quantization, producing deployable low-precision models (e.g., INT4) with <2% accuracy cost at 50% sparsity (Muñoz et al., 2024).
  • Dynamic Routing and Implicit Regularization: Sparse activation (MoE/k-active or token-patch selection) serves as implicit regularization, reducing overfitting and improving few-shot robustness, as demonstrated in time series, NLP, and traffic analysis scenarios (Liu et al., 2024, Zhou et al., 1 Jan 2026).
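
As referenced above, the sketch below illustrates a hardware-patterned 2:4 projection combined with a low-rank residual, in the spirit of the sparse-plus-low-rank decomposition of Section 2. It performs a single magnitude-based 2:4 projection and one truncated SVD of the residual (assuming the weight matrix's column count is divisible by 4), rather than HASSLE-free's full alternating minimization.

```python
import torch

def prune_2_to_4(W: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude entries in every group of 4 along the last dim."""
    out, cols = W.shape
    assert cols % 4 == 0, "2:4 pattern assumes the last dim is a multiple of 4"
    groups = W.reshape(out, cols // 4, 4)
    idx = groups.abs().topk(2, dim=-1).indices                # keep the 2 largest per group
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)    # 2:4 keep-mask
    return (groups * mask).reshape(out, cols)

def sparse_plus_lowrank(W: torch.Tensor, rank: int = 64):
    """Single-pass W ≈ S + L: hardware 2:4 sparse part plus a rank-r residual (illustrative)."""
    S = prune_2_to_4(W)                                       # hardware-friendly sparse component
    U, sig, Vh = torch.linalg.svd(W - S, full_matrices=False)
    L = (U[:, :rank] * sig[:rank]) @ Vh[:rank]                # rank-r approximation of the residual
    return S, L
```

At inference, $S$ can run on 2:4-capable sparse kernels while $L$ adds only two thin dense matmuls, which is where the reported speedups come from.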

6. Limitations, Open Challenges, and Future Directions

Key unresolved issues and open areas for sparse foundation models include:

  • Expert/pruning underutilization: In dynamic MoE architectures, some experts remain idle; further model compression via pruning, improved load-balancing losses, or hierarchical/online gating could increase efficiency (Liu et al., 2024).
  • Generalization under distribution shift: While token-level or task-level sparsity can improve robustness, static selection (e.g., row mask in SPruFT) may fail if task distribution evolves; adaptive or learned mask updating is an open direction (Li et al., 17 Feb 2025).
  • Sparsity and quantization interaction: At very high sparsity (>50–60%), accuracy drop becomes unacceptable in many domains. Techniques blending sparsity with low-rank, quantization, and efficient adapters require further theoretical and empirical investigation (Makni et al., 2 Feb 2025, Muñoz et al., 2024).
  • Broader modality and multi-modal coverage: Sparse attention (SSA, patch/ROI selection) shows promise in vision, but dynamic adaptation of sparsity budgets within and across modalities remains an active area (Wang et al., 2024, Liang et al., 2024).
  • Deployment in edge and real-time environments: Sparse foundation models such as those evaluated in on-device AR and network security offer compelling throughput/latency gains, but adaptation for highly resource-constrained chips, and robust handling of sensor drift, remain challenging (Zhao et al., 4 Nov 2025, Zhou et al., 1 Jan 2026).

Future research is expected to focus on adaptive and learned sparse routing, hierarchical MoE, fully modular sparse foundation model composition, and the integration of structured, data-driven sparsity across all layers and adaptation mechanisms.


Key References

  • "Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts" (Liu et al., 2024)
  • "Bootstrapping SparseFormers from Vision Foundation Models" (Gao et al., 2023)
  • "Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks" (Davari et al., 2023)
  • "Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion" (Chen et al., 7 Aug 2025)
  • "DeepSparse: A Foundation Model for Sparse-View CBCT Reconstruction" (Lin et al., 5 May 2025)
  • "Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models" (Liang et al., 2024)
  • "Traffic-MoE: A Sparse Foundation Model for Network Traffic Analysis" (Zhou et al., 1 Jan 2026)
  • "DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features" (Wang et al., 2024)
  • "An Efficient Row-Based Sparse Fine-Tuning" (Li et al., 17 Feb 2025)
  • "HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer" (Cai et al., 28 May 2025)
  • "SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models" (Muñoz et al., 2024)
  • "HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model" (Wang et al., 2024)
  • "Can Foundation Models Revolutionize Mobile AR Sparse Sensing?" (Zhao et al., 4 Nov 2025)
  • "HASSLE-free: A unified Framework for Sparse plus Low-Rank Matrix Decomposition for LLMs" (Makni et al., 2 Feb 2025)
