Stitchable Neural Networks (SN-Net)

Updated 17 April 2026

Stitchable Neural Networks are a paradigm that recombines fragments of pre-trained models using stitching layers to reconcile feature representations.
They enable dynamic trade-offs between accuracy and computational resources through adaptive model fusion and parameter-efficient tuning.
Their design supports flexible deployment across tasks like classification, segmentation, and personalization with minimal retraining.

Stitchable Neural Networks (SN-Net) are a general paradigm for constructing new neural networks by recombining, fusing, or adaptively mixing the layers or fragments of existing pre-trained models, often inserting lightweight trainable adapters, known as stitching layers, to reconcile feature dimension or representation mismatches. This enables efficient generation of a spectrum of models that interpolate between accuracy, computational resource usage, and architectural flexibility, supporting dynamic adaptation to deployment constraints and facilitating knowledge transfer between disparate sources.

1. Fundamental Principles and Definitions

A Stitchable Neural Network (SN-Net, Editor's term) is constructed by partitioning one or more pre-trained “anchors” (networks of the same or different architectures) at user- or algorithm-selected cut-points and joining their fragments via small parametric layers ("stitches" or "stitching layers") that map feature representations between upstream and downstream blocks. The resulting hybrid network executes as a single end-to-end model; the stitching layers are typically trained on domain-relevant data to restore compatibility and optimize task performance (Pan et al., 2023, Wang et al., 8 Jun 2025).

Formally, given two anchors $A_i$ and $A_j$ (e.g., with layerwise functions $f^{(i)}_1,\dots,f^{(i)}_{L_i}$), a typical SN-Net $F_{i \rightarrow j, (\ell, m)}$ is defined as:

$F_{i \rightarrow j, (\ell, m)}(x) = T_{\theta^{(j)}, m} \circ S_{i \rightarrow j, (\ell, m)} \circ H_{\theta^{(i)}, \ell}(x)$

where $H_{\theta^{(i)},\ell}$ is the head subnetwork up to layer $\ell$ of $A_i$, $T_{\theta^{(j)}, m}$ is the tail subnetwork from layer $m+1$ of $A_j$ to its output, and $S_{i \rightarrow j, (\ell, m)}$ is the stitching layer bridging their respective activations (Pan et al., 2023).
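The composition of head, stitching layer, and tail can be illustrated with a minimal NumPy sketch. The toy ReLU MLP anchors, the helper names (`make_mlp`, `run_layers`, `stitched_forward`), and the random, untrained stitching matrix are all assumptions made for illustration; they are not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(widths, rng):
    """Return a list of (W, b) layers for a toy ReLU MLP (hypothetical anchor)."""
    return [(rng.standard_normal((d_in, d_out)) * 0.1, np.zeros(d_out))
            for d_in, d_out in zip(widths[:-1], widths[1:])]

def run_layers(layers, x):
    """Apply each affine layer followed by a ReLU."""
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)
    return x

def stitched_forward(anchor_i, anchor_j, stitch, ell, m, x):
    """F(x) = T(S(H(x))): head of A_i, stitching layer, tail of A_j."""
    h = run_layers(anchor_i[:ell], x)      # H: layers 1..ell of A_i
    s = h @ stitch                         # S: linear map between feature spaces
    return run_layers(anchor_j[m:], s)     # T: layers m+1..L_j of A_j

anchor_small = make_mlp([8, 16, 16, 4], rng)   # A_i with 16-wide hidden features
anchor_large = make_mlp([8, 32, 32, 4], rng)   # A_j with 32-wide hidden features
stitch = rng.standard_normal((16, 32)) * 0.1   # untrained 16 -> 32 stitching matrix

x = rng.standard_normal((5, 8))
y = stitched_forward(anchor_small, anchor_large, stitch, ell=1, m=1, x=x)
print(y.shape)
```

In practice the stitching matrix would be initialized and trained rather than drawn at random; the sketch only shows how the three pieces execute as one model.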

StitchNet (Teerapittayanon et al., 2023) further generalizes this concept, allowing composite chains of more than two fragments with a stitching layer between each adjacent pair, so that a single network can be assembled from fragments of multiple distinct model families.

Cross-stitch networks, a form of SN-Net in the multi-task setting, employ "cross-stitch units" that train a linear, per-channel mixing of activations between task-specific subnetworks, enabling a continuum between shared and independent representations (Misra et al., 2016).

2. Stitching Layers: Mathematical Formulation and Initialization

The stitching layer $S_{i \rightarrow j, (\ell, m)}$ is typically a linear transformation:

For convolutional features: a $1 \times 1$ convolution mapping the local feature vectors from head $A_i$ at layer $\ell$ to tail $A_j$ at layer $m$ (Pan et al., 2023, Pan et al., 2023).
For fully connected features: a dense linear map.
For sequence models (e.g. Transformers): a linear adapter between token feature spaces.
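A short sketch may help show why the convolutional and dense cases are the same operation: a 1×1 convolution applies one linear map to the channel vector at every spatial position. The function name `stitch_1x1_conv` and the tensor shapes are illustrative assumptions.

```python
import numpy as np

def stitch_1x1_conv(features, weight, bias=None):
    """Apply a 1x1 convolution to features of shape (batch, C_in, H, W).

    weight has shape (C_out, C_in); bias, if given, has shape (C_out,).
    """
    out = np.einsum("oc,bchw->bohw", weight, features)
    if bias is not None:
        out = out + bias[None, :, None, None]
    return out

rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 6, 4, 4))   # head features with 6 channels
W = rng.standard_normal((10, 6))            # maps 6 channels to 10 channels
out = stitch_1x1_conv(feats, W)

# Equivalent dense map applied to each spatial location's channel vector.
flat = feats.transpose(0, 2, 3, 1).reshape(-1, 6)            # (B*H*W, C_in)
dense = (flat @ W.T).reshape(2, 4, 4, 10).transpose(0, 3, 1, 2)
print(np.allclose(out, dense))
```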

Initialization is critical for compatibility:

The standard approach uses least-squares fitting on a small sample batch, solving $\min_{M} \lVert X M - Y \rVert_F$ where $X$, $Y$ are paired activations from anchors at the relevant layers (Pan et al., 2023, Pan et al., 2023).
In neuroevolution and cross-model stitching, Kaiming initialization or zero-initialized biases are used, with subsequent training restoring fidelity to original activations (Guijt et al., 2024).
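The least-squares initialization can be sketched with `numpy.linalg.lstsq`. The symbols `X`, `Y`, and `M`, the synthetic activations, and the helper name `least_squares_stitch` are assumptions for illustration; real activations would come from running both anchors on the same input batch.

```python
import numpy as np

def least_squares_stitch(X, Y):
    """Solve min_M ||X M - Y||_F for paired activations X (n, d_in), Y (n, d_out)."""
    M, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return M

rng = np.random.default_rng(0)
n, d_in, d_out = 256, 16, 32

# Synthetic stand-ins for paired activations of the two anchors.
X = rng.standard_normal((n, d_in))
M_true = rng.standard_normal((d_in, d_out))
Y = X @ M_true + 0.01 * rng.standard_normal((n, d_out))

M_init = least_squares_stitch(X, Y)
rel_err = np.linalg.norm(X @ M_init - Y) / np.linalg.norm(Y)
print(M_init.shape, rel_err)
```

For a full-rank `X`, the solution coincides with the pseudoinverse form `np.linalg.pinv(X) @ Y`; `lstsq` is used here because it avoids forming the pseudoinverse explicitly.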

For multi-task cross-stitch networks, the stitching is implemented via learned mixing matrices $\alpha$, with per-channel $T \times T$ matrices interpolating between $T$ tasks or modalities (Misra et al., 2016).
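A cross-stitch unit can be sketched as one small mixing matrix per channel, applied across the task axis. The array layout (`tasks, batch, channels`), the name `cross_stitch`, and the two example settings of the mixing weights are assumptions chosen to show the two ends of the shared/independent continuum.

```python
import numpy as np

def cross_stitch(activations, alpha):
    """Mix task activations channel by channel.

    activations: (T, batch, C); alpha: (C, T, T).
    out[t, b, c] = sum_s alpha[c, t, s] * activations[s, b, c]
    """
    return np.einsum("cts,sbc->tbc", alpha, activations)

T, B, C = 2, 3, 4
rng = np.random.default_rng(0)
acts = rng.standard_normal((T, B, C))

independent = np.broadcast_to(np.eye(T), (C, T, T))   # no sharing between tasks
shared = np.full((C, T, T), 1.0 / T)                  # every task sees the average

out_ind = cross_stitch(acts, independent)
out_shared = cross_stitch(acts, shared)
print(np.allclose(out_ind, acts), np.allclose(out_shared[0], out_shared[1]))
```

Learned mixing weights would sit between these two extremes, with a separate value for each channel.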

Recent work introduces low-rank adaptation (LoRA) of stitching matrices, where the update to $M$ is decomposed as $\Delta M = BA$ (with $B \in \mathbb{R}^{d_1 \times r}$, $A \in \mathbb{R}^{r \times d_2}$), reducing memory and increasing regularization for downstream adaptation (Pan et al., 2023, He et al., 2023).
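The parameter saving from a low-rank update can be checked with a small sketch. It assumes a frozen stitching weight `M0` applied to row-vector features, a rank-`r` update `B @ A`, and illustrative widths of 384 and 768; the zero initialization of `B` follows common LoRA practice and is not confirmed by the source for this setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 384, 768, 8                      # illustrative widths and rank

M0 = rng.standard_normal((d_in, d_out)) * 0.02    # frozen stitching weight
B = np.zeros((d_in, r))                           # zero init: the update starts at 0
A = rng.standard_normal((r, d_out)) * 0.01        # small random init

def lora_stitch(x, M0, B, A):
    """Apply x (M0 + B A) without materializing the full update matrix."""
    return x @ M0 + (x @ B) @ A

x = rng.standard_normal((4, d_in))
y = lora_stitch(x, M0, B, A)

full_params = M0.size            # 384 * 768 = 294912
lora_params = B.size + A.size    # 384 * 8 + 8 * 768 = 9216
print(full_params, lora_params)
```

With these sizes only `B` and `A` would be trained, which is 32 times fewer parameters than the full stitching matrix.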

3. Selection of Stitch Points and Assembly Algorithms

Choosing optimal cut-locations (stitch points) is central. Model compatibility is estimated using Centered Kernel Alignment (CKA):

Given flattened activations $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ for a batch of $n$ samples:

$\mathrm{CKA}(X, Y) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}}$

where $K = X X^\top$, $L = Y Y^\top$, $\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^2}\mathrm{tr}(K H L H)$, $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ (Teerapittayanon et al., 2023, Wang et al., 8 Jun 2025).
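A minimal sketch of linear CKA, assuming the standard HSIC-based definition with linear Gram matrices and a centering matrix; the function name `linear_cka` and the random test activations are illustrative.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X (n, p) and Y (n, q)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    K, L = X @ X.T, Y @ Y.T                    # linear Gram matrices

    def hsic(A, B):
        return np.trace(A @ H @ B @ H) / (n - 1) ** 2

    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))   # random orthogonal matrix

same = linear_cka(X, X)
rotated = linear_cka(X, 3.0 * X @ Q)                 # rotation plus isotropic scaling
unrelated = linear_cka(X, rng.standard_normal((64, 10)))
print(same, rotated, unrelated)
```

The score is invariant to orthogonal transformations and isotropic scaling of either input, which is why it is a convenient compatibility measure between layers of different networks.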

The pairwise CKA matrix is computed for all possible cut locations to maximize compatibility under parameter/FLOPs/resource constraints. Algorithmic search includes:

Greedy or recursive breadth/depth-first search retaining high-compatibility choices (Teerapittayanon et al., 2023).
CKA-guided selection under budget constraints, such as a cap on parameter count or FLOPs (Wang et al., 8 Jun 2025).
Fragment chains/ensembles (StitchNet) via dynamic programming on compatibility matrices (Teerapittayanon et al., 2023).
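As a simplified illustration of budget-constrained selection, the sketch below scans a pairwise CKA matrix and returns the most compatible cut pair whose combined cost fits a budget. It is an exhaustive search under assumed inputs, not the algorithm of any cited paper; the CKA values and costs are made-up numbers, and indices are 0-based.

```python
import numpy as np

def select_stitch_point(cka, head_cost, tail_cost, budget):
    """Return (ell, m) maximizing cka[ell, m] subject to
    head_cost[ell] + tail_cost[m] <= budget, or None if nothing fits."""
    total = head_cost[:, None] + tail_cost[None, :]
    masked = np.where(total <= budget, cka, -np.inf)
    if not np.isfinite(masked).any():
        return None
    ell, m = np.unravel_index(np.argmax(masked), masked.shape)
    return int(ell), int(m)

# Illustrative values only, not measurements.
cka = np.array([[0.90, 0.60, 0.30],
                [0.70, 0.80, 0.50],
                [0.40, 0.60, 0.95]])
head_cost = np.array([1.0, 2.0, 3.0])   # cost of the head of A_i cut at each index
tail_cost = np.array([6.0, 4.0, 2.0])   # cost of the tail of A_j cut at each index

print(select_stitch_point(cka, head_cost, tail_cost, budget=5.0))
print(select_stitch_point(cka, head_cost, tail_cost, budget=4.0))
print(select_stitch_point(cka, head_cost, tail_cost, budget=2.0))
```

Greedy, recursive, or dynamic-programming searches become necessary once chains of more than two fragments are allowed, because the number of candidates grows quickly with chain length.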

In SN-Netv2, the stitching space is enlarged by supporting two-way (fast→slow, slow→fast, and multi-stage) traversals and resource-constrained sampling, to improve coverage across the FLOPs–accuracy spectrum (Pan et al., 2023).

4. Training Paradigms and Adaptation Strategies

SN-Net training regimes are task- and complexity-dependent:

Partial fine-tuning: Only the parameters of stitching layers are updated; all anchor subnetworks remain frozen, enabling rapid adaptation and preserving pre-trained features (Wang et al., 8 Jun 2025).
Full fine-tuning: All parameters—including the anchors and stitches—are updated; this yields superior performance for complex tasks and dense predictions (Pan et al., 2023).
LoRA/Parameter-efficient approaches: Only low-rank updates and stitch-specific biases are trained, drastically reducing memory and storage overhead (He et al., 2023, Pan et al., 2023).
Task-adaptive sampling: Stitch instances likely to fall on the Pareto frontier of accuracy–efficiency are sampled more frequently, using SNIP-based gradient saliency tracking to boost efficient coverage of the deployment space (He et al., 2023).
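The idea of sampling some stitches more often than others can be sketched with weighted random choice. The importance scores here are hypothetical placeholders; how such scores are derived (for example from gradient saliency) is not reproduced, and the function name `sample_stitches` is an assumption.

```python
import random
from collections import Counter

def sample_stitches(scores, num_steps, seed=0):
    """Draw one stitch id per training step, with probability proportional to its score."""
    rng = random.Random(seed)
    ids = list(scores)
    weights = [scores[i] for i in ids]
    return rng.choices(ids, weights=weights, k=num_steps)

# Hypothetical importance scores; a higher score means the stitch is trained more often.
scores = {"stitch_0": 1.0, "stitch_1": 1.0, "stitch_2": 6.0}
draws = sample_stitches(scores, num_steps=8000)
counts = Counter(draws)
print(counts)
```

With these scores, `stitch_2` is expected to receive about three quarters of the training steps.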

Loss functions are context-dependent:

For standard classification, cross-entropy plus optional distillation from a strong teacher (Pan et al., 2023).
For stitching two networks, matching is typically by mean squared error between stitched output and original target activations (Guijt et al., 2024, Guijt et al., 19 Dec 2025).
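Training only the stitching layer against target activations with a mean squared error loss can be sketched as plain gradient descent. The synthetic activations, the learning rate, and the step count are assumptions; the anchors are represented only by their fixed activations `X` and `Y`, which is what freezing them amounts to.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 512, 8, 12
X = rng.standard_normal((n, d_in))             # activations from the frozen head
Y = X @ rng.standard_normal((d_in, d_out))     # target activations at the tail's input

M = np.zeros((d_in, d_out))                    # stitching weight: the only trainable parameter
lr = 1.0
losses = []
for step in range(200):
    residual = X @ M - Y
    losses.append(float(np.mean(residual ** 2)))
    grad = 2.0 * X.T @ residual / residual.size   # gradient of the MSE with respect to M
    M -= lr * grad

print(losses[0], losses[-1])
```

Because the targets in this toy setup are an exact linear function of the inputs, the loss decays toward zero; with real activations a nonzero residual would remain.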

5. Experimental Results and Empirical Trade-offs

Reported experiments show how SN-Net traverses the resource–accuracy trade-off curve:

ImageNet-1K classification: Stitching DeiT-Ti/S/B anchors yields a near-linear, interpolated FLOPs vs. Top-1 accuracy frontier, strictly covering the range between smallest and largest anchors (Pan et al., 2023, Pan et al., 2023).
Flexible Pareto improvement: On semantic segmentation and depth estimation tasks (ADE20K, COCO-Stuff-10K, NYUv2), stitching adapters support smooth interpolation, sometimes exceeding individual anchor performance at certain budgets (Pan et al., 2023).
On-the-fly personalization: Construction of accurate task-specific models using minimal data and a fragment pool—achieving up to 95% accuracy on binary classification tasks with a 90% reduction in compute and required examples compared to fine-tuning (Teerapittayanon et al., 2023).
Federated/asynchronous scenarios: SN-Net adapters between separately trained models on medical datasets close ~80% of the generalization gap toward a central “merge” model, with as little as 5–10% compute overhead over simple ensembles (Guijt et al., 19 Dec 2025).

6. Extensions: Multi-architecture, Neuroevolution, and Efficient Adaptation

SN-Net’s core paradigm has been extended in several key directions:

Heterogeneous stitching: Bridging CNN–Transformer or different architectural families using compatible adapters (e.g., 2D convolution followed by linear projection), allowing for broad model fusion (Wang et al., 8 Jun 2025, Pan et al., 2023).
Neuroevolution: Assembly of stitched “supernetworks” from parental graphs using acyclic matchings, enabling efficient offspring extraction and parallel Pareto optimization in (accuracy, compute) space; offspring can outperform both parents in key metrics (Guijt et al., 2024).
Efficient task adaptation (ESTA): Application of LoRA PEFT, stitch-agnostic updates, and resource-aware stitch sampling enables drastic reduction in fine-tuning GPU-hours (e.g., 5.0 h ESTA vs. 19.3 h SN-Net for 25 tasks), memory (9.7 GB vs. 13.2 GB), and trainable parameters (4.6 M vs. 124.2 M); adaptation to LLMs further demonstrates domain generality (He et al., 2023).
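Selecting candidates in (accuracy, compute) space reduces to a non-dominated filter, sketched below. The candidate names and numbers are hypothetical and are not results from the cited works.

```python
def pareto_front(candidates):
    """Keep candidates not dominated by another one.

    candidates: list of (name, accuracy, flops). A candidate is dominated if
    another has accuracy >= and flops <=, with at least one strict inequality.
    """
    front = []
    for name, acc, flops in candidates:
        dominated = any(
            (a >= acc and f <= flops) and (a > acc or f < flops)
            for _, a, f in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (accuracy, GFLOPs) values for two parents and three stitched offspring.
candidates = [
    ("parent_a", 0.72, 1.3),
    ("parent_b", 0.81, 4.6),
    ("child_1", 0.78, 2.4),
    ("child_2", 0.70, 2.0),   # dominated by parent_a
    ("child_3", 0.82, 4.6),   # dominates parent_b
]
print(pareto_front(candidates))
```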

7. Limitations, Open Challenges, and Future Directions

Compatibility constraints: Effective stitching requires compatible feature sizes and architectural motifs. Dissimilar anchors may require more expressive adapters or stages of domain alignment (Teerapittayanon et al., 2023, Guijt et al., 19 Dec 2025).
Sampling/scheduling: Uniform sampling across many possible stitch points under-trains rare or extreme stitches. Balanced or importance-weighted sampling schemes (e.g., ROS in SN-Netv2) are critical for full coverage (Pan et al., 2023).
Storage/computation overhead: Storage is reduced compared to storing many independent models, but for large stitches or multi-LLM scenarios, memory requirements remain challenging without PEFT techniques (He et al., 2023).
Theoretical understanding: While feature re-alignment via linear adapters is empirically effective, the theory underlying inter-anchor representation compatibility and the limits of linear reparameterization merit further analysis (Teerapittayanon et al., 2023).
Emerging extensions: Nonlinear, spatially-varying, or hierarchical stitching layers, as well as more advanced fine-tuning and distillation routines, are prospective research directions (Misra et al., 2016, He et al., 2023).

Stitchable Neural Networks constitute a scalable, data-efficient, and highly elastic approach for leveraging the proliferating zoo of pre-trained models, enabling practitioners to rapidly generate models matched to dynamic resource, latency, or accuracy demands in both classical and emerging deployment contexts.