
Network of Theseus (NoT)

Updated 6 December 2025
  • Network of Theseus (NoT) is a paradigm for converting and compressing neural architectures by iteratively replacing modules while preserving overall functionality.
  • It employs progressive module replacing and representational alignment techniques, using metrics like CKA and D-MNN to ensure accurate knowledge transfer.
  • NoT enables efficient model deployment and cross-architecture conversion, facilitating compression and inductive bias transfer for various applications.

The Network of Theseus (NoT) paradigm is a set of techniques for converting one neural architecture into another by iteratively replacing individual network modules, such that the overall functionality of the system is preserved throughout the transformation. Drawing inspiration from the Ship of Theseus philosophical problem, NoT aims to methodically swap out all "parts" of a model—either to yield a compressed, more efficient version or even to effect cross-family architectural conversions—while maintaining or closely approximating the original behavior. NoT decouples training-phase inductive bias from deployment-phase efficiency, introducing a general mechanism for modular, curriculum-driven model conversion and knowledge transfer (Xu et al., 2020, Subramaniam et al., 3 Dec 2025).

1. Conceptual Foundations and Motivation

NoT was first instantiated in model compression scenarios, notably in "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing" (Xu et al., 2020), as well as in the more general architecture conversion context described in "Network of Theseus (like the ship)" (Subramaniam et al., 3 Dec 2025). The central problem addressed is the rigidity of standard deep learning pipelines, in which the architecture trained is the one deployed, regardless of its efficiency or inductive biases. By enabling a progressive replacement of modules—either with smaller, more efficient analogues or with fundamentally different architectures—NoT expands the design space for inference-time networks.

Motivations for NoT include:

  • Avoiding additional loss functions typical in knowledge distillation (KD) by relying solely on the task loss or alignment via internal activations.
  • Fostering deep, gradient-level interactions between original ("guide," "predecessor," or "teacher") modules and their replacements ("successor" or "target" modules).
  • Enabling regularization and curriculum learning effects via random or scheduled replacement.

2. Core Algorithms and Schedules

NoT encompasses two primary algorithmic paradigms:

  • Progressive Module Replacing (Compression context (Xu et al., 2020)):
    • Partition the network into $n$ modules $\{\mathit{prd}_1,\dots,\mathit{prd}_n\}$.
    • For each, construct a dimension-compatible compact successor $\mathit{scc}_i$.
    • During each training step, randomly replace $\mathit{prd}_i$ with $\mathit{scc}_i$ with a stage-dependent probability $p(t)$, increasing $p$ over time.
    • After full replacement, fine-tune the successor-only network.
  • Representational Alignment (Architecture conversion context (Subramaniam et al., 3 Dec 2025)):
    • Incrementally replace modules in a "guide" network $f^G$ with target architecture modules $f^T$, training new modules to align their layerwise activations with those of $f^G$ via representational similarity metrics (CKA, D-MNN).
    • Schedule replacements according to a staged or progressive curriculum: at each stage $t$, a set of layers $I_t$ is jointly aligned and replaced, holding unreplaced layers fixed.
    • After all replacements, perform supervised fine-tuning on the target architecture.
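
The random-replacement step of the compression variant can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: modules are ordinary callables on floats, and `replace_prob`, `sample_hybrid`, `forward`, and the linear form of $p(t)$ with its `base` and `slope` values are all hypothetical names and choices introduced here.

```python
import random
from typing import Callable, List, Sequence

Module = Callable[[float], float]  # toy stand-in for a network block


def replace_prob(step: int, base: float = 0.3, slope: float = 1e-3) -> float:
    """Assumed linear curriculum: p(t) = min(1, base + slope * t)."""
    return min(1.0, base + slope * step)


def sample_hybrid(
    predecessors: Sequence[Module],
    successors: Sequence[Module],
    p: float,
    rng: random.Random,
) -> List[Module]:
    """Independently swap each predecessor for its successor with probability p."""
    if len(predecessors) != len(successors):
        raise ValueError("each predecessor needs exactly one successor")
    return [
        scc if rng.random() < p else prd
        for prd, scc in zip(predecessors, successors)
    ]


def forward(modules: Sequence[Module], x: float) -> float:
    """Run the input through the chosen modules in order."""
    for module in modules:
        x = module(x)
    return x


# Toy example: three predecessor blocks and three stand-in successor blocks.
predecessors = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x - 3.0]
successors = [lambda x: x + 0.9, lambda x: 2.1 * x, lambda x: x - 3.1]

rng = random.Random(0)
for step in (0, 350, 10_000):
    p = replace_prob(step)
    hybrid = sample_hybrid(predecessors, successors, p, rng)
    # In real training, the task loss computed on the hybrid's output would
    # be backpropagated here; this sketch only evaluates the forward pass.
    print(step, round(p, 2), forward(hybrid, 1.0))
```

Once $p(t)$ reaches 1, every sampled hybrid is the successor-only network, which is the point at which the final fine-tuning step takes over.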

Table: NoT Schedules and Replacement Modes

Schedule Type | Description                            | Typical Outcome
--------------|----------------------------------------|------------------------
Progressive   | Add (or replace) modules cumulatively  | Best stability/accuracy
Sequential    | Replace modules one by one             | Lower performance
Joint         | Replace all at once                    | Instability
Independent   | Replace/retrain each layer in isolation | Lower accuracy
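
The four schedules differ mainly in which layers are trained at each stage and what feeds them. The sketch below encodes one reading of the table; the `Stage` record, the `stage_plan` function, the input-side ordering of replacements, and the `upstream` labels are assumptions made for illustration and are not specified by the source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    trained: Tuple[int, ...]  # layer indices aligned/replaced in this stage
    upstream: str             # what feeds them: "input", "target", or "guide"


def stage_plan(schedule: str, n_layers: int) -> List[Stage]:
    """Return the assumed per-stage training plan for a replacement schedule."""
    layers = tuple(range(n_layers))
    if schedule == "progressive":
        # Stage t jointly trains layers 0..t, so the replaced set grows cumulatively.
        return [Stage(layers[: t + 1], "input") for t in layers]
    if schedule == "sequential":
        # One new layer per stage, fed by the already replaced target layers.
        return [Stage((t,), "target" if t else "input") for t in layers]
    if schedule == "independent":
        # One layer per stage, fed by guide activations, ignoring other replacements.
        return [Stage((t,), "guide" if t else "input") for t in layers]
    if schedule == "joint":
        # A single stage that replaces and trains every layer at once.
        return [Stage(layers, "input")]
    raise ValueError(f"unknown schedule: {schedule!r}")


for name in ("progressive", "sequential", "independent", "joint"):
    print(name, [stage.trained for stage in stage_plan(name, 3)])
```

Under this reading, the progressive schedule is the only one in which earlier replacements keep adapting as later layers are added.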

3. Representational Similarity and Alignment Metrics

Alignment between intermediate activations is central in cross-architecture NoT. The principal metrics used are:

  • Centered Kernel Alignment (CKA): $g_{\mathrm{CKA}}(H, H^G) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}} \in [0,1]$, where $K$, $L$ are (centered) Gram matrices. CKA is both scale- and orthogonal-invariant, and empirically correlates with semantic similarity and downstream performance.
  • Differentiable Mutual Nearest Neighbors (D-MNN): Measures the overlap between nearest neighbor graphs formed from normalized features $\phi_i, \psi_i$ of the guide and target, and uses KL divergence between induced probability distributions of neighborhood structures.
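
As a concrete reading of the CKA formula, the NumPy sketch below computes it with linear-kernel Gram matrices. The choice of a linear kernel and the helper names `centered_gram`, `hsic`, and `cka` are assumptions; the HSIC normalization constant is omitted because it cancels in the ratio.

```python
import numpy as np


def centered_gram(h: np.ndarray) -> np.ndarray:
    """Linear-kernel Gram matrix of features h (n_samples x width), double-centered."""
    gram = h @ h.T
    n = gram.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return centering @ gram @ centering


def hsic(k: np.ndarray, l: np.ndarray) -> float:
    """Unnormalized HSIC, trace(K L), for symmetric centered Gram matrices."""
    return float(np.sum(k * l))


def cka(h: np.ndarray, h_guide: np.ndarray) -> float:
    """CKA between target activations h and guide activations h_guide."""
    k = centered_gram(h)
    l = centered_gram(h_guide)
    return hsic(k, l) / float(np.sqrt(hsic(k, k) * hsic(l, l)))


rng = np.random.default_rng(0)
guide_acts = rng.normal(size=(32, 8))    # 32 samples, guide layer width 8
target_acts = rng.normal(size=(32, 5))   # same samples, target layer width 5
print(cka(target_acts, guide_acts))
```

Because only the sample-by-sample Gram matrices are compared, the two layers may have different widths, which is what makes the metric usable across architecture families.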

Both metrics provide a differentiable alignment objective that directly enforces the geometric similarity of internal representations and allow the newly inserted target modules to pick up on the features responsible for the guide's predictive behavior (Subramaniam et al., 3 Dec 2025).
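
The neighborhood-based objective can be illustrated with a soft relaxation: each sample induces a softmax distribution over its neighbors, and the guide and target distributions are compared with KL divergence. This is one plausible reading of the description above and should not be taken as the D-MNN definition from the paper; the function names, the cosine-similarity softmax, the `temperature` value, and the KL direction (guide as reference) are all assumptions.

```python
import numpy as np


def neighbor_distribution(features: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Row-wise softmax over cosine similarities to the other samples.

    Assumes every feature row has nonzero norm and there are at least two rows.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)            # a sample is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def neighborhood_kl(
    guide_feats: np.ndarray,
    target_feats: np.ndarray,
    temperature: float = 0.1,
    eps: float = 1e-12,
) -> float:
    """Mean KL(guide neighborhoods || target neighborhoods) over samples."""
    if guide_feats.shape[0] != target_feats.shape[0]:
        raise ValueError("guide and target must be evaluated on the same samples")
    p = neighbor_distribution(guide_feats, temperature)
    q = neighbor_distribution(target_feats, temperature)
    off_diag = ~np.eye(p.shape[0], dtype=bool)
    p_off = np.clip(p[off_diag], eps, None)  # clipped for numerical safety
    q_off = np.clip(q[off_diag], eps, None)
    return float(np.sum(p_off * (np.log(p_off) - np.log(q_off))) / p.shape[0])


rng = np.random.default_rng(0)
guide = rng.normal(size=(16, 8))
target = rng.normal(size=(16, 4))
print(neighborhood_kl(guide, guide), neighborhood_kl(guide, target))
```

The loss is zero when both networks induce the same neighborhood structure and, like CKA, it depends only on relations between samples rather than on the layer widths.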

4. Empirical Results and Comparative Analysis

Empirical studies of NoT demonstrate robust preservation of functional performance across both compression and cross-family conversion settings.

BERT-of-Theseus Results (Xu et al., 2020):

  • Compressing 12-layer BERT-Base (110 M) to 6-layer (66 M) successor:
    • Inference speed-up: 1.94× on V100.
    • GLUE Dev Macro: 82.5 (base) → 81.2 (NoT).
    • GLUE Test Macro: 80.0 (base) → 78.6 (NoT).
    • Exceeds vanilla and Patient KD as well as LayerDrop pruning, despite no distillation/auxiliary losses.
    • Training resource: <20 GPU-hours vs 720 GPU-hours for DistilBERT KD.
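
The figures above can be turned into relative terms with a few lines of arithmetic. The snippet only reuses the numbers quoted in this list; the variable names are illustrative. The result is roughly 40% fewer parameters, about 98% of the baseline GLUE score retained on both splits, and at least a 36-fold reduction in training cost (a lower bound, since the NoT cost is reported as under 20 GPU-hours).

```python
# Figures quoted above for BERT-of-Theseus (Xu et al., 2020).
base_params_m, successor_params_m = 110, 66
glue_dev_base, glue_dev_not = 82.5, 81.2
glue_test_base, glue_test_not = 80.0, 78.6
not_gpu_hours_upper, distilbert_gpu_hours = 20, 720

# Fraction of parameters removed by going from 12 to 6 layers.
param_reduction = 1 - successor_params_m / base_params_m

# Share of the baseline GLUE macro score kept by the compressed model.
dev_retention = glue_dev_not / glue_dev_base
test_retention = glue_test_not / glue_test_base

# Lower bound on the training-cost ratio, because the NoT cost is "<20" GPU-hours.
min_cost_ratio = distilbert_gpu_hours / not_gpu_hours_upper

print(f"parameters removed: {param_reduction:.0%}")
print(f"GLUE dev retained:  {dev_retention:.1%}")
print(f"GLUE test retained: {test_retention:.1%}")
print(f"training cost ratio: at least {min_cost_ratio:.0f}x")
```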

Cross-architecture NoT Results (Subramaniam et al., 3 Dec 2025):

  • ResNet-18 → MLP (ImageNet CKA): 69.66 (guide) → 62.12 (NoT) vs 33.36 (naive).
  • DINOv2 ViT → Patch-MLP: 81.03 (guide) → 72.56 (NoT).
  • GPT-2 → RNN (Wikitext): CKA 37.50 (guide) → 50.58 (NoT), naive replacement collapses (121.19).
  • Progressive schedule outperforms all alternatives by 3–20 percentage points.
  • Even untrained guides enable significant performance transfer, indicating the inductive bias encapsulated in architectural choice.

5. Practical Applications and Extensions

NoT enables several new and existing architectural manipulations:

  • Compression: Modular replacement with compact, efficient blocks (e.g., pruning, quantization, lighter Transformer alternatives, ShuffleNet, Linformer).
  • Cross-family Conversion: Directly converting convolutional, self-attention, or feedforward architectures to novel targets (e.g., CNN to MLP, GPT to RNN).
  • Edge Deployment: Train with a heavy, or even untrained, but optimization-favorable guide, and convert to a resource-friendly inference model.
  • Inductive Bias Transfer: Impose structural priors (e.g., recurrence, lack of token mixing) after initial training.
  • A plausible implication is that future research may leverage NoT to enable architecture search or automated design pipelines unconstrained by training-time optimizability.

However, NoT generally requires input/output dimension compatibility for each replacement pair, which can necessitate adaptor layers for broader module classes, and the replacement curriculum (schedule) is often hand-designed for stability.

6. Limitations and Open Directions

Major constraints include:

  • Computational Resource Consumption: Multi-stage alignment and hyperparameter optimization can demand multiple-month, multi-GPU training budgets (e.g., ∼8 × H100 × 2 months) (Subramaniam et al., 3 Dec 2025).
  • Manual Tuning: Learning rates, number of epochs, and similarity thresholds are extensively tuned per replacement stage.
  • Theoretical Guarantees: Only coarse, Lipschitz-based error accumulation bounds are available; tighter non-asymptotic generalization analyses remain open.
  • Stateful Layers: Components such as BatchNorm require freezing or post-conversion recalibration to preserve inference semantics.
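
To give a sense of what a Lipschitz-based accumulation bound looks like, here is a generic telescoping argument. It is a standard illustration and is not taken from the cited papers. Assume the guide is $f^G = f_n \circ \dots \circ f_1$, the target is $f^T = g_n \circ \dots \circ g_1$, each guide layer $f_j$ is $L_j$-Lipschitz, and each replacement satisfies $\sup_z \|g_k(z) - f_k(z)\| \le \varepsilon_k$; the symbols $g_k$, $L_j$, $\varepsilon_k$, and the hybrid networks $h_k$ are introduced here for the sketch.

```latex
\begin{aligned}
h_k &= f_n \circ \dots \circ f_{k+1} \circ g_k \circ \dots \circ g_1,
  \qquad h_0 = f^G,\quad h_n = f^T, \\
\|h_k(x) - h_{k-1}(x)\| &\le \Big(\prod_{j=k+1}^{n} L_j\Big)\,\varepsilon_k, \\
\|f^T(x) - f^G(x)\| &\le \sum_{k=1}^{n} \|h_k(x) - h_{k-1}(x)\|
  \le \sum_{k=1}^{n} \varepsilon_k \prod_{j=k+1}^{n} L_j .
\end{aligned}
```

Each $h_k$ is the hybrid network after the first $k$ modules have been replaced. The bound is coarse because the products of Lipschitz constants can grow quickly with depth, even when every per-layer error $\varepsilon_k$ is small.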

Ongoing work includes the development of EMA-based learning rate and early stopping schemes, exploration of new alignment objectives (e.g., Wasserstein, subspace-matching), and dynamic, possibly learned, replacement schedules. Broader empirical evaluation on more diverse tasks (e.g., segmentation, multimodal modeling) is also a focus, as is the expansion of theory linking representational alignment and downstream generalization.

7. Summary and Significance

The Network of Theseus paradigm introduces a methodologically novel avenue for model conversion, compression, and architectural exploration in deep learning. By progressively aligning or replacing modules—using only the task loss in the compression case or with explicit representational similarity alignment for architectural changes—NoT achieves high accuracy retention even under radical changes to network topology. This approach directly challenges the entrenched paradigm that the architecture used for optimization must be that used for deployment, offering new flexibility in targeting efficiency, inductive bias, or hardware constraints at inference time (Xu et al., 2020, Subramaniam et al., 3 Dec 2025).
