- The paper presents a method that decouples training from inference by progressively replacing network components while preserving performance.
- It uses representational similarity metrics to align new modules with original activations, enabling flexible architectural transformations such as CNN-to-MLP conversions.
- Experimental results on ImageNet and Wikitext-103 demonstrate significant performance retention, paving the way for deployment-centric neural network designs.
Network of Theseus: Decoupling Training from Inference in Neural Networks
Introduction
Deep learning conventionally assumes that the architecture used during training is also the architecture used at inference, so the same inductive biases apply throughout. This assumption rules out architectures that are efficient or otherwise attractive at deployment but difficult to optimize directly. The paper "Network of Theseus" (2512.04198) challenges this premise with a method of the same name, abbreviated NoT. NoT progressively transforms a neural network, either pre-trained or untrained, into a different target architecture by systematically replacing its components while preserving the network's performance. Representational similarity metrics align each replacement component with its original counterpart, which makes substantial architectural shifts possible, such as converting a convolutional network into a multilayer perceptron or GPT-2 into an RNN. By separating optimization from deployment, NoT opens up a broader space of architectural designs and may unlock better accuracy-efficiency trade-offs.
Methodology
NoT incrementally replaces components of a guide network with components from a target architecture. At each stage, the new modules are optimized to match the original network's activations under a representational similarity metric. This alignment is carried out primarily at the layer level so that each replacement preserves the guide network's performance. Once every component has been replaced, the resulting target architecture is fine-tuned on the desired downstream tasks. The paper demonstrates several such transformations: converting CNN layers into fully connected MLPs, adapting vision transformers into token-wise MLPs, and reshaping transformer architectures into RNNs. It also shows that even untrained guide networks can supply inductive biases that help the conversion.
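The layer-by-layer replacement described above can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration rather than the paper's procedure: the summary does not say which similarity metric NoT uses, so linear CKA stands in as one common choice; the guide network is a small stack of random tanh layers; and each replacement module is an affine map fitted by least squares instead of by gradient-based optimization. The names `linear_cka`, `fit_affine`, and `progressive_replace` are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))


def fit_affine(h, target):
    """Fit an affine map h -> target by least squares and return it as a callable."""
    h1 = np.hstack([h, np.ones((h.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(h1, target, rcond=None)
    weight, bias = coef[:-1], coef[-1]
    return lambda z: z @ weight + bias


def progressive_replace(guide_layers, inputs):
    """Replace guide layers front to back, aligning each new module to the guide."""
    # Record the guide network's activations after every layer.
    guide_acts = [inputs]
    for layer in guide_layers:
        guide_acts.append(layer(guide_acts[-1]))

    hybrid = list(guide_layers)  # starts as the guide, ends as the target
    scores = []
    h = inputs  # activations produced by the already-replaced prefix
    for i in range(len(hybrid)):
        # Assumption: least squares stands in for similarity-driven training.
        hybrid[i] = fit_affine(h, guide_acts[i + 1])
        h = hybrid[i](h)
        scores.append(linear_cka(h, guide_acts[i + 1]))
    return hybrid, scores


rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((8, 8)) for _ in range(3)]
guide_layers = [lambda z, w=w: np.tanh(z @ w) for w in weights]  # toy guide
inputs = rng.standard_normal((256, 8))

target_layers, cka_scores = progressive_replace(guide_layers, inputs)
print(cka_scores)  # one similarity score per replaced layer
```

Each new module is fitted on the activations produced by the modules already replaced, not on the guide's own inputs, so errors from earlier replacements are visible to later ones. A real setting would use trainable target modules and would follow the full replacement with fine-tuning on the downstream task.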
Experimental Results
In empirical evaluations, NoT improved substantially over naive architectural replacement strategies, retaining much of the guide network's original performance despite drastic architectural changes. The experiments covered image classification and language modeling on datasets including ImageNet and Wikitext-103. Notably, replacing the convolutional layers of ResNet-18 with linear layers, which yields an MLP, produced a large performance gain and nearly closed the gap to the original network. Similarly, transformations from DINOv2 models to Patch-MLPs and from GPT-2 to RNNs preserved a significant portion of the original models' accuracy, highlighting NoT's versatility across architectures.
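One reason a linear layer can stand in for a convolutional one is that a convolution is itself a linear map with a sparse, weight-shared matrix. The sketch below is a general illustration and not the paper's conversion procedure. It builds that matrix for a 1D, stride-1, unpadded cross-correlation and checks it against `np.correlate`. The helper `conv1d_as_matrix` and the example kernel and signal are hypothetical.

```python
import numpy as np


def conv1d_as_matrix(kernel, input_len):
    """Dense matrix that applies a valid, stride-1 1D cross-correlation."""
    k = len(kernel)
    out_len = input_len - k + 1
    matrix = np.zeros((out_len, input_len))
    for i in range(out_len):
        # Each output position reuses the same kernel, shifted by one step.
        matrix[i, i:i + k] = kernel
    return matrix


kernel = np.array([1.0, 0.0, -1.0])
signal = np.arange(8, dtype=float) ** 2
dense = conv1d_as_matrix(kernel, len(signal))

conv_out = np.correlate(signal, kernel, mode="valid")
dense_out = dense @ signal
print(np.allclose(conv_out, dense_out))
print(kernel.size, "conv weights vs", dense.size, "dense weights")
```

The dense layer can represent the convolution exactly but has far more free parameters and no built-in weight sharing, which is consistent with the summary's point that such targets are hard to optimize directly and benefit from a guide.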
Implications and Future Directions
NoT has both practical and theoretical implications. Practically, it offers a way to develop inference architectures optimized for specific deployment scenarios without restarting training for each design iteration. Theoretically, it challenges the assumption that an architecture must remain fixed after training, suggesting that optimization-centric design can be supplemented or replaced by deployment-focused design. Future research could explore which guide and target architectures are best suited to particular optimization and deployment goals, and could refine the method to automate and generalize the transformation across broader architecture families.
Conclusion
Network of Theseus changes how neural network architectures can be understood and used by breaking the conventional link between training and inference configurations. This flexibility invites a rethinking of architectural design principles in favor of efficient deployment. The authors note limitations, including the resource-intensive nature of the method and the need for refined strategies to manage computational cost. Even so, NoT's shift in perspective is a step toward more flexible, efficient, and specialized AI systems, and it lays the groundwork for future work on neural architecture design and deployment.