Tangent Transformers for Composition, Privacy and Removal (2307.08122v3)
Abstract: We introduce Tangent Attention Fine-Tuning (TAFT), a method for fine-tuning linearized transformers obtained by computing a first-order Taylor expansion around a pre-trained initialization. We show that the Jacobian-vector product resulting from linearization can be computed efficiently in a single forward pass, reducing training and inference cost to the same order of magnitude as that of the original non-linear counterpart, while using the same number of parameters. Furthermore, we show that, when applied to various downstream visual classification tasks, the resulting Tangent Transformer fine-tuned with TAFT performs comparably to fine-tuning the original non-linear network. Since Tangent Transformers are linear with respect to the new set of weights, and the resulting fine-tuning loss is convex, TAFT enjoys several advantages over non-linear fine-tuning when it comes to model composition, parallel training, machine unlearning, and differential privacy. Our code is available at: https://github.com/tianyu139/tangent-model-composition
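The linearization the abstract describes can be sketched concretely: the tangent model is f(x; w0) + ∇_w f(x; w0) · (w − w0), which is linear in the fine-tuned weights and can be evaluated with a single forward-mode pass. The snippet below is a minimal illustrative sketch using torch.func (assuming PyTorch ≥ 2.0), not the authors' implementation; the names `tangent_forward`, `w0`, and `delta` are my own.

```python
# Minimal sketch of a "tangent" (linearized) model around a pre-trained
# initialization w0. Illustrative only; see the linked repository for TAFT itself.
import torch
from torch import nn
from torch.func import functional_call, jvp

def tangent_forward(model: nn.Module, w0: dict, delta: dict, x: torch.Tensor):
    """Evaluate f_lin(x) = f(x; w0) + J_w f(x; w0) @ delta in a single
    forward-mode pass, where delta = w - w0 holds the trainable weights."""
    def f(params):
        return functional_call(model, params, (x,))
    # jvp returns (f(x; w0), Jacobian-vector product J_w f(x; w0) @ delta)
    out, jvp_out = jvp(f, (w0,), (delta,))
    return out + jvp_out

# Toy usage: the output is linear in `delta`, so pairing it with a convex
# loss (e.g. squared error against one-hot labels) gives a convex objective.
model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 10))
w0 = {k: v.detach() for k, v in model.named_parameters()}
delta = {k: torch.zeros_like(v, requires_grad=True) for k, v in w0.items()}
x = torch.randn(4, 16)
logits = tangent_forward(model, w0, delta, x)  # shape (4, 10)
```

Because the output depends linearly on `delta`, weight-space operations behave predictably: averaging or summing `delta`s trained on separate data shards composes the corresponding models, and dropping a shard's `delta` removes its contribution, which is the intuition behind the composition and unlearning advantages claimed in the abstract.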