PIDformer: Transformer Meets Control Theory (2402.15989v1)
Abstract: In this work, we address two main shortcomings of transformer architectures: input corruption and rank collapse in their output representation. We unveil self-attention as an autonomous state-space model that inherently promotes smoothness in its solutions, leading to lower-rank outputs and diminished representation capacity. Moreover, the steady-state solution of the model is sensitive to input perturbations. We incorporate a Proportional-Integral-Derivative (PID) closed-loop feedback control system with a reference point into the model to improve robustness and representation capacity. This integration aims to preserve high-frequency details while bolstering model stability, rendering it more noise-resilient. The resulting controlled state-space model is theoretically proven robust and adept at addressing rank collapse. Motivated by this control framework, we derive a novel class of transformers, the PID-controlled Transformer (PIDformer), aimed at improving robustness and mitigating the rank-collapse issue inherent in softmax transformers. We empirically evaluate the model's advantages and robustness against baseline transformers across various practical tasks, including object classification, image segmentation, and language modeling.
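To make the closed-loop idea concrete, below is a minimal PyTorch-style sketch of how a PID feedback term could augment a standard self-attention layer. It assumes the reference signal `f` is the layer-stack input embedding and uses hypothetical gains `kp`, `ki`, `kd` and PyTorch's built-in `nn.MultiheadAttention`; it illustrates the control loop described in the abstract, not the authors' exact PIDformer implementation.

```python
import torch
import torch.nn as nn


class PIDFeedbackAttention(nn.Module):
    """Illustrative self-attention block with a PID-style feedback correction.

    The error e = f - x (reference minus current representation) is fed back through
    proportional, integral (running sum across layers), and derivative (difference of
    consecutive errors) terms. Gains kp, ki, kd are hypothetical hyperparameters.
    """

    def __init__(self, dim, num_heads=8, kp=1.0, ki=0.1, kd=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.kp, self.ki, self.kd = kp, ki, kd

    def forward(self, x, f, err_integral, err_prev):
        # Standard softmax self-attention on the current representation.
        attn_out, _ = self.attn(x, x, x, need_weights=False)

        # Closed-loop feedback: compare the representation to the reference f
        # and accumulate the error across layers.
        err = f - x
        err_integral = err_integral + err
        err_derivative = err - err_prev

        control = self.kp * err + self.ki * err_integral + self.kd * err_derivative
        x = x + attn_out + control  # residual update plus PID correction
        return x, err_integral, err


if __name__ == "__main__":
    batch, seq_len, dim = 2, 16, 64
    f = torch.randn(batch, seq_len, dim)   # reference signal (input embedding)
    x = f.clone()
    err_integral = torch.zeros_like(x)
    err_prev = torch.zeros_like(x)

    layers = nn.ModuleList(PIDFeedbackAttention(dim) for _ in range(4))
    for layer in layers:
        x, err_integral, err_prev = layer(x, f, err_integral, err_prev)
    print(x.shape)  # torch.Size([2, 16, 64])
```

In this sketch the proportional term pulls each layer's output back toward the reference (preserving high-frequency input detail against over-smoothing), while the integral and derivative terms accumulate and damp the error across layers, mirroring the stability role PID control plays in the state-space view above.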
Authors: Tam Nguyen, César A. Uribe, Tan M. Nguyen, Richard G. Baraniuk