CST: Calibration Side-Tuning for Parameter and Memory Efficient Transfer Learning (2402.12736v1)
Abstract: Achieving universally high accuracy in object detection is challenging, and the mainstream focus in industry currently lies in detecting specific classes of objects. However, deploying one or more object detection networks requires a certain amount of GPU memory for training and storage capacity for inference. This makes it difficult to coordinate multiple object detection tasks effectively under resource-constrained conditions. This paper introduces a lightweight fine-tuning strategy called Calibration side tuning, which integrates aspects of adapter tuning and side tuning to adapt successful techniques employed in transformers for use with ResNet. The Calibration side tuning architecture incorporates maximal transition calibration, using a small number of additional parameters to enhance network performance while maintaining a smooth training process. Furthermore, this paper analyzes multiple fine-tuning strategies and implements them within ResNet, thereby expanding the research on fine-tuning strategies for object detection networks. In addition, extensive experiments were carried out on five benchmark datasets. The experimental results demonstrate that this method outperforms the compared state-of-the-art techniques and achieves a better balance between the complexity and performance of the fine-tuning schemes.