PAD: Self-Supervised Pre-Training with Patchwise-Scale Adapter for Infrared Images (2312.08192v1)
Abstract: Self-supervised learning (SSL) for RGB images has achieved significant success, yet there is still limited research on SSL for infrared images, primarily due to three prominent challenges: 1) the lack of a suitable large-scale infrared pre-training dataset, 2) the distinctiveness of non-iconic infrared images rendering common pre-training tasks like masked image modeling (MIM) less effective, and 3) the scarcity of fine-grained textures making it particularly challenging to learn general image features. To address these issues, we construct a Multi-Scene Infrared Pre-training (MSIP) dataset comprising 178,756 images, and introduce object-sensitive random RoI cropping, an image preprocessing method, to tackle the challenge posed by non-iconic images. To alleviate the impact of weak textures on feature learning, we propose a pre-training paradigm called Pre-training with ADapter (PAD), which uses adapters to learn domain-specific features while freezing parameters pre-trained on ImageNet to retain the general feature extraction capability. This new paradigm is applicable to any transformer-based SSL method. Furthermore, to achieve more flexible coordination between pre-trained and newly-learned features in different layers and patches, a patchwise-scale adapter with dynamically learnable scale factors is introduced. Extensive experiments on three downstream tasks show that PAD, with only 1.23M pre-trainable parameters, outperforms other baseline paradigms including continual full pre-training on MSIP. Our code and dataset are available at https://github.com/casiatao/PAD.
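The core idea of PAD — freezing ImageNet-pretrained transformer weights and training only lightweight adapters whose residual contribution is scaled per patch by dynamically learnable factors — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bottleneck width, the sigmoid gating, and the `scale_head` used to predict per-patch factors are assumptions for the sketch, as the abstract does not specify these details.

```python
import torch
import torch.nn as nn

class PatchwiseScaleAdapter(nn.Module):
    """Bottleneck adapter whose residual output is weighted by a
    dynamically predicted, per-patch scale factor (illustrative sketch;
    layer names and the gating function are hypothetical)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)      # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)        # up-projection
        # Predicts one scale factor per patch token from its features,
        # so pre-trained and newly learned features are mixed per patch.
        self.scale_head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) patch tokens from a frozen transformer block
        delta = self.up(self.act(self.down(x)))     # domain-specific features
        scale = torch.sigmoid(self.scale_head(x))   # (B, N, 1) per-patch scale
        return x + scale * delta                    # frozen path + scaled adapter

# Freeze the pre-trained block; only the adapter's ~small parameter set trains.
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in block.parameters():
    p.requires_grad = False                         # retain general ImageNet features

adapter = PatchwiseScaleAdapter(768)                # only these weights are trainable
tokens = torch.randn(2, 196, 768)                   # 14x14 patches, ViT-B token dim
out = adapter(block(tokens))
print(out.shape)                                    # (2, 196, 768)
```

Because the backbone is frozen, this paradigm plugs into any transformer-based SSL method: the SSL loss backpropagates only into the adapter (and, per the paper, into the per-patch scale factors), which is how PAD keeps the pre-trainable parameter count small.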