PAD: Self-Supervised Pre-Training with Patchwise-Scale Adapter for Infrared Images (2312.08192v1)

Published 13 Dec 2023 in cs.CV

Abstract: Self-supervised learning (SSL) for RGB images has achieved significant success, yet there is still limited research on SSL for infrared images, primarily due to three prominent challenges: 1) the lack of a suitable large-scale infrared pre-training dataset, 2) the distinctiveness of non-iconic infrared images rendering common pre-training tasks like masked image modeling (MIM) less effective, and 3) the scarcity of fine-grained textures making it particularly challenging to learn general image features. To address these issues, we construct a Multi-Scene Infrared Pre-training (MSIP) dataset comprising 178,756 images, and introduce object-sensitive random RoI cropping, an image preprocessing method, to tackle the challenge posed by non-iconic images. To alleviate the impact of weak textures on feature learning, we propose a pre-training paradigm called Pre-training with ADapter (PAD), which uses adapters to learn domain-specific features while freezing parameters pre-trained on ImageNet to retain the general feature extraction capability. This new paradigm is applicable to any transformer-based SSL method. Furthermore, to achieve more flexible coordination between pre-trained and newly-learned features in different layers and patches, a patchwise-scale adapter with dynamically learnable scale factors is introduced. Extensive experiments on three downstream tasks show that PAD, with only 1.23M pre-trainable parameters, outperforms other baseline paradigms including continual full pre-training on MSIP. Our code and dataset are available at https://github.com/casiatao/PAD.
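The patchwise-scale adapter described in the abstract lends itself to a short illustration. The sketch below is a minimal PyTorch interpretation, not the authors' released implementation (that is available in the linked repository): a small bottleneck branch is attached to a frozen, ImageNet-pretrained transformer block, and its residual output is rescaled by a factor predicted separately for each patch, so that pre-trained and newly-learned features can be weighted differently across layers and patches. The module names, bottleneck width, and sigmoid scale head are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PatchwiseScaleAdapter(nn.Module):
    """Lightweight bottleneck adapter whose residual is rescaled per patch (sketch)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project tokens to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back to the model width
        # one dynamically learned scale factor per patch (token)
        self.scale = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) token features from a frozen backbone block
        delta = self.up(self.act(self.down(x)))  # newly learned, domain-specific features
        s = self.scale(x)                        # (batch, num_patches, 1) patchwise scales
        return x + s * delta                     # blend frozen and adapter features


# Usage sketch: freeze an ImageNet-pretrained block and train only the adapter.
if __name__ == "__main__":
    backbone_block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for p in backbone_block.parameters():
        p.requires_grad = False                  # retain general feature extraction
    adapter = PatchwiseScaleAdapter(dim=768)
    tokens = torch.randn(2, 196, 768)            # e.g. 14x14 patches of a ViT-B input
    out = adapter(backbone_block(tokens))
    print(out.shape)                             # torch.Size([2, 196, 768])
```

In a setup like this only the adapter parameters receive gradients, which mirrors the abstract's point that PAD pre-trains a small fraction of the model (1.23M parameters) while the frozen ImageNet-pretrained weights preserve general feature-extraction capability.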
