Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery (2403.03790v2)
Abstract: Ship detection aims to locate ships in remote sensing (RS) scenes. Because of the diversity of imaging payloads, the varied appearance of ships, and complex background interference in bird's-eye-view imagery, it is difficult to establish a unified paradigm for multi-source ship detection. To address this challenge, this article proposes Popeye, a unified visual-language model that leverages the strong generalization ability of large language models (LLMs) for multi-source ship detection from RS imagery. Specifically, to bridge the interpretation gap between multi-source images, a novel unified labeling paradigm is designed that integrates different visual modalities and the two common ship detection formats, i.e., the horizontal bounding box (HBB) and the oriented bounding box (OBB). A hybrid-experts encoder then refines multi-scale visual features to enhance visual perception, and a visual-language alignment method is developed to improve interactive comprehension between visual and language content. Furthermore, an instruction adaptation mechanism transfers pre-trained visual-language knowledge from natural scenes to the RS domain for multi-source ship detection. In addition, the Segment Anything Model (SAM) is seamlessly integrated into Popeye to achieve pixel-level ship segmentation without additional training cost. Finally, extensive experiments on the newly constructed ship instruction dataset, MMShip, show that Popeye outperforms current specialist, open-vocabulary, and other visual-language models for zero-shot multi-source ship detection.
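To make the unified labeling paradigm concrete, the minimal sketch below shows one plausible way to serialize both HBB and OBB ship annotations, from optical or SAR imagery, into a single instruction-following text format that a visual-language model could be trained to emit. The field names, prompt wording, coordinate normalization, and tag syntax (`<hbb>`/`<obb>`) are illustrative assumptions, not the exact specification used in Popeye.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ShipBox:
    coords: List[float]            # HBB corners: (x_min, y_min, x_max, y_max)
    angle: Optional[float] = None  # None -> HBB; an angle in degrees -> OBB


def to_instruction_pair(image_id: str, modality: str, boxes: List[ShipBox],
                        img_w: int, img_h: int) -> dict:
    """Build one (question, answer) training sample from raw ship annotations."""
    question = f"Detect all ships in this {modality} remote sensing image."
    answers = []
    for b in boxes:
        # Normalize coordinates to [0, 1] so optical and SAR images of any size share one scale.
        x1, y1, x2, y2 = b.coords
        norm = [round(x1 / img_w, 3), round(y1 / img_h, 3),
                round(x2 / img_w, 3), round(y2 / img_h, 3)]
        if b.angle is None:
            answers.append(f"ship <hbb>{norm}</hbb>")
        else:
            answers.append(f"ship <obb>{norm + [round(b.angle, 1)]}</obb>")
    return {"image": image_id, "question": question, "answer": " ; ".join(answers)}


# Example: an HBB detection and an OBB detection serialized with the same paradigm.
print(to_instruction_pair("sar_0001", "SAR",
                          [ShipBox([120, 40, 180, 90]),
                           ShipBox([300, 200, 420, 260], angle=35.0)],
                          img_w=512, img_h=512))
```

Under this kind of scheme, HBB- and OBB-annotated samples from different sensors reduce to the same question-answer text form, which is what allows a single instruction-tuned model to handle both detection formats.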